How to integrate Amazon S3 with the Denodo Platform

Applies to: Denodo 7.0 , Denodo 6.0 , Denodo 5.5
Last modified on: 05 Jul 2018
Tags: Amazon S3 XML data sources DF data sources

Download document

Introduction

Amazon Simple Storage Service (Amazon S3), is a service that provides to developers and IT teams a secure, durable and highly-scalable object storage.

Amazon S3 stores data as objects within resources called "buckets”. It is possible to store as many objects as needed within a bucket. As well as write, read, and delete objects in a bucket.

There are several ways to integrate this service. The AWS Management Console provides a web-based interface for accessing and managing all available Amazon S3 resources. Programmatic access to Amazon S3 is enabled by the AWS SDKs and the AWS SDK Mobile. In addition, Amazon S3 provides a full REST API which will be the studied feature in this document.

In addition, HDFS Custom Wrappers available at the Support Site also support S3 routes, so they can be used to access directly data in S3.

How to connect to AWS S3 from Denodo?

Summary of options

  1. (Recommended) HDFS Custom Wrappers available at the Denodo Support Site are able to access delimited files, as well as non-standard files (Avro, Map files, Sequence files) stored in HDFS and AWS S3. This option allows to access delimited files (CSV), Avro, Map files, Sequence files and Parquet files. Any other file format stored in S3, such as JSON or XML can’t be accessed through this custom wrappers.

  1. Through Amazon S3 REST API. This option allows to access file types not supported by the HDFS Custom Wrappers, such as JSON or XML. On the other hand, this method is more manual as Amazon S3 REST API has a proprietary authentication schema that must be provided in a Authentication Header

Authorization: AWS AWSAccessKeyId:Signature

This document is describes the way to access file types not supported by the HDFS Custom Wrappers, such as JSON or XML, through the Amazon S3 REST API.

Amazon S3 REST API

In this section, we will introduce the Amazon S3 REST API and describe some of its most useful features.

Requests to Amazon S3 can be authenticated or anonymous. In case authenticated requests are used, credentials will be required so AWS can authenticate the requests. When making a REST API call a signature has to be created using the user credentials and then included in the request. Being correctly authenticated is not enough to access to an Amazon S3 resource, the user still needs to have enough permissions on the resource to access.

In the following sections we will describe some of the main requirements for authenticated request and two of the most important authenticated Amazon S3 REST API methods: “Get Bucket” and “Get Object”.

For more information about the REST API, see the API Reference for Amazon S3.

Authenticated requests requirements.

Authentication requirements.

Authenticated requests in Amazon S3 need to send authentication information in order to be performed. This section introduces request authentication with the “AWS Signature version 4” algorithm.

The authentication information sent in a request must include a “signature” that is calculated following a detailed process. In “AWS Signature version 4”, the basic signing requests process is composed of the following steps:

  1. Obtain a “string to sign”: it is necessary to compose a string concatenating a subset of elements from the request to be signed in the last step.
  2. Obtain a “signing key”: the “secret access key” provided by AWS is used to calculate the “signing key” that will be used to sign the string obtained in the previous step.
  3. Obtain the “signature”: it is the result of the signing process that signs the string obtained in the step a) using the key calculated in the step b).

The authentication method used in the examples, that will be described in this document, is the “Authorization Header” method which consists in adding an Authorization header to the request containing the calculated  “signature”. The Authentication header has the following form:

Authorization: AWS AWSAccessKeyId:Signature

The process in charge of getting the Authorization header is defined as follows:

  1. Authorization = “AWS” + “ “ + AWSAccessKeyId + “:” + Signature;
  2. Signature = Base64( HMAC-SHA1( YourSecretAccessKeyId, UTF-8-Encoding-Of (StringToSign)));
  3. StringToSign = HTTP-Verb + “\n” +

Content-MD5 + “\n” +

Content-Type + “\n” +

Date + “\n” +

CanonicalizedAmzHeaders +

CanonicalizedResource;

  1. CanonicalizedResource = [ “/” + Bucket ] +

<HTTP-Request-URI, from the protocol name up to the query string> +

[ subresource, if present. For example “?acl”, “?location”, “?logging”, or “?torrent”];

  1. CanonicalAmzHeaders = canonicalization process executed on all the HTTP request headers that start with “x-amz-”.

To know more about authenticating requests in Amazon AWS and the “Authentication Header” method take a look at the following articles:

Timestamp requirement.

A valid timestamp (using either the HTTP Dateheader or the x-amz-date alternative) is mandatory for authenticated requests. The timestamp must be specified in the GMT time zone and following the format: “EEE, d MMM yyyy HH:mm:ss GMT”. 

Furthermore, the client timestamp included with an authenticated request must be within 15 minutes of the Amazon S3 system time when the request is received. If not, the request will fail with the RequestTimeTooSkewed error code. The intention of these restrictions is to limit the possibility that intercepted requests could be replayed by an adversary. For stronger protection against eavesdropping, use the HTTPS transport for authenticated requests.

When an “x-amz-date” header is present in a request, the system will ignore any “Date” header when computing the request signature. Therefore, if you include the “x-amz-date” header, use the empty string for the “Date” when constructing the “StringToSign”.

Authenticated operations.

Operations on buckets.

GET Bucket (List Objects)

This operation returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.

  • Request

https://BucketName.s3.amazonaws.com/

  • Syntax

GET / HTTP/1.1

        Host: BucketName.s3.amazonaws.com

        Date: date

        Authorization: #authorization string#

  • Request Parameters
  • prefix: Limits the response to keys (file names) that begin with the specified prefix. The prefixes can be used to separate a bucket into different groupings of keys. A prefix can be used to make groups in the same way as a folder in a file system. This parameter is optional, if a prefix is not specified all the objects in the bucket will be returned.
  • marker: Specifies the key to start with when listing objects in a bucket. Amazon S3 returns object keys in alphabetical order, starting with the key after the marker. This parameter is optional.
  • max-keys: Sets the maximum number of keys returned in the response body. This parameter is optional. This parameter can be added to the request to retrieve fewer than the default 1000 keys. The response might contain fewer keys but will never contain more. If there are additional keys that satisfy the search criteria but were not returned because max-keys was exceeded, the response will contain <IsTruncated>true</IsTruncated>. To return the additional keys, the marker parameter can be used.

  • Sample Request Using Request Parameters

https://quotes.s3.amazonaws.com/?prefix=N&marker=Ned&max-keys=40

This request returns up to 40 keys in the "quotes" bucket that start with "N" and occur lexicographically after "Ned".

GET /?prefix=N&marker=Ned&max-keys=40 HTTP/1.1

Host: quotes.s3.amazonaws.com

Date: Wed, 01 Mar  2006 12:00:00 GMT

Authorization: authorization string

  • Sample Response

HTTP/1.1 200 OK
x-amz-id-2: gyB+3jRPnrkN98ZajxHXr3u7EFM67bNgSAxexeEHndCX/7GRnfTXxReKUQF28IfP
x-amz-request-id: 3B3C7C725673C630
Date: Wed, 01 Mar  2006 12:00:00 GMT
Content-Type: application/xml
Content-Length: 302
Connection: close
Server: AmazonS3

<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
 <Name>quotes</Name>
 <Prefix>N</Prefix>
 <Marker>Ned</Marker>
 <MaxKeys>40</MaxKeys>
 <IsTruncated>false</IsTruncated>
 <Contents>
   <Key>Nelson</Key>
   <LastModified>2006-01-01T12:00:00.000Z</LastModified>
   <ETag>&quot;828ef3fdfa96f00ad9f27c383fc9ac7f&quot;</ETag>
   <Size>5</Size>
   <StorageClass>STANDARD</StorageClass>
   <Owner>
     <ID>bcaf161ca5fb16fd081034f</ID>
     <DisplayName>webfile</DisplayName>
    </Owner>
 </Contents>
 <Contents>
   <Key>Neo</Key>
   <LastModified>2006-01-01T12:00:00.000Z</LastModified>
   <ETag>&quot;828ef3fdfa96f00ad9f27c383fc9ac7f&quot;</ETag>
   <Size>4</Size>
   <StorageClass>STANDARD</StorageClass>
    <Owner>
     <ID>bcaf1ffd86a5fb16fd081034f</ID>
     <DisplayName>webfile</DisplayName>
   </Owner>
</Contents>
</ListBucketResult>

Operations on objects.

GET Object

This implementation of the GET operation retrieves objects from Amazon S3. 

An Amazon S3 bucket does not have a directory hierarchy like the one in a typical file system. However, it is possible to create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, it is possible to  name it as photos/2006/February/sample.jpg.

To get an object from such a logical hierarchy, specify the full key name for the object in the GET operation. For a virtual hosted-style request example, if the object is named  photos/2006/February/sample.jpg, the resource is referenced as /photos/2006/February/sample.jpg:

https://bucket.s3.amazonaws.com/photos/2006/February/sample.jpg

  • Request

https://BucketName.s3.amazonaws.com/ObjectName

  • Syntax

GET /ObjectName HTTP/1.1

                Host: BucketName.s3.amazonaws.com

                Date: date

                Authorization: #authorization string#

  • Sample Request

https://bucket.s3.amazonaws.com/my-image.jpg

The following request returns the object, my-image.jpg.

GET /my-image.jpg HTTP/1.1

Host: bucket.s3.amazonaws.com

Date: Wed, 28 Oct  2009 22:32:00 GMT

Authorization: authorization string

  • Sample Response

HTTP/1.1 200 OK
x-amz-id-2: eftixk72aD6Ap51TnqcoF8eFidJG9Z/2mkiDFu8yU9AS1ed4OpIszj7UDNEHGran
x-amz-request-id: 318BC8BC148832E5
Date: Wed, 28 Oct 2009 22:32:00 GMT
Last-Modified: Wed, 12 Oct 2009 17:50:00 GMT
ETag: "fba9dede5f27731c9771645a39863328"
Content-Length: 434234
Content-Type: text/plain
Connection: close
Server: AmazonS3
[434234 bytes of object data]

Getting the content of an object

Creating a parameterized data source to get the content of any object

  1. Select the type of data source needed depending on the type of file want to recover from Amazon S3. In this example, a Delimited File data source will be used.

  1. Select the HTTP Client option as the Data Route parameter.

  1. Configure the “HTTP Client” data route to access an Amazon S3 stored object.
  1. HTTP Method: GET.
  2. URL:

https://bucketname.s3.amazonaws.com/bucketobject

In this example, we are going to parameterize the “bucketname” and “bucketobject” of the http request with the param_host and param_filepath input parameters, which will allow to recover any object from any bucket.

https://@{param_host}@{param_filepath}

  1. Uncheck the “Test connection” checkbox.

This allows to create the data source without testing the connection and without specifying a value for the input parameters.

We will specify these parameters when creating base views on the data source only.

  1. Configure the “HTTP Headers”.

We need to create two headers, Authorization and          (or “date”), to recover objects from a bucket. These headers have to be parameterized as well.

Click OK, to confirm the HTTP Client configuration.

  1. At this point if you press again the OK button, it will be shown the configuration main window of the data source. Now, depending on the file to be accessed, the appropriate data source configuration has to be entered.

  1. Finally, after clicking on the “Save” button the data source is created showing the configuration set in the previous steps.

Creating a base view from the data source

  1. Click on Create base view button.

At this moment, you have to set the values for the mandatory input parameters of the delimited data source configured in the previous step:

  • param_host: url of the host that contains the object to recover. It is composed of “bucket_name” + “.s3.amazonaws.com”.
  • param_filepath: name of the object to recover. It should start with the “/” character.
  • param_date: date when the object is requested. It will be used to send as the “x-amz-date” header and as an input parameter to calculate the “param_auth” parameter.auth
  • param_auth: value of the “Authorization” header that will be sent to the REST service. It contains the authorization signed string used by the REST service to authenticate and authorize the corresponding http request.

To be able to set the param_auth value the process that calculates the AWS Authorization Request Header has to be implemented. To know more about this header you can take a look to the following article: Authentication Header. An easy way to implement this authentication mechanism is to use a VDP custom function that could be invoked any time the view is queried to create the Authorization header that is needed for the authenticated requests.

Note: Remember that there is a limit of 15 minutes between the query’s authentication parameters are set up and the time that the request arrives to the Amazon S3 service.

  1. After that, the settings window of the new base view will be shown. Change the name of the view and modify the fields as needed and click on the OK button.

Creating a view to obtain the authentication parameters

After implementing the process that calculates the AWS Authorization Request Header using a custom function, for example with the “aws_auth” name, a view can be created to obtain the Authorization header running the following query on the VQL Shell:

CREATE OR REPLACE VIEW <view_name> AS SELECT

formatdate('EEE, d MMM yyyy HH:mm:ss z', now()) AS date_gmt,

aws_auth(param_access_key, param_secret_key, param_host, param_filepath, formatdate('EEE, d MMM yyyy HH:mm:ss z', now())) AS aws_auth,

param_access_key, param_secret_key, param_host, param_filepath

FROM dual()

USING PARAMETERS ( param_access_key : text,

param_secret_key : text,

param_host : text,

param_filepath : text)

CONTEXT ('I18N' = 'rfc_restful')

The following input parameters to the view will be needed for the view to be used as arguments of the custom function:

  • access_key: access key provided by Amazon S3.
  • secret_key: secret key provided by Amazon S3.
  • host: host where the file/object to retrieve is located.
  • filepath: filepath of the object to recover from the previously specified host.

The output of the view will consist of:

  • aws_auth: the result of the process that calculates the Authorization header. The syntax of this output will be “AWS AWSAccessKeyId:Signature”.
  • date_gmt: date used to calculate the authorization header, in GMT time zone.

This output will be used to provide the values for the the “x-amz-date” and “Authorization” headers of the data sources respectively.

Creating a final view to get the content of an object

  1. Create a derived view joining the two previously created views: dv_aws_auth”, that gets the authorization data needed for the authenticated request, and delimited_file_bvthat gets the actual object from Amazon S3.

  1. Set the following join conditions:

  • dv_aws_auth.date_gmt = delimited_file_bv.param_date
  • dv_aws_auth.aws_auth = delimited_file_bv.param_auth
  • dv_aws_auth.param_host = delimited_file_bv.param_host
  • dv_aws_auth.param_filepath = delimited_file_bv.param_filepath

  1. Set the name of the view and remove from the output the following redundant fields:

  • delimited_file_bv.param_host
  • delimited_file_bv.param_filepath

Obtaining an object from a bucket

In the Admin Tool, set the “Tools” > “Admin Tools preferences” > “Locale” > “Internationalize query results” parameter to true.

Run the following query from the VQL Shell on the “delimited_file_dv” view providing values for the input parameters as where conditions, for instance:

SELECT * FROM delimited_file_dv

WHERE param_host = 'denodotest.s3.amazonaws.com'

and param_filepath = '/textfile/wordcount_output'

and param_access_key = #ACCESS_KEY#

and param_secret_key = #SECRET_KEY#

CONTEXT ('i18n'='rfc_restful')

Note: the i18n context parameter is set to ‘rfc_restful’ because the timestamp provided to the Amazon authentication needs to be in the GMT timezone, the rcf_restful internationalization map is a predefined map with this timezone, any other map with this timezone can be used.

Getting the list of object names in a bucket

Creating a parameterized data source to get the list of object names with a prefix

  1. Select the XML data source type to create a data source that recovers the list of filenames stored in a bucket.

  1. Select the HTTP Client option to set the Data Route parameter.

  1. Configure the “HTTP Client” data route to access any Amazon S3 bucket.
  1. HTTP Method: GET.
  2. URL:

https://bucketname.s3.amazonaws.com/?prefix=textfile/wordcount

In this case, to build the URL, the host and prefix will be parameters to recover the list of file names stored in any bucket.

https://@{param_host}/?prefix=@{param_prefix}

  1. Configure the “HTTP Headers”.

As with any authenticated request the headers “Authorization” and “x-amz-date” are needed and will be parameterized as well.

Click OK to confirm the HTTP Client configuration.

Creating a base view from the data source

  1. Click on Create base view button.

To create the base view,  set the values for the mandatory input parameters of the delimited data source configured in the previous step:

  • param_host: url of the host to send the GET object names request. It is composed of “bucket_name” + “.s3.amazonaws.com”.
  • param_prefix: prefix used to filter the object names to recover. Limits the response to keys (file names) that begin with the specified prefix.
  • param_date: date when the request is sent. It will be used to send as the “x-amz-date” header and as an input parameter to calculate the “param_auth” parameter.
  • param_auth: value of the Authorization header that will be sent to the REST service. It contains the authorization signed string used by the REST service to authenticate and authorize the corresponding http request.

The same custom function described before can be used to set the param_auth value.

If you want to obtain the values for param_authand param_date when you are trying to get, for instance, the content for the following request https://denodotest.s3.amazonaws.com/?prefix=textfile/wordcount , you have to set the params of the query with these values:

  • param_host: ’denodotest.s3.amazonaws.com’
  • param_filepath: ’/’

And then use the generated param_authand param_dateto fill the Create base view form.

  1. After providing the values the following window is displayed.

Select the Stream output at specified level option and the Contents array field in order to flatten the output of the view by the selected field and click OK to create the base view.

After creating this base view, the same steps indicated for the GET Object method can be used to create a final view: creating a view with the custom function to construct the authorization header and a join view between this view and the base view to query the data directly.

Creating a final view to list the object names in a bucket

  1. Create a derived view joining the two previously created views: dv_aws_auth”, that gets the authorization data needed for the authenticated request, and get_bucket_list_dvthat lists the object names in a bucket.

        

  1. Set the following conditions:

  • dv_aws_auth.date_gmt = get_bucket_list_dv.param_date
  • dv_aws_auth.aws_auth = get_bucket_list_dv.param_auth
  • dv_aws_auth.param_host = get_bucket_list_dv.param_host

  1. Set the name of the view and remove from the output the following redundant fields:

  • delimited_file_bv.param_host

Obtaining the list of object names in a bucket

In the Admin Tool, set the “Tools” > “Admin Tools preferences” > “Locale” > “Internationalize query results” parameter to true.

Run the following query from the VQL Shell on the “bucket_list_dv” view providing values for the input parameters as where conditions, for instance:

SELECT * FROM bucket_list_dv

WHERE param_host = 'denodotest.s3.amazonaws.com'

and param_filepath = '/textfile/wordcount_output'

and param_filepath = '/'

and param_access_key = #ACCESS_KEY#

and param_secret_key = #SECRET_KEY#

CONTEXT ('i18n'='rfc_restful')

Note: the i18n context parameter is set to ‘rfc_restful’ because the timestamp provided to the Amazon authentication needs to be in the GMT timezone, the rcf_restful internationalization map is a predefined map with this timezone, any other map with this timezone can be used.

Getting the content of the objects listed in a bucket

Creating a final view to get the content of the objects listed in a bucket

  1. Create a derived view joining the two previously created views: bucket_list_dv”, that lists the object names in a bucket, and delimited_file_dvthat gets the actual object from Amazon S3.

  1. Set the following conditions:

  • bucket_list_dv.param_access_key = delimited_file_dv.param_access_key
  • bucket_list_dv.param_secret_key = delimited_file_dv.param_secret_key
  • bucket_list_dv.param_host = delimited_file_dv.param_host
  • ('/'||bucket_list_dv.key) = delimited_file_dv.param_filepath

Note: The last condition adds a “/” to the bucket_list_dv.keysince the corresponding field delimited_file_dv.param_filepathmust start with the “/” character.

  1. Set the name of the view and remove from the output the following redundant fields:

  • delimited_file_bv.date_gmt
  • delimited_file_bv.aws_auth
  • delimited_file_bv.param_access_key
  • delimited_file_bv.param_secret_key
  • delimited_file_bv.param_host
  • delimited_file_bv.param_filepath
  • delimited_file_bv.param_auth
  • delimited_file_bv.param_date

Obtaining the content of the objects listed in a bucket

In the Admin Tool, set the “Tools” > “Admin Tools preferences” > “Locale” > “Internationalize query results” parameter to true.

Run the following query from the VQL Shell on the “delimited_files_in_bucket” view providing values for the input parameters as where conditions, for instance:

SELECT * FROM delimited_files_in_bucket

WHERE param_host = 'denodotest.s3.amazonaws.com'

and param_filepath = '/'

and param_prefix= 'textfile/wordcount'

and param_access_key = #ACCESS_KEY#

and param_secret_key = #SECRET_KEY#

CONTEXT ('i18n'='rfc_restful')

Note: the i18n context parameter is set to ‘rfc_restful’ because the timestamp provided to the Amazon authentication needs to be in the GMT timezone, the rcf_restful internationalization map is a predefined map with this timezone, any other map with this timezone can be used.

Questions

Ask a question
You must sign in to ask a question. If you do not have an account, you can register here

Featured content

DENODO TRAINING

Ready for more? Great! We offer a comprehensive set of training courses, taught by our technical instructors in small, private groups for getting a full, in-depth guided training in the usage of the Denodo Platform. Check out our training courses.

Training