Applies to:
Denodo 7.0
Last modified on: 11 Jul 2019
Tags:
XML data sources
DF data sources
Amazon S3
Amazon Simple Storage Service (Amazon S3), is a service that provides to developers and IT teams a secure, durable and highly-scalable object storage.
Amazon S3 stores data as objects within resources called "buckets”. It is possible to store as many objects as needed within a bucket. As well as write, read, and delete objects in a bucket.
There are several ways to integrate this service. The AWS Management Console provides a web-based interface for accessing and managing all available Amazon S3 resources. Programmatic access to Amazon S3 is enabled by the AWS SDKs and the AWS SDK Mobile. In addition, Amazon S3 provides a full REST API which will be the studied feature in this document.
In addition, HDFS Custom Wrappers available at the Support Site also support S3 routes, so they can be used to access directly data in S3. This is the recommended option to access Amanzon S3, for more information read How to integrate Amazon S3 with Denodo Distributed File System Custom Wrapper.
Authorization: AWS AWSAccessKeyId:Signature
This document describes the way to access file types not supported by the HDFS Custom Wrappers, such as JSON or XML, through the Amazon S3 REST API.
In this section, we will introduce the Amazon S3 REST API and describe some of its most useful features.
Requests to Amazon S3 can be authenticated or anonymous. In case authenticated requests are used, credentials will be required so AWS can authenticate the requests. When making a REST API call a signature has to be created using the user credentials and then included in the request. Being correctly authenticated is not enough to access to an Amazon S3 resource, the user still needs to have enough permissions on the resource to access.
In the following sections we will describe some of the main requirements for authenticated request and two of the most important authenticated Amazon S3 REST API methods: “Get Bucket” and “Get Object”.
For more information about the REST API, see the API Reference for Amazon S3.
Authenticated requests in Amazon S3 need to send authentication information in order to be performed. This section introduces request authentication with the “AWS Signature version 4” algorithm.
The authentication information sent in a request must include a “signature” that is calculated following a detailed process. In “AWS Signature version 4”, the basic signing requests process is composed of the following steps:
The authentication method used in the examples, that will be described in this document, is the “Authorization Header” method which consists in adding an Authorization header to the request containing the calculated “signature”. The Authentication header has the following form:
Authorization: AWS AWSAccessKeyId:Signature
The process in charge of getting the Authorization header is defined as follows:
Content-MD5 + “\n” +
Content-Type + “\n” +
Date + “\n” +
CanonicalizedAmzHeaders +
CanonicalizedResource;
<HTTP-Request-URI, from the protocol name up to the query string> +
[ subresource, if present. For example “?acl”, “?location”, “?logging”, or “?torrent”];
To know more about authenticating requests in Amazon AWS and the “Authentication Header” method take a look at the following articles:
A valid timestamp (using either the HTTP “Date” header or the “x-amz-date” alternative) is mandatory for authenticated requests. The timestamp must be specified in the GMT time zone and following the format: “EEE, d MMM yyyy HH:mm:ss GMT”.
Furthermore, the client timestamp included with an authenticated request must be within 15 minutes of the Amazon S3 system time when the request is received. If not, the request will fail with the “RequestTimeTooSkewed” error code. The intention of these restrictions is to limit the possibility that intercepted requests could be replayed by an adversary. For stronger protection against eavesdropping, use the HTTPS transport for authenticated requests.
When an “x-amz-date” header is present in a request, the system will ignore any “Date” header when computing the request signature. Therefore, if you include the “x-amz-date” header, use the empty string for the “Date” when constructing the “StringToSign”.
This operation returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.
https://BucketName.s3.amazonaws.com/
GET / HTTP/1.1
Host: BucketName.s3.amazonaws.com
Date: date
Authorization: #authorization string#
https://quotes.s3.amazonaws.com/?prefix=N&marker=Ned&max-keys=40
This request returns up to 40 keys in the "quotes" bucket that start with "N" and occur lexicographically after "Ned".
GET /?prefix=N&marker=Ned&max-keys=40 HTTP/1.1
Host: quotes.s3.amazonaws.com
Date: Wed, 01 Mar 2006 12:00:00 GMT
Authorization: authorization string
HTTP/1.1 200 OK
x-amz-id-2: gyB+3jRPnrkN98ZajxHXr3u7EFM67bNgSAxexeEHndCX/7GRnfTXxReKUQF28IfP
x-amz-request-id: 3B3C7C725673C630
Date: Wed, 01 Mar 2006 12:00:00 GMT
Content-Type: application/xml
Content-Length: 302
Connection: close
Server: AmazonS3
<?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<Name>quotes</Name>
<Prefix>N</Prefix>
<Marker>Ned</Marker>
<MaxKeys>40</MaxKeys>
<IsTruncated>false</IsTruncated>
<Contents>
<Key>Nelson</Key>
<LastModified>2006-01-01T12:00:00.000Z</LastModified>
<ETag>"828ef3fdfa96f00ad9f27c383fc9ac7f"</ETag>
<Size>5</Size>
<StorageClass>STANDARD</StorageClass>
<Owner>
<ID>bcaf161ca5fb16fd081034f</ID>
<DisplayName>webfile</DisplayName>
</Owner>
</Contents>
<Contents>
<Key>Neo</Key>
<LastModified>2006-01-01T12:00:00.000Z</LastModified>
<ETag>"828ef3fdfa96f00ad9f27c383fc9ac7f"</ETag>
<Size>4</Size>
<StorageClass>STANDARD</StorageClass>
<Owner>
<ID>bcaf1ffd86a5fb16fd081034f</ID>
<DisplayName>webfile</DisplayName>
</Owner>
</Contents>
</ListBucketResult>
This implementation of the GET operation retrieves objects from Amazon S3.
An Amazon S3 bucket does not have a directory hierarchy like the one in a typical file system. However, it is possible to create a logical hierarchy by using object key names that imply a folder structure. For example, instead of naming an object sample.jpg, it is possible to name it as photos/2006/February/sample.jpg.
To get an object from such a logical hierarchy, specify the full key name for the object in the GET operation. For a virtual hosted-style request example, if the object is named photos/2006/February/sample.jpg, the resource is referenced as /photos/2006/February/sample.jpg:
https://bucket.s3.amazonaws.com/photos/2006/February/sample.jpg
https://BucketName.s3.amazonaws.com/ObjectName
GET /ObjectName HTTP/1.1
Host: BucketName.s3.amazonaws.com
Date: date
Authorization: #authorization string#
https://bucket.s3.amazonaws.com/my-image.jpg
The following request returns the object, my-image.jpg.
GET /my-image.jpg HTTP/1.1
Host: bucket.s3.amazonaws.com
Date: Wed, 28 Oct 2009 22:32:00 GMT
Authorization: authorization string
HTTP/1.1 200 OK
x-amz-id-2: eftixk72aD6Ap51TnqcoF8eFidJG9Z/2mkiDFu8yU9AS1ed4OpIszj7UDNEHGran
x-amz-request-id: 318BC8BC148832E5
Date: Wed, 28 Oct 2009 22:32:00 GMT
Last-Modified: Wed, 12 Oct 2009 17:50:00 GMT
ETag: "fba9dede5f27731c9771645a39863328"
Content-Length: 434234
Content-Type: text/plain
Connection: close
Server: AmazonS3
[434234 bytes of object data]
https://bucketname.s3.amazonaws.com/bucketobject
In this example, we are going to parameterize the “bucketname” and “bucketobject” of the http request with the “param_host” and “param_filepath” input parameters, which will allow to recover any object from any bucket.
https://@{param_host}@{param_filepath}
This allows to create the data source without testing the connection and without specifying a value for the input parameters.
We will specify these parameters when creating base views on the data source only.
We need to create two headers, “Authorization” and “ ” (or “date”), to recover objects from a bucket. These headers have to be parameterized as well.
Click “OK”, to confirm the “HTTP Client” configuration.
At this moment, you have to set the values for the mandatory input parameters of the delimited data source configured in the previous step:
To be able to set the param_auth value the process that calculates the AWS Authorization Request Header has to be implemented. To know more about this header you can take a look to the following article: Authentication Header. An easy way to implement this authentication mechanism is to use a VDP custom function that could be invoked any time the view is queried to create the Authorization header that is needed for the authenticated requests.
Note: Remember that there is a limit of 15 minutes between the query’s authentication parameters are set up and the time that the request arrives to the Amazon S3 service.
After implementing the process that calculates the AWS Authorization Request Header using a custom function, for example with the “aws_auth” name, a view can be created to obtain the Authorization header running the following query on the VQL Shell:
CREATE OR REPLACE VIEW <view_name> AS SELECT
formatdate('EEE, d MMM yyyy HH:mm:ss z', now()) AS date_gmt,
aws_auth(param_access_key, param_secret_key, param_host, param_filepath, formatdate('EEE, d MMM yyyy HH:mm:ss z', now())) AS aws_auth,
param_access_key, param_secret_key, param_host, param_filepath
FROM dual()
USING PARAMETERS ( param_access_key : text,
param_secret_key : text,
param_host : text,
param_filepath : text)
CONTEXT ('I18N' = 'rfc_restful')
The following input parameters to the view will be needed for the view to be used as arguments of the custom function:
The output of the view will consist of:
This output will be used to provide the values for the the “x-amz-date” and “Authorization” headers of the data sources respectively.
In the Admin Tool, set the “Tools” > “Admin Tools preferences” > “Locale” > “Internationalize query results” parameter to true.
Run the following query from the VQL Shell on the “delimited_file_dv” view providing values for the input parameters as where conditions, for instance:
SELECT * FROM delimited_file_dv
WHERE param_host = 'denodotest.s3.amazonaws.com'
and param_filepath = '/textfile/wordcount_output'
and param_access_key = #ACCESS_KEY#
and param_secret_key = #SECRET_KEY#
CONTEXT ('i18n'='rfc_restful')
Note: the i18n context parameter is set to ‘rfc_restful’ because the timestamp provided to the Amazon authentication needs to be in the GMT timezone, the rcf_restful internationalization map is a predefined map with this timezone, any other map with this timezone can be used.
https://bucketname.s3.amazonaws.com/?prefix=textfile/wordcount
In this case, to build the URL, the host and prefix will be parameters to recover the list of file names stored in any bucket.
https://@{param_host}/?prefix=@{param_prefix}
As with any authenticated request the headers “Authorization” and “x-amz-date” are needed and will be parameterized as well.
Click “OK” to confirm the “HTTP Client” configuration.
To create the base view, set the values for the mandatory input parameters of the delimited data source configured in the previous step:
The same custom function described before can be used to set the “param_auth” value.
If you want to obtain the values for “param_auth” and “param_date” when you are trying to get, for instance, the content for the following request https://denodotest.s3.amazonaws.com/?prefix=textfile/wordcount , you have to set the params of the query with these values:
And then use the generated “param_auth” and “param_date” to fill the “Create base view” form.
Select the “Stream output at specified level” option and the “Contents” array field in order to flatten the output of the view by the selected field and click “OK” to create the base view.
After creating this base view, the same steps indicated for the GET Object method can be used to create a final view: creating a view with the custom function to construct the authorization header and a join view between this view and the base view to query the data directly.
In the Admin Tool, set the “Tools” > “Admin Tools preferences” > “Locale” > “Internationalize query results” parameter to true.
Run the following query from the VQL Shell on the “bucket_list_dv” view providing values for the input parameters as where conditions, for instance:
SELECT * FROM bucket_list_dv
WHERE param_host = 'denodotest.s3.amazonaws.com'
and param_filepath = '/textfile/wordcount_output'
and param_filepath = '/'
and param_access_key = #ACCESS_KEY#
and param_secret_key = #SECRET_KEY#
CONTEXT ('i18n'='rfc_restful')
Note: the i18n context parameter is set to ‘rfc_restful’ because the timestamp provided to the Amazon authentication needs to be in the GMT timezone, the rcf_restful internationalization map is a predefined map with this timezone, any other map with this timezone can be used.
Note: The last condition adds a “/” to the “bucket_list_dv.key” since the corresponding field “delimited_file_dv.param_filepath” must start with the “/” character.
In the Admin Tool, set the “Tools” > “Admin Tools preferences” > “Locale” > “Internationalize query results” parameter to true.
Run the following query from the VQL Shell on the “delimited_files_in_bucket” view providing values for the input parameters as where conditions, for instance:
SELECT * FROM delimited_files_in_bucket
WHERE param_host = 'denodotest.s3.amazonaws.com'
and param_filepath = '/'
and param_prefix= 'textfile/wordcount'
and param_access_key = #ACCESS_KEY#
and param_secret_key = #SECRET_KEY#
CONTEXT ('i18n'='rfc_restful')
Note: the i18n context parameter is set to ‘rfc_restful’ because the timestamp provided to the Amazon authentication needs to be in the GMT timezone, the rcf_restful internationalization map is a predefined map with this timezone, any other map with this timezone can be used.
Denodo Distributed File System Custom Wrapper - User Manual