How to integrate Amazon S3 with Denodo Distributed File System Custom Wrapper

Applies to: Denodo 8.0, Denodo 7.0
Last modified on: 26 Aug 2020
Tags: Amazon S3, Cloud


Goal

This document describes how to connect to AWS S3 from Denodo Virtual DataPort using the Denodo Distributed File System Custom Wrapper.

Content

Amazon S3 stores data as objects within resources called "buckets". You can store as many objects as you want in a bucket, and write, read, and delete the objects it contains. Objects can be up to 5 terabytes in size.

The Denodo Distributed File System Custom Wrapper, available at the Denodo Support Site, provides access to delimited files (CSV) as well as non-standard files (Avro, Map files, Sequence files, and Parquet files) stored in HDFS and AWS S3. In the case of Parquet files, the wrapper is able to push down predicate evaluations and column projections in order to reduce the amount of data that has to be transferred to the Denodo server in scenarios where data is being filtered. Any other file format stored in S3, such as JSON or XML, cannot be accessed through these data source types.
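
For instance, assuming a hypothetical base view bv_sales_parquet, created with the Parquet wrapper described below and containing region and amount columns, a query such as the following lets the wrapper project only the two referenced columns and skip any Parquet row groups whose statistics rule out the filter value:

    SELECT region, amount
    FROM bv_sales_parquet
    WHERE region = 'EMEA';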

Connecting to an AWS S3 bucket from the Denodo Platform

Importing the custom wrapper into Virtual DataPort


The Denodo Distributed File System Custom Wrapper component is available to download for Denodo support users from the Denodo Connects section of the Denodo Support Site.


In order to use the Denodo Distributed File System Custom Wrapper in Virtual DataPort, import the extension using the Web Design Studio or the Virtual DataPort Administration Tool.

From the denodo-dfs-customwrapper distribution, select the denodo-(h)dfs-customwrapper-${version}-jar-with-dependencies.jar file and upload it to Virtual DataPort. Open the Denodo Web Design Studio or Virtual DataPort Administration Tool and:

  • Go to “File > Extension management” and create a new item selecting the jar file.

Creating a sample Custom Data Source to read delimited files stored in an AWS S3 bucket

  • Create a new data source by selecting “File > New > Data source > Custom”. This will open the wizard to create a connection to a data source with a custom wrapper.

  • Check ‘Select Jars’ and select the jar file of the custom wrapper. Specify the wrapper’s Class name as com.denodo.connect.dfs.wrapper.DFSDelimitedTextFileWrapper (com.denodo.connect.hadoop.hdfs.wrapper.HDFSDelimitedTextFileWrapper in Denodo 7).

 

  • Click on Refresh Input Parameters (Refresh icon if using the Virtual DataPort Administration Tool).
  • After refreshing the input parameters, new input parameters will be available. If no input parameters are displayed, update your custom wrapper distribution to the latest version.
  • File System URI: A URI whose scheme and authority identify the file system. For example: s3a://<bucket>

Optional parameters:

  • Custom core-site.xml file: configuration file that overrides the default core parameters (see the sample after this list).
  • Custom hdfs-site.xml file: configuration file that overrides the default HDFS parameters.
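
As an illustration, a minimal custom core-site.xml granting access to S3 could look like the sketch below. The property names are the standard Hadoop S3A ones (the same ones the Parquet wrapper parameters set, as described later in this document); the values are placeholders to replace with your own credentials and region endpoint:

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
      <!-- Placeholder S3A credentials: replace with your own values -->
      <property>
        <name>fs.s3a.access.key</name>
        <value>YOUR_ACCESS_KEY_ID</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>YOUR_SECRET_ACCESS_KEY</value>
      </property>
      <!-- Optional: endpoint of a specific region -->
      <property>
        <name>fs.s3a.endpoint</name>
        <value>s3.eu-west-1.amazonaws.com</value>
      </property>
    </configuration>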

Creating a Base View

  • Once the data source is created, create base views for that particular source. In order to do that, click on the "Create Base View" option.
  • Specify the following to create a base view.

  • Path: Input path for the delimited file or the directory containing the files.
  • Filename pattern: If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files. For example, if you want the base view to return the data of all the files with the extension CSV set the File name pattern to (.*)\.csv. Optional.
  • Delete after reading: Requests that the file or directory denoted by the path be deleted when the wrapper terminates.

  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.
  • Separator: delimiter between the values of a row. Default is the comma (,) and cannot be a line break (\n or \r).
  • Quote: Character used to encapsulate values containing special characters, such as the separator itself. Default is the double quote (") (see the sample file after this list).
  • Comment marker: Character marking the start of a line comment. Comments are disabled by default.
  • Escape: Escape character. Escapes are disabled by default.
  • Null value: String used to represent a null value. Default is none; nulls are not distinguished from empty strings.
  • Ignore spaces: Whether spaces around values are ignored. False by default.
  • Header: If selected, the wrapper considers that the first line contains the names of the fields in the file. These names will be the field names of the base views created from this wrapper. True by default.
  • Ignore matching errors: Whether the wrapper will ignore the lines of the file that do not have the expected number of columns. True by default. If you clear this check box, the wrapper will return an error if there is a row that does not have the expected structure. When you select this check box, you can check if the wrapper has ignored any row in a query in the execution trace, in the attribute “Number of invalid rows”.
  • File encoding: You can indicate the encoding of the files to read in this parameter.
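
To see how these parameters interact, consider a small hypothetical file orders.csv. With the default Separator (,), the default Quote (") and Header selected, it is parsed into three fields named id, customer and note, and the quoted values may safely contain the separator character:

    id,customer,note
    1,"Smith, John",regular customer
    2,"O'Brien, Pat",new account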


  • Click ‘Ok’ to create a base view.

  • Now, the base view created on top of the delimited file stored in the AWS S3 bucket is ready for execution and to be combined with the rest of the sources.

Creating a sample Custom Data Source to read Parquet files stored in an AWS S3 bucket

  • From the Virtual DataPort Administration tool, create a new data source by selecting “File > New > Data source > Custom”. This will open the wizard to create a connection to a data source with a custom wrapper.

  • Specify the wrapper’s Class name as com.denodo.connect.dfs.wrapper.S3ParquetFileWrapper (com.denodo.connect.hadoop.hdfs.wrapper.S3ParquetFileWrapper in Denodo 7). Also, check ‘Select Jars’ and select the jar file of the custom wrapper.

 

  • Click on Refresh Input Parameters (Refresh icon if using the Virtual DataPort Administration Tool).
  • After refreshing the input parameters, new input parameters will be available. If no input parameters are displayed, update your custom wrapper distribution to the latest version.
  • File system URI: A URI whose scheme and authority identify the file system, in this case Amazon S3. For example: s3a://<bucket>
  • Access Key ID: The access Key ID using s3a. This parameter sets the fs.s3a.access.key parameter.
  • Secret Access Key: The Secret Access Key using s3a. This parameter sets the fs.s3a.secret.key parameter.
  • IAM Role to Assume: The Amazon S3 IAM Role to Assume. This parameter sets the fs.s3a.assumed.role.arn parameter and is necessary to access S3 buckets with IAM Role access (see the sketch after this list).
  • Endpoint: The Amazon S3 endpoint using s3a. This parameter sets the fs.s3a.endpoint parameter. This parameter is used to set a specific region endpoint.
  • Use EC2 IAM credentials: If selected, the wrapper uses the com.amazonaws.auth.InstanceProfileCredentialsProvider to obtain the credentials from the actual EC2 instance. This functionality only works if the Denodo platform is running on an EC2 instance, and this instance has an IAM role configured.
  • Custom core-site.xml file: configuration file that overrides the default core parameters, except Access Key ID, Secret Access Key and Endpoint.
  • Thread Pool size: the maximum number of threads to allow in the pool. If it is not set, the value is calculated according to the available processors. This parameter only makes sense when Parquet files are going to be read in parallel.
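
For reference, the IAM role support relies on standard Hadoop S3A properties. The sketch below shows the equivalent entries in a custom core-site.xml, with a placeholder role ARN; normally you would simply fill in the IAM Role to Assume parameter and let the wrapper set fs.s3a.assumed.role.arn for you:

    <configuration>
      <!-- Standard Hadoop S3A credential provider for assumed roles -->
      <property>
        <name>fs.s3a.aws.credentials.provider</name>
        <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
      </property>
      <!-- Placeholder ARN of the IAM role to assume -->
      <property>
        <name>fs.s3a.assumed.role.arn</name>
        <value>arn:aws:iam::123456789012:role/denodo-s3-reader</value>
      </property>
    </configuration>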

Creating a Base View

  • Once the data source is created, create base views for that particular source. In order to do that, click on the "Create Base View" option.
  • Specify the following to create a base view.
  • Parquet File Path: path of the file that we want to read.
  • File name pattern: If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to. For example, (.*)\.parquet matches every file with the .parquet extension.
  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.
  • Parallelism type: Chooses the reading strategy.
  • Parallelism level: How many threads are going to read the Parquet file simultaneously, if parallelism is enabled. If it is not configured by the user, the value is calculated according to the available processors.
  • Cluster/partition fields: Fields by which the file was partitioned or clustered, if any. These fields act as a hint to the Automatic parallelism type, which chooses the optimum strategy to read the file.
  • Click ‘Ok’ to create a base view.

  • Now, the base view created on top of the Parquet files stored in the AWS S3 bucket is ready for execution and to be combined with the rest of the sources.

References

Virtual DataPort Administration Guide: Custom Sources

Denodo Distributed File System Custom Wrapper - User Manual

How To Video: How to Connect to AWS S3 from Denodo Platform
