
Denodo Distributed File System Custom Wrapper - User Manual

Warning

Although this wrapper is capable of reading files stored in HDFS, S3, Azure Blob Storage, Azure Data Lake Storage and Google Cloud Storage, most of its technical artifacts for Denodo 6.0 and Denodo 7.0 include HDFS in their names for legacy compatibility:

  • Jars: denodo-hdfs-custom-wrapper-xxx
  • Wrappers: com.denodo.connect.hadoop.hdfs.wrapper.HDFSxxxWrapper

For Denodo 8.0, the artifact names have been rebranded:

  • Jars: denodo-dfs-custom-wrapper-xxx
  • Wrappers: com.denodo.connect.dfs.wrapper.DFSxxxWrapper

Introduction

The Distributed File System Custom Wrapper distribution contains Virtual DataPort custom wrappers capable of reading several file formats stored in HDFS, S3, Azure Data Lake Storage, Azure Blob Storage, Azure Data Lake Storage Gen 2, Google Cloud Storage and Alibaba Object Storage Service (assumed roles are not supported).

Supported formats are:

  • Delimited text
  • Parquet
  • Avro
  • Sequence
  • Map

Also, there is a custom wrapper to retrieve information from the distributed file system and display it in a relational way:

  • DFSListFilesWrapper: This wrapper allows you to inspect distributed folders, retrieve lists of files (in a single folder or recursively) and filter files using any part of their metadata (file name, file size, last modification date, etc.).

Usage

The Distributed File System Custom Wrapper distribution consists of:

  • /conf: A folder containing a sample core-site.xml file with commented-out properties that you might need.

  • /dist:

  • denodo-(h)dfs-customwrapper-${version}.jar. The custom wrapper.

  • denodo-(h)dfs-customwrapper-${version}-jar-with-dependencies.jar. The custom wrapper plus its dependencies. This is the recommended wrapper, as it is easier to install in VDP.

  • /lib: All the dependencies required by this wrapper, in case you need to use the denodo-(h)dfs-customwrapper-${version}.jar.

Importing the custom wrapper into VDP

In order to use the Distributed File System Custom Wrapper in VDP, you must add it as an extension using the Admin Tool.

From the Distributed File System Custom Wrapper distribution, you will select the denodo-(h)dfs-customwrapper-${version}-jar-with-dependencies.jar file and upload it to VDP. No other jars are required as this one will already contain all the required dependencies.

Important

As the jar-with-dependencies version of this wrapper contains the Hadoop client libraries, increasing the JVM's heap space for the VDP Admin Tool is required to avoid a Java heap space error when uploading the jar to VDP.

Distributed File System extension in VDP

Creating a Distributed File System Data Source

Once the custom wrapper jar file has been uploaded to VDP, you can create new data sources for this custom wrapper --and their corresponding base views-- as usual.

Go to New → Data Source → Custom and specify one of the possible wrappers:

  • com.denodo.connect.hadoop.hdfs.wrapper.DFSListFilesWrapper
  • com.denodo.connect.dfs.wrapper.DFSListFilesWrapper (in Denodo 8.0)

  • com.denodo.connect.hadoop.hdfs.wrapper.HDFSAvroFileWrapper
  • com.denodo.connect.dfs.wrapper.DFSAvroFileWrapper (in Denodo 8.0)

  • com.denodo.connect.hadoop.hdfs.wrapper.HDFSDelimitedTextFileWrapper
  • com.denodo.connect.dfs.wrapper.DFSDelimitedTextFileWrapper (in Denodo 8.0)

  • com.denodo.connect.hadoop.hdfs.wrapper.HDFSMapFileWrapper
  • com.denodo.connect.dfs.wrapper.DFSMapFileWrapper (in Denodo 8.0)

  • com.denodo.connect.hadoop.hdfs.wrapper.HDFSParquetFileWrapper
  • com.denodo.connect.dfs.wrapper.DFSParquetFileWrapper (in Denodo 8.0)

  • com.denodo.connect.hadoop.hdfs.wrapper.HDFSSequenceFileWrapper
  • com.denodo.connect.dfs.wrapper.DFSSequenceFileWrapper (in Denodo 8.0)

  • com.denodo.connect.hadoop.hdfs.wrapper.S3ParquetFileWrapper
  • com.denodo.connect.dfs.wrapper.S3ParquetFileWrapper (in Denodo 8.0)

Also check ‘Select Jars’ and choose the jar file of the custom wrapper.

Distributed File System Data Source

Depending on the selected wrapper you will have different input parameters. To update the parameters, you must press the refresh button.

(H)DFSDelimitedTextFileWrapper

Custom wrapper for reading delimited text files.

Delimited text files store plain text and each line has values separated by a delimiter, such as tab, space, comma, etc.
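
For instance, a minimal comma-delimited file with a header row (the data is purely illustrative) could look like this:

id,name,amount
1,Alice,10.50
2,Bob,7.25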

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • HDFS: hdfs://<ip>:<port>.

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

        

  • Azure Data Lake Storage:

adl://<account name>.azuredatalakestore.net/

           For configuring the credentials see Azure Data Lake Storage section.

  • Azure Blob Storage:

wasb[s]://<container>\@<account>.blob.core.windows.net

For configuring the credentials see Azure Blob Storage section.

  • Azure Data Lake Storage Gen 2:

abfs[s]://<filesystem>\@<account>.dfs.core.windows.net

For configuring the credentials see Azure Data Lake Storage Gen 2 section.

  • Google Cloud Storage:

gs://<bucket>

For configuring the credentials see Google Storage section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Custom core-site.xml file (optional): configuration file that overrides the default core parameters.

  • Custom hdfs-site xml file (optional): configuration file that overrides the default HDFS parameters.

(H)DFSDelimitedTextFileWrapper data source edition

Once the custom wrapper datasource has been registered, you will be asked by VDP to create a base view for it. Its base views need the following parameters:

  • Path: input path for the delimited file or the directory containing the files.

  • File name pattern (optional): If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to.

For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. invoice_jan.csv, invoice_feb.csv, …, set the File name pattern to (.*)invoice_(.*)\\.csv (notice that the regular expression is escaped as explained in the note below). Files in these directories would be processed by the wrapper:

  • /accounting/invoices/2019/invoice_jan.csv
  • /accounting/invoices/2019/invoice_feb.csv

Note that the File name pattern value takes into account the full path of the file, so in the above example the pattern invoice_(.*)\\.csv would not find the sample results, as the full path starts with "/accounting...", not "invoice...".

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the File name pattern contains \, you have to enter \\.

  • Delete after reading: Requests that the file or directory denoted by the path be deleted when the wrapper terminates. Note that in case the "File name pattern" field contains a regular expression, instead of deleting the whole directory, only the files matching the regular expression will be deleted.

  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.

  • Separator (optional): delimiter between the values of a row. Default is the comma (,) and cannot be a line break (\n or \r).

Some “invisible” characters have to be entered in a special way:

Character        Meaning
\t               Tab
\f               Formfeed

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the separator is the tab character \t, you have to enter \\t.

! Note

When a separator longer than one character is used, compatibility with the standard comma-separated values format cannot be kept, and therefore the parameters Quote, Comment Marker, Escape, Null value and Ignore Spaces are not supported.

  • Quote (optional): Character used to encapsulate values containing special characters. Default is the double quote (“).

  • Comment marker (optional): Character marking the start of a line comment. Comments are disabled by default.

  • Escape (optional): Escape character. Escapes are disabled by default.

  • Null value (optional): String used to represent a null value. Default is: none; nulls are not distinguished from empty strings.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the null value is \N, you have to enter \\N.

  • Ignore spaces: Whether spaces around values are ignored. False by default.

  • Header:  If selected, the wrapper considers that the first line contains the names of the fields in this file. These names will be the fields’ names of the base views created from this wrapper. True by default.

  • Ignore matching errors: Whether the wrapper will ignore the lines of the file that do not have the expected number of columns. True by default.

If you clear this check box, the wrapper will return an error if there is a row that does not have the expected structure. When you select this check box, you can check if the wrapper has ignored any row in a query in the execution trace, in the attribute “Number of invalid rows”.

  • File encoding: You can indicate the encoding of the files to read in this parameter.

(H)DFSDelimitedTextFileWrapper base view edition

View schema

The execution of the wrapper returns the values contained in the file or group of files, if the Path input parameter denotes a directory.

View results

(H)DFSParquetFileWrapper

Custom wrapper for reading Parquet files.

Parquet is a column-oriented data store of the Hadoop ecosystem. It provides efficient data compression on a per-column level and encoding schemas.

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • HDFS: hdfs://<ip>:<port>.

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

  • Azure Data Lake Storage:

adl://<account name>.azuredatalakestore.net/

           For configuring the credentials see Azure Data Lake Storage section.

  • Azure Blob Storage:

wasb[s]://<container>\@<account>.blob.core.windows.net

For configuring the credentials see Azure Blob Storage section.

  • Azure Data Lake Storage Gen 2:

abfs[s]://<filesystem>\@<account>.dfs.core.windows.net

For configuring the credentials see Azure Data Lake Storage Gen 2 section.

  • Google Cloud Storage:

gs://<bucket>

For configuring the credentials see Google Storage section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Thread Pool size (optional): the maximum number of threads to allow in the pool. If it is not set, the value is calculated according to the available processors. This parameter makes sense when Parquet file(s) are going to be read in parallel. If the thread pool is going to be used by several view instances at the same time (due to the execution of an operation such as join or union, or simply the simultaneous execution of many instances of the same view), keep this in mind in order to provide enough threads.

  • Custom core-site.xml file (optional): configuration file that overrides the default core parameters.

  • Custom hdfs-site xml file (optional): configuration file that overrides the default HDFS parameters.

(H)DFSParquetFileWrapper data source edition

Once the custom wrapper datasource has been registered, you will be asked by VDP to create a base view for it. Its base views need the following parameters:

  • Parquet File Path: path of the file that you want to read.

! Note

A directory can be configured in the "Parquet File Path" parameter.

It is important to keep in mind that if the value is a directory, all the files included in this directory must have the same Parquet schema. If the schemas are not the same, there is no guarantee that the wrapper will be able to read them correctly.

  • File name pattern (optional): If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to.

For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. flights_jan.parquet, flights_feb.parquet, …, set the File name pattern to (.*)flights_(.*)\\.parquet (notice that the regular expression is escaped as explained in the note below). Files in these directories would be processed by the wrapper:

  • /airport/LAX/2019/flights_jan.parquet
  • /airport/LAX/2019/flights_feb.parquet

Note that the File name pattern value takes into account the full path of the file, so in the above example the pattern flights_(.*)\\.parquet would not find the sample results, as the full path starts with "/airport...", not "flights...".

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the File name pattern contains \, you have to enter \\.

  • Folder name exclusion (optional): If you want this wrapper to exclude any directory, you can add the directory name in this parameter. You can add the character “*” at the beginning or the end if you want to exclude directories beginning or ending with the specified prefix or suffix.

  • Include full path column: If selected, the wrapper adds a column with the full path of the file from which the data of each row was obtained.

  • Parallelism type (mandatory): Chooses the reading strategy:

  • No Parallelism. The file/s will be read sequentially, using only one thread.

  • Automatic. This option automatically tries to choose the optimum reading strategy, among No Parallelism, By File, By Row Group or By Column, analyzing the Parquet metadata.

  • Parallelism by File. The file/s will be read in parallel, one file per thread configured in Parallelism level.

  • Parallelism by Row Group. The file/s will be read in parallel, with the total row groups of the file(s) split across the number of threads configured in Parallelism level.

Note that this kind of parallelism is supported since Denodo 7.0, due to limitations in prior versions of Parquet libraries.

  • Parallelism by Column. The file/s will be read in parallel, with the total columns of the file(s) split across the number of threads configured in Parallelism level.

  • Parallelism level (optional): How many threads are going to read the Parquet file/s simultaneously, if parallelism is enabled. If it is not configured by the user, the value is calculated according to the available processors.

! Note

If multiple instances of views from this data source are executed at the same time (as a result of an operation like join or union, or simply the simultaneous execution of many instances of the same view), it is important to take into account the size of the thread pool defined in the data source as well as the level of parallelism defined in each view to be executed simultaneously.

  • Cluster/partition fields (optional): Fields by which the file was partitioned or clustered, if any. These fields will act as a hint to the Automatic parallelism type, which chooses the optimum strategy to read the file.
  • Ignore route errors (optional): If this option is checked, the wrapper will ignore the file indicated in "Parquet File Path" if it does not exist. This option only applies when you are using interpolation variables in the "Parquet File Path" field, because if you specify a single file it must exist.
  • Ignore file errors (optional): This option is used to ignore errors on files in the path specified in "Parquet File Path". The errors may occur because the Parquet files are corrupt or no longer exist at the time of reading (for example, because they are temporary files). When these errors are ignored, the ignored files can be listed by setting the logger "com.denodo.connect.dfs.reader.DFSParquetFileReader.IGNORED" to TRACE in Denodo 8. In Denodo 7 the logger is "com.denodo.connect.hadoop.hdfs.reader.HDFSParquetFileReader.IGNORED".
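
As a hedged sketch of how that logger could be enabled (the location and layout of the Denodo logging configuration file depend on your installation), a Log4j 2 configuration such as the one used by Denodo 8 would take a Logger entry like this:

<Logger name="com.denodo.connect.dfs.reader.DFSParquetFileReader.IGNORED" level="TRACE"/>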

(H)DFSParquetFileWrapper base view edition

View schema

The execution of the wrapper returns the values contained in the file.

View results

Queries optimization

  1. Projection push down

Denodo will read only the selected columns from the Parquet file, avoiding reading columns unnecessarily.

  2. Predicate push down

Denodo will evaluate filtering predicates in the query against metadata stored in the Parquet files. This avoids reading large amounts of data, improving query performance.

Supported data types are:

  • BINARY (UTF8, JSON, BSON), BOOLEAN, DOUBLE, FLOAT, INT32 (DECIMAL, DATE), INT64

and operators:

  • eq, ne, lt, le, gt, ge

  3. Partition pruning

This optimization is possible when the Parquet dataset is split across multiple directories, with each value of the partition column stored in a subdirectory, e.g. /requestsPartitioned/responseCode=500/

Denodo can omit large amounts of I/O when the partition column is referenced in the WHERE clause.
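
For instance, with the partitioned layout above, a query like the following (requests_parquet is an illustrative base view created from this wrapper and responsecode its partition column) would only read the files under the responseCode=500 subdirectory:

SELECT * FROM requests_parquet
WHERE responsecode = 500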

S3ParquetFileWrapper

Custom wrapper for reading Parquet files in S3.

Parquet is a column-oriented data store that provides efficient data compression on a per-column level and encoding schemas.

 

This wrapper has the same behaviour as (H)DFSParquetFileWrapper but it accesses S3 exclusively, and it is much easier to configure.

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Access Key ID (optional): The access Key ID using s3a. This parameter sets the fs.s3a.access.key parameter.

  • Secret Access Key (optional): The Secret Access Key using s3a. This parameter sets the fs.s3a.secret.key parameter.

  • IAM Role to Assume (optional): The Amazon S3 IAM Role to Assume. This parameter sets the fs.s3a.assumed.role.arn parameter. This parameter is necessary to access S3 buckets with IAM Role access.

  • Endpoint (optional): The S3 endpoint using s3a. This parameter sets the fs.s3a.endpoint parameter. This parameter is used to set a specific region endpoint. If not specified, the client will default to the us-east-1 region.

Endpoint is mandatory when accessing S3 compatible storage, that is, different from Amazon S3.

  • USE EC2 IAM credentials: If selected, the wrapper uses the com.amazonaws.auth.InstanceProfileCredentialsProvider to obtain the credentials from the actual EC2 instance. This functionality only works if the Denodo platform is running on an EC2 instance, and this instance has an IAM role configured.

  • Custom core-site.xml file (optional): configuration file that overrides the default core parameters, except Access Key ID, Secret Access Key and Endpoint.

  • Thread Pool size (optional): the maximum number of threads to allow in the pool. If it is not set, the value is calculated according to the available processors. This parameter makes sense when Parquet files are going to be read in parallel. If the thread pool is going to be used by several view instances at the same time (due to the execution of an operation such as join or union, or simply the simultaneous execution of many instances of the same view), keep this in mind in order to provide enough threads.

S3ParquetFileWrapper data source edition

You have different options to connect with an S3 bucket:

  • Public buckets: In this case you only need to configure File system URI and Parquet File Path to access the bucket.

  • IAM user: In this case you need to configure File system URI, Parquet File Path, Access Key ID and Secret Access Key to access the bucket.

  • IAM role: In this case you need to configure File system URI, Parquet File Path, Access Key ID, Secret Access Key and IAM Role to Assume to access the bucket.

  • Instance Role: In this case you only need to configure File system URI and Parquet File Path and select Use EC2 IAM credentials to access the bucket. This option is only valid when Denodo is running on an EC2 instance. If selected, the wrapper uses the com.amazonaws.auth.InstanceProfileCredentialsProvider to obtain the credentials from the actual EC2 instance.

If you need to configure any other parameter of the S3 connection you can use a Custom core-site.xml file, as explained in the S3 section.

Once the custom wrapper datasource has been registered, you will be asked by VDP to create a base view for it. Its base views need the following parameters:

  • Parquet File Path: path of the file that you want to read.

! Note

A directory can be configured in the "Parquet File Path" parameter.

It is important to keep in mind that if the value is a directory, all the files included in this directory must have the same Parquet schema. If the schemas are not the same, there is no guarantee that the wrapper will be able to read them correctly.

  • File name pattern (optional): If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to.

For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. flights_jan.parquet, flights_feb.parquet, …, set the File name pattern to (.*)flights_(.*)\\.parquet (notice that the regular expression is escaped as explained in the note below). Files in these directories would be processed by the wrapper:

  • /airport/LAX/2019/flights_jan.parquet
  • /airport/LAX/2019/flights_feb.parquet

Note that the File name pattern value takes into account the full path of the file, so in the above example the pattern flights_(.*)\\.parquet would not find the sample results, as the full path starts with "/airport...", not "flights...".

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the File name pattern contains \, you have to enter \\.

  • Folder name exclusion (optional): If you want this wrapper to exclude any directory, you can add the directory name in this parameter. You can add the character “*” at the beginning or the end if you want to exclude directories beginning or ending with the specified prefix or suffix.

  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.

  • Parallelism type (mandatory): Chooses the reading strategy:

  • No Parallelism. The file/s will be read sequentially, by only one thread.

  • Automatic. This option automatically tries to choose the optimum reading strategy, among No Parallelism, By File, By Row Group or By Column, analyzing the Parquet metadata.

  • Parallelism by File. The file/s will be read in parallel, one file per thread configured in Parallelism level.

  • Parallelism by Row Group. The file/s will be read in parallel, with the total row groups of the file(s) split across the number of threads configured in Parallelism level.

Note that this kind of parallelism is supported since Denodo 7.0, due to limitations in prior versions of Parquet libraries.

  • Parallelism by Column. The file/s will be read in parallel, with the total columns of the file(s) split across the number of threads configured in Parallelism level.

  • Parallelism level (optional): How many threads are going to read the Parquet file simultaneously, if the parallelism is enabled. If it is not configured by the user, the value is calculated according to the available processors.

! Note

If multiple instances of views from this data source are executed at the same time (as a result of an operation like join or union, or simply the simultaneous execution of many instances of the same view), it is important to take into account the size of the thread pool defined in the data source as well as the level of parallelism defined in each view to be executed simultaneously.

  • Cluster/partition fields (optional): Fields by which the file was partitioned or clustered, if any. These fields will act as a hint to the Automatic parallelism type, which chooses the optimum strategy to read the file.
  • Ignore route errors (optional): If this option is checked, the wrapper will ignore the file indicated in "Parquet File Path" if it does not exist. This option only applies when you are using interpolation variables in the "Parquet File Path" field, because if you specify a single file it must exist.
  • Ignore file errors (optional): This option is used to ignore errors on files in the path specified in "Parquet File Path". The errors may occur because the Parquet files are corrupt or no longer exist at the time of reading (for example, because they are temporary files). When these errors are ignored, the ignored files can be listed by setting the logger "com.denodo.connect.dfs.reader.DFSParquetFileReader.IGNORED" to TRACE in Denodo 8. In Denodo 7 the logger is "com.denodo.connect.hadoop.hdfs.reader.HDFSParquetFileReader.IGNORED".

S3ParquetFileWrapper base view edition

View schema

The execution of the wrapper returns the values contained in the file.

View results

Queries optimization

  1. Projection push down

Denodo will read only the selected columns from the Parquet file, avoiding reading columns unnecessarily.

  2. Predicate push down

Denodo will evaluate filtering predicates in the query against metadata stored in the Parquet files. This avoids reading large amounts of data, improving query performance.

Supported data types are:

  • BINARY (UTF8, JSON, BSON), BOOLEAN, DOUBLE, FLOAT, INT32 (DECIMAL, DATE), INT64

and operators:

  • eq, ne, lt, le, gt, ge

  3. Partition pruning

This optimization is possible when the Parquet dataset is split across multiple directories, with each value of the partition column stored in a subdirectory, e.g. /requestsPartitioned/responseCode=500/

Denodo can omit large amounts of I/O when the partition column is referenced in the WHERE clause.

(H)DFSAvroFileWrapper

Custom wrapper for reading Avro files.

Important

We recommend not using the (H)DFSAvroFileWrapper to access Avro files directly, as Avro is a serialization system mainly meant for use by applications running on the Hadoop cluster. Instead, we recommend using an abstraction layer on top of those files, such as Hive, Impala or Spark.

Avro is a row-based storage format for Hadoop which is widely used as a serialization platform. Avro stores the data definition (schema) in JSON format making it easy to read and interpret by any program. The data itself is stored in binary format making it compact and efficient.

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • HDFS: hdfs://<ip>:<port>.

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

  • Azure Data Lake Storage:

adl://<account name>.azuredatalakestore.net/

           For configuring the credentials see Azure Data Lake Storage section.

  • Azure Blob Storage:

wasb[s]://<container>\@<account>.blob.core.windows.net

For configuring the credentials see Azure Blob Storage section.

  • Azure Data Lake Storage Gen 2:

abfs[s]://<filesystem>\@<account>.dfs.core.windows.net

For configuring the credentials see Azure Data Lake Storage Gen 2 section.

  • Google Cloud Storage:

gs://<bucket>

For configuring the credentials see Google Storage section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Custom core-site.xml file (optional): configuration file that overrides the default core parameters.

  • Custom hdfs-site xml file (optional): configuration file that overrides the default HDFS parameters.

(H)DFSAvroFileWrapper data source edition

Once the custom wrapper datasource has been registered, you will be asked by VDP to create a base view for it. Its base views need the following parameters:

  • File name pattern (optional): If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to.

For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. employees_jan.avro, employees_feb.avro, …, set the File name pattern to (.*)employees_(.*)\\.avro (notice that the regular expression is escaped as explained in the note below). Files in these directories would be processed by the wrapper:

  • /hr/2019/employees_jan.avro
  • /hr/2019/employees_feb.avro

Note that the File name pattern value takes into account the full path of the file, so in the above example the pattern employees_(.*)\\.avro would not find the sample results, as the full path starts with "/hr...", not "employees...".

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the File name pattern contains \, you have to enter \\.

  • Delete after reading: Requests that the file denoted by the path be deleted when the wrapper terminates. Note that in case the "File name pattern" field contains a regular expression, instead of deleting the whole directory, only the files matching the regular expression will be deleted.

  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.

There are also two parameters that are mutually exclusive:

  • Avro schema path: input path for the Avro schema file or

  • Avro schema JSON: JSON of the Avro schema.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }) in the Avro schema JSON parameter, you have to escape these characters with \. For example:

\{
  "type": "map",
  "values": \{
    "type": "record",
    "name": "ATM",
    "fields": [
      \{ "name": "serial_no", "type": "string" \},
      \{ "name": "location", "type": "string" \}
    ]
  \}
\}

(H)DFSAvroFileWrapper base view edition

Content of the /user/cloudera/schema.avsc file:

{"type" : "record",

  "name" : "Doc",

  "doc" : "adoc",

  "fields" : [ {

    "name" : "id",

    "type" : "string"

  }, {

    "name" : "user_friends_count",

    "type" : [ "int", "null" ]

  }, {

    "name" : "user_location",

    "type" : [ "string", "null" ]

  }, {

    "name" : "user_description",

    "type" : [ "string", "null" ]

  }, {

    "name" : "user_statuses_count",

    "type" : [ "int", "null" ]

  }, {

    "name" : "user_followers_count",

    "type" : [ "int", "null" ]

  }, {

    "name" : "user_name",

    "type" : [ "string", "null" ]

  }, {

    "name" : "user_screen_name",

    "type" : [ "string", "null" ]

  }, {

    "name" : "created_at",

    "type" : [ "string", "null" ]

  }, {

    "name" : "text",

    "type" : [ "string", "null" ]

  }, {

    "name" : "retweet_count",

    "type" : [ "int", "null" ]

  }, {

    "name" : "retweeted",

    "type" : [ "boolean", "null" ]

  }, {

    "name" : "in_reply_to_user_id",

    "type" : [ "long", "null" ]

  }, {

    "name" : "source",

    "type" : [ "string", "null" ]

  }, {

    "name" : "in_reply_to_status_id",

    "type" : [ "long", "null" ]

  }, {

    "name" : "media_url_https",

    "type" : [ "string", "null" ]

  }, {

    "name" : "expanded_url",

    "type" : [ "string", "null" ]

  } ] }                        

View schema

The execution of the view returns the values contained in the Avro file specified in the WHERE clause of the VQL sentence:

 SELECT * FROM avro_ds_file
 WHERE avrofilepath = '/user/cloudera/file.avro'

View results

After applying a flattening operation, the results are as follows.

Flattened results

Field Projection

The recommended way for dealing with projections in (H)DFSAvroFileWrapper is by means of the JSON schema parameters:

  • Avro schema path or
  • Avro schema JSON

By giving the wrapper a JSON schema containing only the fields you are interested in, the reader used by the (H)DFSAvroFileWrapper will return only those fields to VDP, making the select operation faster.

If you configure the parameter Avro schema JSON with only some of the fields of the /user/cloudera/schema.avsc file used in the previous example, like in the example below (notice the escaped characters):

Schema with the selected fields:

\{
  "type" : "record",
  "name" : "Doc",
  "doc" : "adoc",
  "fields" : [ \{
    "name" : "id",
    "type" : "string"
  \}, \{
    "name" : "user_friends_count",
    "type" : [ "int", "null" ]
  \}, \{
    "name" : "user_location",
    "type" : [ "string", "null" ]
  \}, \{
    "name" : "user_followers_count",
    "type" : [ "int", "null" ]
  \}, \{
    "name" : "user_name",
    "type" : [ "string", "null" ]
  \}, \{
    "name" : "created_at",
    "type" : [ "string", "null" ]
  \} ]
\}

the base view in VDP will contain a subset of the fields of the previous example's base view: the ones matching the new JSON schema provided to the wrapper.

Base view with the selected fields

View results with the selected fields

(H)DFSSequenceFileWrapper

Custom wrapper for reading sequence files. 

Sequence files are binary record-oriented files, where each record has a serialized key and a serialized value.

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • HDFS: hdfs://<ip>:<port>. 

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

  • Azure Data Lake Storage:

adl://<account name>.azuredatalakestore.net/

           For configuring the credentials see Azure Data Lake Storage section.

  • Azure Blob Storage:

wasb[s]://<container>\@<account>.blob.core.windows.net

For configuring the credentials see Azure Blob Storage section.

  • Azure Data Lake Storage Gen 2:

abfs[s]://<filesystem>\@<account>.dfs.core.windows.net

For configuring the credentials see Azure Data Lake Storage Gen 2 section.

  • Google Cloud Storage:

gs://<bucket>

For configuring the credentials see Google Storage section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Custom core-site.xml file (optional): configuration file that overrides the default core parameters.

  • Custom hdfs-site xml file (optional): configuration file that overrides the default HDFS parameters.

(H)DFSSequenceFileWrapper data source edition

Once the custom wrapper datasource has been registered, you will be asked by VDP to create a base view for it. Its base views need the following parameters:

  • Path: input path for the sequence file or the directory containing the files.

  • File name pattern (optional): If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to.

For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. file_1555297166.seq, file_1555300766.seq, …, set the File name pattern to (.*)file_(.*)\\.seq (notice that the regular expression is escaped as explained in the note below). Files in these directories would be processed by the wrapper:

  • /result/file_1555297166.seq
  • /result/file_1555300766.seq

Note that the File name pattern value takes into account the full path of the file, so in the above example the pattern file_(.*)\\.seq would not find the sample results, as the full path starts with "/result...", not "file...".

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the File name pattern contains \, you have to enter \\.

  • Delete after reading: Requests that the file or directory denoted by the path be deleted when the wrapper terminates. Note that in case the "File name pattern" field contains a regular expression, instead of deleting the whole directory, only the files matching the regular expression will be deleted.

  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.

  • Key class: key class name implementing org.apache.hadoop.io.Writable interface.

  • Value class: value class name implementing org.apache.hadoop.io.Writable interface.
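
For example (an illustrative case, not a requirement of the wrapper), a sequence file written as (LongWritable, Text) records would be configured with:

Key class: org.apache.hadoop.io.LongWritable
Value class: org.apache.hadoop.io.Text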

(H)DFSSequenceFileWrapper base view edition

View schema

The execution of the wrapper returns the key/value pairs contained in the file or group of files, if the Path input parameter denotes a directory.

View results

(H)DFSMapFileWrapper

Custom wrapper for reading map files.

A map is a directory containing two sequence files. The data file (/data) is identical to the sequence file and contains the data stored as binary key/value pairs. The index file (/index) contains a key/value map with seek positions inside the data file to quickly access the data.

 

Map file format

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • HDFS: hdfs://<ip>:<port>. 

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

  • Azure Data Lake Storage:

adl://<account name>.azuredatalakestore.net/

           For configuring the credentials see Azure Data Lake Storage section.

  • Azure Blob Storage:

wasb[s]://<container>\@<account>.blob.core.windows.net

For configuring the credentials see Azure Blob Storage section.

  • Azure Data Lake Storage Gen 2:

abfs[s]://<filesystem>\@<account>.dfs.core.windows.net

For configuring the credentials see Azure Data Lake Storage Gen 2 section.

  • Google Cloud Storage:

gs://<bucket>

For configuring the credentials see Google Storage section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Custom core-site.xml file (optional): configuration file that overrides the default core parameters.

  • Custom hdfs-site xml file (optional): configuration file that overrides the default HDFS parameters.

(H)DFSMapFileWrapper data source edition

Once the custom wrapper datasource has been registered, you will be asked by VDP to create a base view for it. Its base views need the following parameters:

  • Path: input path for the directory containing the map file. The path to the index or data file can also be specified. When using S3, which is a flat file system with no folder concept, the path to the index or data file should be used.

  • File name pattern (optional): If you want this wrapper to only obtain data from some of the files of the directory, you can enter a regular expression that matches the names of these files, including the sequence of directories they belong to.

For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. invoice_jan.whatever, invoice_feb.whatever, …, set the File name pattern to (.*)invoice_(.*)\\.whatever (notice that the regular expression is escaped as explained in the note below). Files in these directories would be processed by the wrapper:

  • /accounting/invoices/2019/invoice_jan.whatever
  • /accounting/invoices/2019/invoice_feb.whatever

Note that the File name pattern value takes into account the full path of the file, so in the above example the pattern invoice_(.*)\\.whatever would not find the sample results, as the full path starts with "/accounting...", not "invoice...".

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the File name pattern contains \, you have to enter \\.

  • Delete after reading: Requests that the file or directory denoted by the path be deleted when the wrapper terminates. Note that in case the "File name pattern" field contains a regular expression, instead of deleting the whole directory, only the files matching the regular expression will be deleted.

  • Include full path column: If selected, the wrapper adds a column in the view with the full path of the file from which the data of every row are obtained.

  • Key class: key class name implementing the org.apache.hadoop.io.WritableComparable interface. WritableComparable is used because records are sorted in key order.

  • Value class: value class name implementing the org.apache.hadoop.io.Writable interface.

(H)DFSMapFileWrapper base view edition

View schema

The execution of the wrapper returns the key/value pairs contained in the file or group of files, if the Path input parameter denotes a directory.

View results

WebHDFSFileWrapper

Warning

WebHDFSFileWrapper is deprecated.

  • For XML, JSON and Delimited files the best alternative is using the VDP standard data sources, with the HTTP Client in their Data route parameter. These data sources offer a better solution for HTTP/HTTPS access as they include proxy access, SPNEGO authentication, OAuth2, etc.

  • For Avro, Sequence, Map and Parquet files the best alternative is using the specific custom wrapper type: HDFSAvroFileWrapper, HDFSSequenceFileWrapper, HDFSMapFileWrapper or HDFSParquetFileWrapper, with the webhdfs scheme in their File system URI parameter and their credentials placed in the XML configuration files.
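
For reference, a WebHDFS file system URI typically takes the form below (host and port are placeholders; the NameNode HTTP port depends on your Hadoop version and configuration):

webhdfs://<namenode host>:<namenode http port>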

DFSListFilesWrapper

Custom wrapper to retrieve file information from a distributed file system.

This custom wrapper data source needs the following parameters:

  • File system URI: A URI whose scheme and authority identify the file system.

  • HDFS: hdfs://<ip>:<port>. 

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

  • Azure Data Lake Storage:

adl://<account name>.azuredatalakestore.net/

           For configuring the credentials see Azure Data Lake Storage section.

  • Azure Blob Storage:

wasb[s]://<container>\@<account>.blob.core.windows.net

For configuring the credentials see Azure Blob Storage section.

  • Azure Data Lake Storage Gen 2:

abfs[s]://<filesystem>\@<account>.dfs.core.windows.net

For configuring the credentials see Azure Data Lake Storage Gen 2 section.

  • Google Cloud Storage:

gs://<bucket>

For configuring the credentials see Google Cloud Storage section.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the URI contains @, you have to enter \@.

  • Custom hdfs-site xml file (optional): configuration file that overrides the default parameters, like the credentials ones.

DFSListFilesWrapper data source edition

The entry point for querying the wrapper is the parameter parentfolder. The wrapper will list the files located in the supplied directory. It is possible to do this recursively, also retrieving the contents of the subfolders, by setting the parameter recursive to true.

Execution panel

The schema of the custom wrapper contains the following columns:

  • parentfolder: the path of the parent directory. The wrapper will list all the files located in this directory

  • This parameter is mandatory in a SELECT operation

  • relativepath: the location of the file relative to the parentfolder. Useful when executing a recursive query.

  • filename: the name of the file or folder, including the extension for files.

  • extension: the extension of the file. It will be null if the file is a directory.

  • fullpath: the full path of the file with the scheme information.

  • pathwithoutscheme: the full path of the file without the scheme information.

  • filetype: either ‘file’ or ‘directory’.

  • encrypted:  true if the file is encrypted, false otherwise.

  • datemodified: the modification time of the file in milliseconds since January 1, 1970 UTC.

  • owner: the owner of the file.

  • group: the group associated with the file.

  • permissions: the permissions of the file, using the symbolic notation (rwxr-xr-x).

  • size: the size of the file in bytes. It will be null for folders.

  • recursive: if false the search for files will be limited to the files that are direct children of the parentfolder. If true, the search will be done recursively, including subfolders of parentfolder.

  • This parameter is mandatory in a SELECT operation

View schema

The following VQL sentence returns the files in the ‘/user/cloudera’ hdfs directory, recursively:

SELECT * FROM listing_dfs
WHERE parentfolder = '/user/cloudera' AND recursive = true

View results

You can filter the query a bit more and retrieve only those files that were modified after '2018-09-01':

SELECT * FROM listing_dfs
WHERE parentfolder = '/user/cloudera' AND recursive = true
AND datemodified > DATE '2018-09-01'

View results

Extending capabilities with the DFSListFilesWrapper

The wrappers of this distribution that read file formats like delimited text, Parquet, Avro, Sequence or Map can increase their capabilities when combined with the DFSListFilesWrapper.

As all of these wrappers need an input path for the file or the directory that is going to be read, you can use the DFSListFilesWrapper for retrieving the file paths that you are interested in, according to some attribute value of their metadata, e.g. modification time.

For example, suppose that you want to retrieve the files in the /user/cloudera/df/awards directory that were modified in November.

The following steps explain how to configure this scenario:

  1. Create a DFSListFilesWrapper base view that will list the files of the /user/cloudera/df/awards directory.

  2. Create an HDFSDelimitedTextFileWrapper base view that will read the content of the csv files.

Parameterize the Path of the base view by adding an interpolation variable to its value, e.g. @path (@ is the prefix that identifies a value parameter as an interpolation variable).

By using the variable @path, you do not have to provide the final path value when creating the base view. Instead, the values of the Path parameter will be provided at runtime by the DFSListFilesWrapper view through the join operation (configured in the next step).

  3. Create a derived view joining the two previously created views. The join condition will be:

DFSListFilesWrapper.pathwithoutscheme = HDFSDelimitedTextFileWrapper.path

  4. By executing the join view with these conditions:

SELECT * FROM join:view
WHERE recursive = true
      AND parentfolder = '/user/cloudera/df/awards'
      AND datemodified > DATE '2018-11-1'

you obtain data only from the delimited files that were modified in November.

S3

The Distributed File System Custom Wrapper can access data stored in S3 with the following Hadoop FileSystem clients:

  • s3a 

Compatible with files created by the older s3n:// client and Amazon EMR’s s3:// client.

Configuring S3A authentication properties

S3A supports several authentication mechanisms. By default the custom wrapper will search for credentials in the following order:

  1. In the configuration files.

For using this authentication method, declare the credentials (access and secret keys) in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR ACCESS KEY ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR SECRET ACCESS KEY</value>
</property>
</configuration>

  2. Then, the environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are looked for.

  3. Otherwise, an IAM role will be used to retrieve a set of temporary credentials.

An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs. This mechanism is available only when running your application on an Amazon EC2 instance and there is an IAM role associated with the instance, but provides the greatest ease of use and best security when working with Amazon EC2 instances.

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
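
As a sketch of that approach (the provider path /tmp/s3.jceks is just an example), the keys can be created with the hadoop credential command and then referenced from the configuration instead of being written in clear text:

hadoop credential create fs.s3a.access.key -provider jceks://file/tmp/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://file/tmp/s3.jceks

<property>
  <name>hadoop.security.credential.provider.path</name>
  <value>jceks://file/tmp/s3.jceks</value>
</property>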

Temporary Security Credentials

Temporary Security Credentials can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.

To authenticate with these:

  1. Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.

  2. Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR ACCESS KEY ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR SECRET ACCESS KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
</configuration>

The lifetime of session credentials is fixed when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS, so you must get a new set of credentials.

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.

Using IAM Assumed Roles

To use assumed roles, the wrapper must be configured to use the Assumed Role Credential Provider, org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider, in the configuration option fs.s3a.aws.credentials.provider in the wrapper configuration file Custom core-site.xml.

This Assumed Role Credential provider will read in the fs.s3a.assumed.role.* options needed to connect to the Session Token Service Assumed Role API:

  1. First authenticating with the full credentials. This means the normal fs.s3a.access.key and fs.s3a.secret.key pair, environment variables, or some other supplier of long-lived secrets.

If you wish to use a different authentication mechanism other than org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, set it in the property fs.s3a.assumed.role.credentials.provider.

  2. Then assuming the specific role specified in fs.s3a.assumed.role.arn.

  3. It will then refresh this login at the configured rate in fs.s3a.assumed.role.session.duration.

Below you can see the properties required for configuring IAM Assumed Roles in this custom wrapper, using its configuration file, Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
</property>
<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value>YOUR AWS ROLE</value>
  <description>
    AWS ARN for the role to be assumed. Required if the
    fs.s3a.aws.credentials.provider contains
    org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
  </description>
</property>
<property>
  <name>fs.s3a.assumed.role.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
  <description>
    List of credential providers to authenticate with the
    STS endpoint and retrieve short-lived role credentials.
    Only used if AssumedRoleCredentialProvider is the AWS credential
    provider. If unset, uses
    "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider".
  </description>
</property>
<property>
  <name>fs.s3a.assumed.role.session.duration</name>
  <value>30m</value>
  <description>
    Duration of assumed roles before a refresh is attempted.
    Only used if AssumedRoleCredentialProvider is the AWS credential
    provider.
    Range: 15m to 1h
  </description>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR ACCESS KEY ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR SECRET ACCESS KEY</value>
</property>
</configuration>

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.

S3 Compatible Storage properties

Bucket names cannot contain underscores, see S3 naming conventions.

The property fs.s3a.endpoint is mandatory when accessing S3 compatible storage, that is, different from Amazon S3.

The property fs.s3a.path.style.access could be mandatory, depending on whether the virtual host style addressing or path style addressing is being used (by default the host style is enabled):

  • Host style: http://bucket.endpoint/object
  • Path style: http://endpoint/bucket/object

Also note in the configuration below that SSL can be enabled or disabled. If it is enabled, the Denodo server has to be configured to validate the SSL certificate.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>fs.s3a.endpoint</name>
  <value>IP ADDRESS TO CONNECT TO</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true/false</value>
  <description>Enables S3 path style access, that is, disabling the
  default virtual hosting behavior (default: false)</description>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR ACCESS KEY ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR SECRET ACCESS KEY</value>
</property>
<property>
  <name>fs.s3a.connection.ssl.enabled</name>
  <value>true/false</value>
  <description>Enables or disables SSL connections to S3 (default: true)</description>
</property>
</configuration>

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.

Signature Version 4 support

Connecting to AWS regions that only support V4 of the AWS Signature protocol (those created since January 2014) requires the region's endpoint URL to be specified explicitly. This is done in the configuration option fs.s3a.endpoint in the Custom core-site.xml parameter of the wrapper, or in the corresponding data source configuration input when using an S3-specific wrapper implementation. You can use the core-site.xml, located in the conf folder of the distribution, as a guide. Otherwise, a Bad Request exception could be thrown.

For example, the endpoint for S3 Frankfurt is s3.eu-central-1.amazonaws.com:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

  <name>fs.s3a.endpoint</name>

  <value>s3.eu-central-1.amazonaws.com</value>

  <description>AWS S3 endpoint to connect to. An up-to-date list is

    provided in the AWS documentation: regions and endpoints. Without

    this property, the standard region (s3.amazonaws.com) is assumed.

  </description>

</property>

</configuration>

You can find the full list of supported signature versions for AWS Regions on the AWS website: Amazon Simple Storage Service (Amazon S3).

Connecting to AWS PrivateLink VPC Interface Endpoints

When connecting to Amazon S3 via a VPC Interface Endpoint, the endpoint URL needs to be explicitly specified as explained above, and it will have a format such as https://bucket.vpce-xxxxx.s3.eu-west-1.vpce.amazonaws.com. In these cases, besides the endpoint URL, the region to connect to also needs to be explicitly specified by means of the fs.s3a.endpoint.region property in the core-site.xml configuration file:

<property>

  <name>fs.s3a.endpoint.region</name>

  <value>eu-west-1</value>

</property>

Also, if you want to assume a role, it is necessary to define an STS VPC Interface Endpoint for this service and to specify the fs.s3a.assumed.role.sts.endpoint and fs.s3a.assumed.role.sts.endpoint.region properties in the core-site.xml configuration file:

<property>

  <name>fs.s3a.assumed.role.sts.endpoint</name>

  <value>vpce-xxxxx.sts.us-east-1.vpce.amazonaws.com</value>

</property>

<property>

  <name>fs.s3a.assumed.role.sts.endpoint.region</name>

  <value>us-east-1</value>

</property>

Azure Data Lake Storage

The Distributed File System Custom Wrapper can access data stored in Azure Data Lake Storage.

Configuring authentication properties

Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

     <name>fs.adl.oauth2.access.token.provider.type</name>

     <value>ClientCredential</value>

 </property>

 <property>

     <name>fs.adl.oauth2.refresh.url</name>

     <value>YOUR TOKEN ENDPOINT</value>

 </property>

 <property>

     <name>fs.adl.oauth2.client.id</name>

     <value>YOUR CLIENT ID</value>

 </property>

 <property>

     <name>fs.adl.oauth2.credential</name>

     <value>YOUR CLIENT SECRET</value>

 </property>

 </configuration>

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.

Warning

TLS 1.0 and 1.1 support will be removed for new and existing Azure storage accounts starting November 2024. In addition, recent TLS 1.3 support in Azure storage accounts may cause connections to Azure storage using SSL/TLS to fail, which could result in a timeout. In that case, include the following JVM parameters to specify the TLS versions Virtual DataPort should allow, excluding version 1.3. For instance:

-Dhttps.protocols="TLSv1,TLSv1.1,TLSv1.2" -Djdk.tls.client.protocols="TLSv1,TLSv1.1,TLSv1.2"

Visit the official Azure documentation for more information about this issue.

Azure Blob Storage

The Distributed File System Custom Wrapper can access data stored in Azure Blob Storage.

Configuring authentication properties

Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

     <name>fs.azure.account.key.<account>.blob.core.windows.net</name>

     <value>YOUR ACCESS KEY</value>

  </property>

</configuration>

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.

Azure Data Lake Storage Gen 2

Since the Distributed File System Custom Wrapper for Denodo 7.0 (this functionality requires Java 8), this wrapper can access data stored in Azure Data Lake Storage Gen 2.

By default, ADLS Gen 2 uses TLS with both abfs:// and abfss://. When you set the property fs.azure.always.use.https=false, TLS is disabled for abfs:// but remains enabled for abfss://.
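As a sketch, this is how that property would be set in the Custom core-site.xml to disable TLS for abfs:// URIs (usually not recommended):

<property>

  <name>fs.azure.always.use.https</name>

  <value>false</value>

</property>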

Configuring authentication properties

To configure the authentication properties place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

You can choose between these two authentication methods:

  1. OAuth 2.0:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

  <name>fs.azure.account.auth.type</name>

  <value>OAuth</value>

</property>

<property>

  <name>fs.azure.account.oauth.provider.type</name>

  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.endpoint</name>

  <value>URL of OAuth endpoint</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.id</name>

  <value>CLIENT-ID</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.secret</name>

  <value>SECRET</value>

</property>

</configuration>

  2. Shared Key: using the storage account’s authentication secret:

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>

     <name>fs.azure.account.key.<account>.dfs.core.windows.net</name>

     <value>YOUR ACCOUNT KEY</value>

  </property>

</configuration>

Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.

Google Cloud Storage

Since the Distributed File System Custom Wrapper for Denodo 7.0 (this functionality requires Java 8), this wrapper can access data stored in Google Cloud Storage.

Configuring authentication properties

Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

   

 <property>

   <name>google.cloud.auth.service.account.enable</name>

   <value>true</value>

  <description>Whether to use a service account for GCS authorization.

   If an email and keyfile are provided, then that service account

   will be used. Otherwise, the connector will check whether it is running

   on a GCE VM with some level of GCS access in its service account

   scope, and use that service account.</description>

 </property>

 <property>

   <name>google.cloud.auth.service.account.json.keyfile</name>

   <value>/PATH/TO/KEYFILE</value>

   <description>The JSON key file of the service account used for GCS

   access when google.cloud.auth.service.account.enable is  

   true.</description>

 </property>

</configuration>

Permissions

Wrappers that read file contents from Google Cloud Storage, such as (H)DFSDelimitedTextFileWrapper, (H)DFSAvroFileWrapper, etc., require the storage.objects.get permission.

The DFSListFilesWrapper, as it lists files from buckets, requires the storage.buckets.get permission.

For more information on roles and permissions see https://cloud.google.com/storage/docs/access-control/iam-roles.
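As a hedged example (the bucket and service account names below are hypothetical), roles containing these permissions can be granted with gsutil; roles/storage.objectViewer includes storage.objects.get, and a bucket-level role such as roles/storage.legacyBucketReader includes storage.buckets.get:

gsutil iam ch serviceAccount:my-sa@my-project.iam.gserviceaccount.com:roles/storage.objectViewer gs://my-bucket

gsutil iam ch serviceAccount:my-sa@my-project.iam.gserviceaccount.com:roles/storage.legacyBucketReader gs://my-bucket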

Alibaba OSS (Object Storage Service)

The Distributed File System Custom Wrapper can access data stored in Alibaba Object Storage Service (OSS). OSS is compatible with S3, but this custom wrapper does not support IAM roles with it.

OSS using S3ParquetFileWrapper

It is possible to access the Object Storage Service (OSS) using S3ParquetFileWrapper. The following parameters are required:

  • File system URI: A URI whose scheme and authority identify the file system.

  • S3: s3a://<bucket>.

<bucket> cannot contain underscores, see S3 naming conventions.

For configuring the credentials see S3 section.

  • Access Key ID (optional): The access Key ID using s3a. This parameter sets the fs.s3a.access.key parameter.

  • Secret Access Key (optional): The Secret Access Key using s3a. This parameter sets the fs.s3a.secret.key parameter.

  • Endpoint (optional): The S3 endpoint using s3a. This parameter sets the fs.s3a.endpoint parameter.

Although the Endpoint parameter is marked as optional in the wrapper, it is mandatory when connecting to OSS.
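As an illustrative sketch (the bucket name and region below are hypothetical), a typical OSS configuration of these parameters could be:

  • File system URI: s3a://my-bucket
  • Access Key ID: YOUR ACCESS KEY ID
  • Secret Access Key: YOUR SECRET ACCESS KEY
  • Endpoint: oss-eu-central-1.aliyuncs.com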



! Note

IAM roles are not supported by the Denodo Distributed File System Custom Wrapper. You cannot use the IAM Role to Assume parameter to access OSS with an assumed role.

Configuring authentication properties

For access to OSS, declare the credentials (access key, secret key and endpoint) in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

  <name>fs.s3a.endpoint</name>

  <value>IP ADDRESS TO CONNECT TO</value>

</property>

<property>

  <name>fs.s3a.access.key</name>

  <value>YOUR ACCESS KEY ID</value>

</property>

<property>

  <name>fs.s3a.secret.key</name>

  <value>YOUR SECRET ACCESS KEY</value>

</property>

</configuration>

Limitations

Authentication assuming a RAM role is not supported by the Denodo Distributed File System Custom Wrapper. It is not possible to configure an assumed role, either with the IAM Role to Assume parameter of S3ParquetFileWrapper or with the fs.s3a.assumed.role.arn property in the core-site.xml file.

Compressed Files

The Distributed File System Custom Wrapper transparently reads compressed files in any of these compression formats:

  • gzip
  • DEFLATE (zlib)
  • bzip2
  • snappy
  • LZO
  • LZ4
  • Zstandard

Secure cluster with Kerberos

The configuration required for accessing a Hadoop cluster with Kerberos enabled is the same as the one needed to access the distributed file system and, additionally, the user must supply the Kerberos credentials.

The Kerberos parameters are:

  • Kerberos enabled: Check it when accessing a Hadoop cluster with Kerberos enabled.

  • Kerberos principal name (optional): Kerberos v5 Principal name, e.g. primary/instance\@realm.

! Note

If you enter a literal that contains one of the special characters used to indicate interpolation variables (@, \, ^, {, }), you have to escape these characters with \.

E.g. if the Kerberos principal name contains @, you have to enter \@.

  • Kerberos keytab file (optional): Keytab file containing the key of the Kerberos principal.

  • Kerberos password (optional): Password associated with the principal.

  • Kerberos Distribution Center (optional): Kerberos Key Distribution Center.

The Distributed File System Custom Wrapper provides three ways for accessing a kerberized Hadoop cluster:

  1. The client has a valid Kerberos ticket in the ticket cache obtained, for example, using the kinit command in the Kerberos Client.

In this case only the Kerberos enabled parameter should be checked. The  wrapper would use the Kerberos ticket to authenticate itself against the Hadoop cluster.

  2. The client does not have a valid Kerberos ticket in the ticket cache. In this case you should provide the Kerberos principal name parameter and

  1. Kerberos keytab file parameter or
  2. Kerberos password parameter.

In all three scenarios the krb5.conf file should be present in the file system. Below is an example of the Kerberos configuration file:

[libdefaults]

  renew_lifetime = 7d

  forwardable = true

  default_realm = EXAMPLE.COM

  ticket_lifetime = 24h

  dns_lookup_realm = false

  dns_lookup_kdc = false

[domain_realm]

  sandbox.hortonworks.com = EXAMPLE.COM

  cloudera = CLOUDERA

[realms]

  EXAMPLE.COM = {

    admin_server = sandbox.hortonworks.com

    kdc = sandbox.hortonworks.com

  }

 CLOUDERA = {

  kdc = quickstart.cloudera

  admin_server = quickstart.cloudera

  max_renewable_life = 7d 0h 0m 0s

  default_principal_flags = +renewable

 }

[logging]

  default = FILE:/var/log/krb5kdc.log

  admin_server = FILE:/var/log/kadmind.log

  kdc = FILE:/var/log/krb5kdc.log

The algorithm to locate the krb5.conf file is the following:

  • If the system property java.security.krb5.conf is set, its value is assumed to specify the path and file name.

  • If that system property value is not set, then the configuration file is looked for in the directory

  • <java-home>\lib\security (Windows)
  • <java-home>/lib/security (Solaris and Linux)
  • If the file is still not found, then an attempt is made to locate it as follows:

  • /etc/krb5/krb5.conf (Solaris)
  • c:\winnt\krb5.ini (Windows)
  • /etc/krb5.conf (Linux)

There is an exception: if you are planning to create VDP views that use the same Key Distribution Center and the same realm, the Kerberos Distribution Center parameter can be provided instead of having the krb5.conf file in the file system.
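If the krb5.conf file lives in a non-default location, one possible approach (the path below is only an example) is to point the JVM at it explicitly by adding the java.security.krb5.conf system property to the VDP Server JVM options:

-Djava.security.krb5.conf=/opt/denodo/conf/krb5.conf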

Data source edition

Troubleshooting

Symptom

Error message: “org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block”.

Resolution

Add this property to the Custom hdfs-site.xml file:

<property>

<name>dfs.client.use.datanode.hostname</name>

<value>true</value>

</property>

Symptom

Error message: “SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]”.

Resolution

You are trying to connect to a Kerberos-enabled Hadoop cluster. You should configure the custom wrapper accordingly. See Secure cluster with Kerberos section for configuring Kerberos on this custom wrapper.

Symptom

Error message: “Cannot get Kerberos service ticket: KrbException: Server not found in Kerberos database (7) ”.

Resolution

Check that nslookup is returning the fully qualified hostname of the KDC. If not, modify the /etc/hosts of the client machine for the KDC entry to be of the form "IP address fully.qualified.hostname alias".
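For example (the IP address and host names are hypothetical), the /etc/hosts entry for the KDC could look like this:

192.0.2.10   kdc.example.com   kdc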

Symptom

Error message: “This authentication mechanism is no longer supported".

Resolution

The method of placing credentials in the URL, s3a://<id>:<secret>@<bucket>, is discouraged. Configure the credentials on the core-site.xml instead (see S3 section).

Symptom

Error message: “Could not initialize class org.xerial.snappy.Snappy”.

Resolution

On Linux platforms, an error may occur when Snappy compression/decompression is enabled, even though its native library is available on the classpath.

The native library snappy-<version>-libsnappyjava.so for Snappy compression is included in the snappy-java-<version>.jar file. When the JVM initializes the JAR, the library is extracted to the default temp directory. If the default temp directory is mounted with the noexec option, this results in the above exception.

To solve it, specify, as a JVM option of the Denodo server, a different temp directory that is mounted without the noexec option:

 -Dorg.xerial.snappy.tempdir=/path/to/newtmp

Symptom

Error message: “Unable to find a region via the region provider chain”.

Resolution

Some Amazon S3-compatible storage systems, as well as some endpoint URL formats, prevent the client libraries used by the wrapper from adequately determining the region that the bucket to be accessed lives in.

In such cases, automatic region resolution can be overridden by means of specifying the fs.s3a.endpoint.region property in the core-site.xml configuration file:

 

<property>

  <name>fs.s3a.endpoint.region</name>

  <value>eu-west-1</value>

</property>

Symptom

Error message: “The authorization header is malformed; the region 'vpce' is wrong; expecting '<region>' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed)”.

Resolution

When using an Interface VPC Endpoint from Amazon AWS PrivateLink, the endpoint URLs to be used have a different format than the standard Amazon S3 endpoint URLs (like for example https://bucket.vpce-xxxxx.s3.eu-west-1.vpce.amazonaws.com), and the client libraries used by the wrapper will not be able to correctly determine the region to connect to.

In such cases, the region needs to be explicitly specified in the wrapper’s configuration by means of the fs.s3a.endpoint.region property in the core-site.xml file:

 

<property>

  <name>fs.s3a.endpoint.region</name>

  <value>eu-west-1</value>

</property>

Symptom

Query Timeout was reached accessing Azure Storage.

Resolution


TLS 1.3 support in Azure storage accounts may cause connections to Azure storage using SSL/TLS to fail.

Include the following JVM parameters to specify the TLS versions Virtual DataPort should allow, excluding version 1.3. For instance:

-Dhttps.protocols="TLSv1,TLSv1.1,TLSv1.2" -Djdk.tls.client.protocols="TLSv1,TLSv1.1,TLSv1.2"

Appendices

How to use the Hadoop vendor’s client libraries

In some cases, it is advisable to use the libraries of the Hadoop vendor you are connecting to (Cloudera, Hortonworks, …), instead of the Apache Hadoop libraries distributed in this custom wrapper.

To use the Hadoop vendor libraries, do not import the Distributed File System Custom Wrapper as an extension (as described in the Importing the custom wrapper into VDP section).

You have to create the custom data sources using the ‘Classpath’ parameter instead of the ‘Select Jars’ option.

Click Browse to select the directory containing the required dependencies for this custom wrapper, that is:

  • The denodo-hdfs-customwrapper-${version}.jar file of the dist directory of this custom wrapper distribution (highlighted in orange in the image below).

  • The contents of the lib directory of  this custom wrapper distribution, replacing the Apache Hadoop libraries with the vendor specific ones (highlighted in blue in the image below, the suffix indicating that they are Cloudera jars).

        

Here you can find the libraries for Cloudera and Hortonworks Hadoop distributions:

  • Hortonworks repository:

http://repo.hortonworks.com/content/repositories/releases/org/apache/hadoop/

C:\Work\denodo-hdfs-libs directory

Distributed File System Data Source

! Note

When clicking Browse, you will browse the file system of the host where the Server is running and not where the Administration Tool is running.

How to connect to MapR XD (MapR-FS)

From the MapR documentation: “MapR XD Distributed File and Object Store manages both structured and unstructured data. It is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform.”

As MapR XD supports an HDFS-compatible API, you can use the DFS Custom Wrapper to connect to the MapR FileSystem. This section explains how to do that.

! Important

Please check MapR’s Java Compatibility matrix to verify that your version of MapR supports working with the version of the JVM used by your installation of the Denodo Platform.

Note that MapR 6.1.x and previous versions do not support Java 11, which is required by Denodo 8.

Install MapR Client

To connect to the MapR cluster you need to install the MapR Client on your client machine (where the VDP server is running):

 

  • Verify that the operating system on the machine where you plan to install the MapR Client is supported, see MapR Client Support Matrix.

Set the $MAPR_HOME environment variable to the directory where the MapR Client was installed. If the MAPR_HOME environment variable is not defined, /opt/mapr is the default path.
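For example (the installation paths are illustrative), on Linux:

export MAPR_HOME=/opt/mapr

and on Windows:

set MAPR_HOME=C:\opt\mapr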

Copy mapr-clusters.conf file

Copy mapr-clusters.conf from the MapR cluster to the $MAPR_HOME/conf folder on the VDP machine. For example:

demo.mapr.com secure=true maprdemo:7222

Generate MapR ticket (secure clusters only)

Every user who wants to access a secure cluster must have a MapR ticket (maprticket_<username>) in the temporary directory (the default location).

Use the $MAPR_HOME/maprlogin command line tool to generate one:

C:\opt\mapr\bin>maprlogin.bat password -user mapr

[Password for user 'mapr' at cluster 'demo.mapr.com': ]

MapR credentials of user 'mapr' for cluster 'demo.mapr.com' are written to 'C:\Users\<username>\AppData\Local\Temp/maprticket_<username>'

! Note

If you get an error like

java.security.InvalidAlgorithmParameterException: the trustAnchors parameter must be non-empty when executing maprlogin

you need to specify a truststore before executing the maprlogin command.

For this, you can copy the /opt/mapr/ssl_truststore from the MapR cluster to the $MAPR_HOME/conf directory on the local machine.

Add JVM option

Add -Dmapr.library.flatclass to the VDP Server JVM options.

VDP Server JVM options

Otherwise, VDP will throw the exception java.lang.UnsatisfiedLinkError from JNISecurity.SetParsingDone() while executing this custom wrapper.

Create custom data source

In order to use the MapR vendor libraries you should not import the DFS Custom Wrapper into Denodo.

You have to create the custom data source using the ‘Classpath’ parameter instead of the ‘Select Jars’ option. Click Browse to select the directory containing the required dependencies for this custom wrapper:

  • The denodo-hdfs-customwrapper-${version}.jar file of the dist directory of this custom wrapper distribution.

  • The contents of the lib directory of  this custom wrapper distribution, replacing the Apache Hadoop libraries with the MapR ones.

The MapR Maven repository is located at http://repository.mapr.com/maven/. The names of the JAR files that you must use contain the versions of Hadoop, Zookeeper and MapR that you are using:

  • hadoop-xxx-<hadoop_version>-<mapr_version>
  • maprfs-<mapr_version>

As the MapRClient native library is bundled in the maprfs-<mapr_version> jar, you should use the maprfs jar that comes with the previously installed MapR Client, as the library is operating-system dependent.

  • zookeeper-<zookeeper_version>-<mapr_version>
  • json-<version>
  • The other dependencies from the lib directory of this custom wrapper distribution

! Important

The MapR native library is included in these custom wrapper dependencies and can be loaded only once.

Therefore, if you plan to access other MapR sources with Denodo, such as:

  • MapR Database with the HBase Custom Wrapper
  • MapR Event Store with the Kafka Custom Wrapper
  • Drill with the JDBC Wrapper,

you have to use the same classpath to configure all the custom wrappers and the JDBC driver; see 'C:\Work\MapR Certification\mapr-lib' in the image above.

With this configuration Denodo can reuse the same classloader and load the native library only once.

Configure data source and base view

Configure the DFS wrapper parameters as usual:

MapR data source edition

MapR base view edition