Warning |
Although this wrapper is capable of reading files stored in HDFS, S3, Azure Blob Storage, Azure Data Lake Storage and Google Cloud Storage, most of the technical artifacts of this wrapper for Denodo 6.0 and Denodo 7.0 include HDFS in their names for legacy compatibility:
As of Denodo 8.0, the artifact names have been rebranded:
|
The Distributed File System Custom Wrapper distribution contains Virtual DataPort custom wrappers capable of reading several file formats stored in HDFS, S3, Azure Data Lake Storage, Azure Blob Storage, Azure Data Lake Storage Gen 2 and Google Cloud Storage.
Supported formats are:
Also, there is a custom wrapper to retrieve information from the distributed file system and display it in a relational way:
The Distributed File System Custom Wrapper distribution consists of:
In order to use the Distributed File System Custom Wrapper in VDP, you must add it as an extension using the Admin Tool.
From the Distributed File System Custom Wrapper distribution, select the denodo-(h)dfs-customwrapper-${version}-jar-with-dependencies.jar file and upload it to VDP. No other jars are required, as this one already contains all the required dependencies.
Important |
As the jar-with-dependencies version of this wrapper contains the Hadoop client libraries, increasing the JVM heap space of the VDP Admin Tool is required to avoid a Java heap space error when uploading the jar to VDP. |
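For example, the maximum heap size of the Administration Tool JVM can be raised with the standard -Xmx option (the exact mechanism, such as the JVM options dialog in the Denodo Control Center, and the appropriate value depend on your installation; the figure below is only illustrative):
-Xmx1024m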
Distributed File System extension in VDP
Once the custom wrapper jar file has been uploaded to VDP, you can create new data sources for this custom wrapper --and their corresponding base views-- as usual.
Go to New → Data Source → Custom and specify one of the possible wrappers:
Also check ‘Select Jars’ and choose the jar file of the custom wrapper.
Distributed File System Data Source
Depending on the selected wrapper you will have different input parameters. To update the parameters, you must press the refresh button.
Custom wrapper for reading delimited text files.
Delimited text files store plain text and each line has values separated by a delimiter, such as tab, space, comma, etc.
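For instance, a minimal comma-delimited file could look like this (hypothetical content):
id,name,amount
1,ACME,1250.50
2,Initech,300.00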
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb[s]://<container>\@<account>.blob.core.windows.net
For configuring the credentials see Azure Blob Storage section.
abfs[s]://<filesystem>\@<account>.dfs.core.windows.net
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @, you have to enter \@. |
(H)DFSDelimitedTextFileWrapper data source edition
Once the custom wrapper data source has been registered, VDP will ask you to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. invoice_jan.csv, invoice_feb.csv, …, set the File name pattern to (.*)invoice_(.*)\\.csv (notice that the regular expression is escaped as explained in the note below). The files in these directories would then be processed by the wrapper:
Note that the File name pattern value takes into account the full path of the file, so in the above example, the pattern invoice_(.*)\\.csv will not find the sample results, as the full path starts with "/accounting...", not "invoice...".
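For illustration (hypothetical paths), the pattern is matched against the full path of each file:
(.*)invoice_(.*)\\.csv matches /accounting/2018/invoice_jan.csv
invoice_(.*)\\.csv does not match it, because the full path does not start with "invoice"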
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \, you have to enter \\. |
Some “invisible” characters have to be entered in a special way:
Character | Meaning |
\t | Tab |
\f | Formfeed |
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the separator is the tab character \t, you have to enter \\t. |
! Note |
When a separator longer than one character is used, compatibility with the standard comma-separated-value format cannot be kept, and therefore the parameters Quote, Comment Marker, Escape, Null value and Ignore Spaces are not supported. |
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the null value is \N, you have to enter \\N. |
If you clear this check box, the wrapper will return an error if there is a row that does not have the expected structure. If you select it, you can check in the execution trace whether the wrapper has ignored any rows in a query, in the attribute “Number of invalid rows”.
(H)DFSDelimitedTextFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file or group of files, if the Path input parameter denotes a directory.
View results
Custom wrapper for reading Parquet files.
Parquet is a column-oriented data store of the Hadoop ecosystem. It provides efficient per-column data compression and encoding schemes.
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb[s]://<container>\@<account>.blob.core.windows.net
For configuring the credentials see Azure Blob Storage section.
abfs[s]://<filesystem>\@<account>.dfs.core.windows.net
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @, you have to enter \@. |
(H)DFSParquetFileWrapper data source edition
Once the custom wrapper data source has been registered, VDP will ask you to create a base view for it. Its base views need the following parameters:
! Note |
A directory can be configured in the "Parquet File Path" parameter. |
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. flights_jan.parquet, flights_feb.parquet, …, set the File name pattern to (.*)flights_(.*)\\.parquet (notice that the regular expression is escaped as explained in the note below). The files in these directories would then be processed by the wrapper:
Note that the File name pattern value takes into account the full path of the file, so in the above example, the pattern flights_(.*)\\.parquet will not find the sample results, as the full path starts with "/airport...", not "flights...".
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \, you have to enter \\. |
Note that this kind of parallelism is supported only since Denodo 7.0, due to limitations in the Parquet libraries used by prior versions.
! Note |
When multiple instances of views from this data source are executed at the same time (this can be the result of executing an operation like join, union... or simply the simultaneous execution of many instances of the same view), it is important to take into account the size of the thread pool defined in the data source as well as the level of parallelism defined in each view to be executed simultaneously. |
(H)DFSParquetFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file.
View results
Denodo will read only the selected columns from the Parquet file, avoiding reading columns unnecessarily.
Denodo will evaluate filtering predicates in the query against metadata stored in the Parquet files. This avoids reading large chunks of data, improving query performance.
and operators:
This optimization is possible when the Parquet dataset is split across multiple directories, with each value of the partition column stored in a subdirectory, e.g. /requestsPartitioned/responseCode=500/
Denodo can omit large amounts of I/O when the partition column is referenced in the WHERE clause.
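For instance, assuming a base view requests_parquet created over the partitioned dataset above (the view and column names are hypothetical), a query such as the following would only read the files under the responseCode=500 subdirectory:
SELECT count(*) FROM requests_parquet WHERE responsecode = 500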
Custom wrapper for reading Parquet files in S3.
Parquet is a column-oriented data store that provides efficient per-column data compression and encoding schemes.
This wrapper has the same behaviour as (H)DFSParquetFileWrapper but it accesses S3 exclusively, and it is much easier to configure.
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @, you have to enter \@. |
The Endpoint parameter is mandatory when accessing S3-compatible storage, that is, storage other than Amazon S3.
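For instance, a hypothetical endpoint value for an on-premises S3-compatible store could look like:
https://s3-storage.internal.example.com:9000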
S3ParquetFileWrapper data source edition
You have different options to connect with an S3 bucket:
If you need to configure any other parameter of the S3 connection you can use a Custom core-site.xml file, as explained in the S3 section.
Once the custom wrapper data source has been registered, VDP will ask you to create a base view for it. Its base views need the following parameters:
! Note |
A directory can be configured in the "Parquet File Path" parameter. |
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. flights_jan.parquet, flights_feb.parquet, …, set the File name pattern to (.*)flights_(.*)\\.parquet (notice that the regular expression is escaped as explained in the note below). The files in these directories would then be processed by the wrapper:
Note that the File name pattern value takes into account the full path of the file, so in the above example, the pattern flights_(.*)\\.parquet will not find the sample results, as the full path starts with "/airport...", not "flights...".
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \, you have to enter \\. |
Note that this kind of parallelism is supported only since Denodo 7.0, due to limitations in the Parquet libraries used by prior versions.
! Note |
When multiple instances of views from this data source are executed at the same time (this can be the result of executing an operation like join, union... or simply the simultaneous execution of many instances of the same view), it is important to take into account the size of the thread pool defined in the data source as well as the level of parallelism defined in each view to be executed simultaneously. |
S3ParquetFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file.
View results
Denodo will read only the selected columns from the Parquet file, avoiding reading columns unnecessarily.
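As a sketch of this behavior (the view and column names are hypothetical), a query that selects only two fields allows the wrapper to skip reading the remaining Parquet columns entirely:
SELECT origin, dest FROM flights_parquet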
Denodo will evaluate filtering predicates in the query against metadata stored in the Parquet files. This avoids reading large chunks of data, improving query performance.
and operators:
This optimization is possible when the Parquet dataset is split across multiple directories, with each value of the partition column stored in a subdirectory, e.g. /requestsPartitioned/responseCode=500/
Denodo can omit large amounts of I/O when the partition column is referenced in the WHERE clause.
Custom wrapper for reading Avro files.
Important |
We recommend not using the (H)DFSAvroFileWrapper to directly access Avro files, as Avro is a serialization system mainly meant for use by applications running on the Hadoop cluster. Instead, we recommend using an abstraction layer on top of those files, such as Hive, Impala, Spark... |
Avro is a row-based storage format for Hadoop which is widely used as a serialization platform. Avro stores the data definition (schema) in JSON format, making it easy for any program to read and interpret. The data itself is stored in binary format, making it compact and efficient.
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb[s]://<container>\@<account>.blob.core.windows.net
For configuring the credentials see Azure Blob Storage section.
abfs[s]://<filesystem>\@<account>.dfs.core.windows.net
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @, you have to enter \@. |
(H)DFSAvroFileWrapper data source edition
Once the custom wrapper data source has been registered, VDP will ask you to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. employees_jan.avro, employees_feb.avro, …, set the File name pattern to (.*)employees_(.*)\\.avro (notice that the regular expression is escaped as explained in the note below). The files in these directories would then be processed by the wrapper:
Note that the File name pattern value takes into account the full path of the file, so in the above example, the pattern employees_(.*)\\.avro will not find the sample results, as the full path starts with "/hr...", not "employees...".
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \, you have to enter \\. |
There are also two parameters that are mutually exclusive:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, } in the Avro schema JSON parameter, you have to escape these characters with \. For example: \{ "type": "map", "values": \{ "type": "record", "name": "ATM", "fields": [ \{ "name": "serial_no", "type": "string" \}, \{ "name": "location", "type": "string" \} ]\} \} |
(H)DFSAvroFileWrapper base view edition
Content of the /user/cloudera/schema.avsc file:
{"type" : "record", "name" : "Doc", "doc" : "adoc", "fields" : [ { "name" : "id", "type" : "string" }, { "name" : "user_friends_count", "type" : [ "int", "null" ] }, { "name" : "user_location", "type" : [ "string", "null" ] }, { "name" : "user_description", "type" : [ "string", "null" ] }, { "name" : "user_statuses_count", "type" : [ "int", "null" ] }, { "name" : "user_followers_count", "type" : [ "int", "null" ] }, { "name" : "user_name", "type" : [ "string", "null" ] }, { "name" : "user_screen_name", "type" : [ "string", "null" ] }, { "name" : "created_at", "type" : [ "string", "null" ] }, { "name" : "text", "type" : [ "string", "null" ] }, { "name" : "retweet_count", "type" : [ "int", "null" ] }, { "name" : "retweeted", "type" : [ "boolean", "null" ] }, { "name" : "in_reply_to_user_id", "type" : [ "long", "null" ] }, { "name" : "source", "type" : [ "string", "null" ] }, { "name" : "in_reply_to_status_id", "type" : [ "long", "null" ] }, { "name" : "media_url_https", "type" : [ "string", "null" ] }, { "name" : "expanded_url", "type" : [ "string", "null" ] } ] } |
View schema
The execution of the view returns the values contained in the Avro file specified in
the WHERE clause of the VQL sentence:
SELECT * FROM avro_ds_file WHERE avrofilepath = '/user/cloudera/file.avro' |
View results
After applying a flattening operation, the results are as follows.
Flattened results
The recommended way for dealing with projections in (H)DFSAvroFileWrapper is by means of the JSON schema parameters:
By giving the wrapper a JSON schema containing only the fields you are interested in, the reader used by the (H)DFSAvroFileWrapper will return only these fields to VDP, making the select operation faster.
If you configure the parameter Avro schema JSON with only some of the fields of the /user/cloudera/schema.avsc file used in the previous example, like in the example below (notice the escaped characters):
Schema with the selected fields:
\{ "type" : "record", "name" : "Doc", "doc" : "adoc", "fields" : [ \{ "name" : "id", "type" : "string" \}, \{ "name" : "user_friends_count", "type" : [ "int", "null" ] \}, \{ "name" : "user_location", "type" : [ "string", "null" ] \}, \{ "name" : "user_followers_count", "type" : [ "int", "null" ] \}, \{ "name" : "user_name", "type" : [ "string", "null" ] \}, \{ "name" : "created_at", "type" : [ "string", "null" ] \} ] \} |
the base view in VDP will contain a subset of the fields of the previous base view in the example: the ones matching the new JSON schema provided to the wrapper.
Base view with the selected fields
View results with the selected fields
Custom wrapper for reading sequence files.
Sequence files are binary record-oriented files, where each record has a serialized key and a serialized value.
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb[s]://<container>\@<account>.blob.core.windows.net
For configuring the credentials see Azure Blob Storage section.
abfs[s]://<filesystem>\@<account>.dfs.core.windows.net
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \.
E.g. if the URI contains @, you have to enter \@. |
(H)DFSSequenceFileWrapper data source edition
Once the custom wrapper data source has been registered, VDP will ask you to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. file_1555297166.seq, file_1555300766.seq, …, set the File name pattern to (.*)file_(.*)\\.seq (notice that the regular expression is escaped as explained in the note below). The files in these directories would then be processed by the wrapper:
Note that the File name pattern value takes into account the full path of the file, so in the above example, the pattern file_(.*)\\.seq will not find the sample results, as the full path starts with "/result...", not "file...".
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \, you have to enter \\. |
(H)DFSSequenceFileWrapper base view edition
View schema
The execution of the wrapper returns the key/value pairs contained in the file or group of files, if the Path input parameter denotes a directory.
View results
Custom wrapper for reading map files.
A map file is a directory containing two sequence files. The data file (/data) is identical to a sequence file and contains the data stored as binary key/value pairs. The index file (/index) contains a key/value map with seek positions inside the data file to quickly access the data.
Map file format
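For instance, a map file named ratings.map (hypothetical path) is laid out on disk as:
/user/cloudera/ratings.map/data
/user/cloudera/ratings.map/index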
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb[s]://<container>\@<account>.blob.core.windows.net
For configuring the credentials see Azure Blob Storage section.
abfs[s]://<filesystem>\@<account>.dfs.core.windows.net
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @, you have to enter \@. |
(H)DFSMapFileWrapper data source edition
Once the custom wrapper data source has been registered, VDP will ask you to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. invoice_jan.whatever, invoice_feb.whatever, …, set the File name pattern to (.*)invoice_(.*)\\.whatever (notice that the regular expression is escaped as explained in the note below). The files in these directories would then be processed by the wrapper:
Note that the File name pattern value takes into account the full path of the file, so in the above example, the pattern invoice_(.*)\\.whatever will not find the sample results, as the full path starts with "/accounting...", not "invoice...".
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \, you have to enter \\. |
org.apache.hadoop.io.WritableComparable interface. WritableComparable is used because records are sorted in key order.
org.apache.hadoop.io.Writable interface.
(H)DFSMapFileWrapper base view edition
View schema
The execution of the wrapper returns the key/value pairs contained in the file or group of files, if the Path input parameter denotes a directory.
View results
Warning |
WebHDFSFileWrapper is deprecated.
Instead, use HDFSAvroFileWrapper, HDFSSequenceFileWrapper, HDFSMapFileWrapper or HDFSParquetFileWrapper with the webhdfs scheme in their File System URI parameter, placing the credentials in the XML configuration files. |
Custom wrapper to retrieve file information from a distributed file system.
This custom wrapper data source needs the following parameters:
<bucket> cannot contain underscores, see S3 naming conventions.
For configuring the credentials see S3 section.
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb[s]://<container>\@<account>.blob.core.windows.net
For configuring the credentials see Azure Blob Storage section.
abfs[s]://<filesystem>\@<account>.dfs.core.windows.net
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Cloud Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @, you have to enter \@. |
DFSListFilesWrapper data source edition
The entry point for querying the wrapper is the parameter parentfolder. The wrapper will list the files that are located in the supplied directory. It is possible to do this recursively, also retrieving the contents of the subfolders, by setting the parameter recursive to true.
Execution panel
The schema of the custom wrapper contains the following columns:
View schema
The following VQL sentence returns the files in the ‘/user/cloudera’ HDFS directory, recursively:
SELECT * FROM listing_dfs
WHERE parentfolder = '/user/cloudera' AND recursive = true
View results
You can filter the query a bit more and retrieve only those files that were modified after '2018-09-01':
SELECT * FROM listing_dfs
WHERE parentfolder = '/user/cloudera' AND recursive = true
AND datemodified > DATE '2018-09-01'
View results
The wrappers of this distribution that read file formats like delimited files, Parquet, Avro, Sequence or Map can increase their capabilities when combined with the DFSListFilesWrapper.
As all of these wrappers need an input path for the file or the directory that is going to be read, you can use the DFSListFilesWrapper for retrieving the file paths that you are interested in, according to some attribute value of their metadata, e.g. modification time.
For example, suppose that you want to retrieve the files in the /user/cloudera/df/awards directory that were modified in November.
The following steps explain how to configure this scenario:
Parameterize the Path of the base view by adding an interpolation variable to its value, e.g. @path (@ is the prefix that identifies a parameter value as an interpolation variable).
By using the variable @path, you do not have to provide the final path value when creating the base view. Instead, the values of the Path parameter will be provided at runtime by the DFSListFilesWrapper view through the join operation (configured in the next step).
DFSListFilesWrapper.pathwithoutscheme = HDFSDelimitedTextFileWrapper.path
SELECT * FROM join:view
WHERE recursive = true
AND parentfolder = '/user/cloudera/df/awards'
AND datemodified > DATE '2018-11-1'
you obtain data only from the delimited files that were modified in November.
The Distributed File System Custom Wrapper can access data stored in S3 with the following Hadoop FileSystem clients:
Compatible with files created by the older s3n:// client and Amazon EMR’s s3:// client.
S3A supports several authentication mechanisms. By default the custom wrapper will search for credentials in the following order:
For using this authentication method, declare the credentials (access and secret keys) in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.access.key</name> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <value>YOUR SECRET ACCESS KEY</value> </property> </configuration> |
The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are looked for.
An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs. This mechanism is available only when running your application on an Amazon EC2 instance and there is an IAM role associated with the instance, but provides the greatest ease of use and best security when working with Amazon EC2 instances.
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
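As a sketch (the keystore path and alias are hypothetical), a secret can be added to a JCEKS keystore with the Hadoop credential command and then referenced from the Custom core-site.xml through the hadoop.security.credential.provider.path property:
hadoop credential create fs.s3a.secret.key -provider jceks://file/opt/denodo/conf/s3.jceks
<property> <name>hadoop.security.credential.provider.path</name> <value>jceks://file/opt/denodo/conf/s3.jceks</value> </property>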
Temporary Security Credentials can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.
To authenticate with these:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.aws.credentials.provider</name> <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider </value> </property> <property> <name>fs.s3a.access.key</name> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <value>YOUR SECRET ACCESS KEY</value> </property> <property> <name>fs.s3a.session.token </name> <value>SECRET-SESSION-TOKEN</value> </property> </configuration> |
The lifetime of session credentials is fixed when the credentials are issued; once they expire, the application will no longer be able to authenticate to AWS, so you must get a new set of credentials.
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
To use assumed roles, the wrapper must be configured to use the Assumed Role Credential Provider, org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider, in
the configuration option fs.s3a.aws.credentials.provider in the wrapper configuration file Custom core-site.xml.
This Assumed Role Credential provider will read in the fs.s3a.assumed.role.* options needed to connect to the Session Token Service Assumed Role API:
fs.s3a.access.key and fs.s3a.secret.key pair, environment variables, or some other supplier of long-lived secrets.
If you wish to use a different authentication mechanism, other than
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, set it in the property fs.s3a.assumed.role.credentials.provider.
Below you can see the properties required for configuring IAM Assumed Roles in this custom wrapper, using its configuration file, Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.aws.credentials.provider</name> <value>org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider</value> <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value> </property> <property> <name>fs.s3a.assumed.role.arn</name> <value>YOUR AWS ROLE</value> <description> AWS ARN for the role to be assumed. Required if the fs.s3a.aws.credentials.provider contains org.apache.hadoop.fs.s3a.AssumedRoleCredentialProvider </description> </property> <property> <name>fs.s3a.assumed.role.credentials.provider</name> <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value> <description> List of credential providers to authenticate with the STS endpoint and retrieve short-lived role credentials. Only used if AssumedRoleCredentialProvider is the AWS credential Provider. If unset, uses "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider". </description> </property> <property> <name>fs.s3a.assumed.role.session.duration</name> <value>30m</value> <description> Duration of assumed roles before a refresh is attempted. Only used if AssumedRoleCredentialProvider is the AWS credential Provider. Range: 15m to 1h </description> </property> <property> <name>fs.s3a.access.key</name> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <value>YOUR SECRET ACCESS KEY</value> </property> </configuration> |
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
Bucket names cannot contain underscores, see S3 naming conventions.
The property fs.s3a.endpoint is mandatory when accessing S3-compatible storage, that is, storage other than Amazon S3.
The property fs.s3a.path.style.access may be mandatory, depending on whether virtual-host-style or path-style addressing is being used (by default, virtual-host-style addressing is enabled):
Also note in the configuration below that SSL can be enabled or disabled. If it is enabled, the Denodo server has to be configured to validate the SSL certificate.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.endpoint</name> <value>IP ADDRESS TO CONNECT TO</value> </property> <property> <name>fs.s3a.path.style.access</name> <value>true/false</value> <description>Enables S3 path style access that is disabling the default virtual hosting behavior(default: false)</description> </property> <property> <name>fs.s3a.access.key</name> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <value>YOUR SECRET ACCESS KEY</value> </property> <property> <name>fs.s3a.connection.ssl.enabled</name> <value>true/false</value> <description>Enables or disables SSL connections to S3 (default: true)</description> </property> </configuration> |
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
Connecting to AWS regions that only support V4 of the AWS Signature protocol (those created since January 2014) will require the explicit region endpoint URL to be specified. This is done in the configuration option fs.s3a.endpoint in the Custom core-site.xml parameter of the wrapper, or at the corresponding data source configuration input when using an S3-specific wrapper implementation. You can use the core-site.xml, located in the conf folder of the distribution, as a guide. Otherwise a Bad Request exception could be thrown.
As an example of configuration, the endpoint for S3 Frankfurt is
s3.eu-central-1.amazonaws.com:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.endpoint</name> <value>s3.eu-central-1.amazonaws.com</value> <description>AWS S3 endpoint to connect to. An up-to-date list is provided in the AWS documentation. Without this property, the standard region (s3.amazonaws.com) is assumed.</description> </property> </configuration> |
You can find the full list of supported versions for the AWS Regions on their website: Amazon Simple Storage Service (Amazon S3).
When connecting to Amazon S3 via a VPC Interface Endpoint, the endpoint URL needs to be explicitly specified as explained above, and it will have a format such as https://bucket.vpce-xxxxx.s3.eu-west-1.vpce.amazonaws.com. In these cases, besides the endpoint URL, the region to connect to also needs to be explicitly specified by means of the fs.s3a.endpoint.region property in the core-site.xml configuration file:
<property> <name>fs.s3a.endpoint.region</name> <value>eu-west-1</value> </property> |
Also, if you want to assume a role, it is necessary to define an STS VPC Interface endpoint for this service and to specify the fs.s3a.assumed.role.sts.endpoint and fs.s3a.assumed.role.sts.endpoint.region properties in the core-site.xml configuration file:
<property> <name>fs.s3a.assumed.role.sts.endpoint</name> <value>vpce-xxxxx.sts.us-east-1.vpce.amazonaws.com</value> </property> <property> <name>fs.s3a.assumed.role.sts.endpoint.region</name> <value>us-east-1</value> </property> |
The Distributed File System Custom Wrapper can access data stored in Azure Data Lake Storage.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.adl.oauth2.access.token.provider.type</name> <value>ClientCredential</value> </property> <property> <name>fs.adl.oauth2.refresh.url</name> <value>YOUR TOKEN ENDPOINT</value> </property> <property> <name>fs.adl.oauth2.client.id</name> <value>YOUR CLIENT ID</value> </property> <property> <name>fs.adl.oauth2.credential</name> <value>YOUR CLIENT SECRET</value> </property> </configuration> |
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
The Distributed File System Custom Wrapper can access data stored in Azure Blob Storage.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.azure.account.key.<account>.blob.core.windows.net</name> <value>YOUR ACCESS KEY</value> </property> </configuration> |
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
Since the Distributed File System Custom Wrapper for Denodo 7.0 (as this functionality requires Java 8), this wrapper can access data stored in Azure Data Lake Storage Gen 2.
By default, ADLS Gen2 uses TLS with both abfs:// and abfss://. When you set the property fs.azure.always.use.https=false, TLS is disabled for abfs://, while it remains enabled for abfss://.
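For instance, to disable TLS for abfs:// URIs (a sketch; usually only advisable inside a trusted network), the following property could be added to the Custom core-site.xml:
<property> <name>fs.azure.always.use.https</name> <value>false</value> </property>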
To configure the authentication properties place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
You can choose between these two authentication methods:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.azure.account.auth.type</name> <value>OAuth</value> </property> <property> <name>fs.azure.account.oauth.provider.type</name> <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value> </property> <property> <name>fs.azure.account.oauth2.client.endpoint</name> <value>URL of OAuth endpoint</value> </property> <property> <name>fs.azure.account.oauth2.client.id</name> <value>CLIENT-ID</value> </property> <property> <name>fs.azure.account.oauth2.client.secret</name> <value>SECRET</value> </property> </configuration> |
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.azure.account.key.<account>.dfs.core.windows.net</name> <value>YOUR ACCOUNT KEY</value> </property> </configuration> |
Note that all secrets can be stored in JCEKS files. These are encrypted and password protected files. For more information see Hadoop CredentialProvider Guide.
Since the Distributed File System Custom Wrapper for Denodo 7.0 (as this functionality requires Java 8), this wrapper can access data stored in Google Cloud Storage.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>
<property> <name>google.cloud.auth.service.account.enable</name> <value>true</value> <description>Whether to use a service account for GCS authorization. If an email and keyfile are provided then that service account will be used. Otherwise the connector will look to see if it running On a GCE VM with some level of GCS access in its service account scope, and use that service account.</description> </property> <property> <name>google.cloud.auth.service.account.json.keyfile</name> <value>/PATH/TO/KEYFILE</value> <description>The JSON key file of the service account used for GCS access when google.cloud.auth.service.account.enable is true.</description> </property> </configuration> |
Wrappers that read file contents from Google Cloud Storage, like (H)DFSDelimitedTextFileWrapper, (H)DFSAvroFileWrapper, etc., require the storage.objects.get permission.
The DFSListFilesWrapper, as it lists files from buckets, requires the storage.buckets.get permission.
For more information on roles and permissions see https://cloud.google.com/storage/docs/access-control/iam-roles.
The Distributed File System Custom Wrapper transparently reads compressed files in any of these compression formats:
The configuration required for accessing a Hadoop cluster with Kerberos enabled is the same as the one needed to access the distributed file system and, additionally, the user must supply the Kerberos credentials.
The Kerberos parameters are:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the Kerberos principal name contains @, you have to enter \@. |
The Distributed File System Custom Wrapper provides three ways for accessing a kerberized Hadoop cluster:
In this case only the Kerberos enabled parameter should be checked. The wrapper would use the Kerberos ticket to authenticate itself against the Hadoop cluster.
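For example, in this scenario a ticket could be obtained beforehand with kinit on the machine where the VDP server runs (hypothetical principal name):
kinit vdpuser@EXAMPLE.COM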
In all three of these scenarios, the krb5.conf file should be present in the file system. Below is an example of the Kerberos configuration file:
[libdefaults] renew_lifetime = 7d forwardable = true default_realm = EXAMPLE.COM ticket_lifetime = 24h dns_lookup_realm = false dns_lookup_kdc = false [domain_realm] sandbox.hortonworks.com = EXAMPLE.COM cloudera = CLOUDERA [realms] EXAMPLE.COM = { admin_server = sandbox.hortonworks.com kdc = sandbox.hortonworks.com } CLOUDERA = { kdc = quickstart.cloudera admin_server = quickstart.cloudera max_renewable_life = 7d 0h 0m 0s default_principal_flags = +renewable } [logging] default = FILE:/var/log/krb5kdc.log admin_server = FILE:/var/log/kadmind.log kdc = FILE:/var/log/krb5kdc.log |
The algorithm to locate the krb5.conf file is the following:
There is an exception: if you are planning to create VDP views that use the same Key Distribution Center and the same realm, the Kerberos Distribution Center parameter can be provided instead of having the krb5.conf file in the file system.
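When the krb5.conf file is used but is not picked up from a default location, a common approach (an assumption, not a requirement of this wrapper) is to point the Denodo server JVM at it explicitly:
-Djava.security.krb5.conf=/path/to/krb5.conf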
Data source edition
Symptom
Error message: “org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block”.
Resolution
Add this property to the Custom hdfs-site.xml file:
<property> <name>dfs.client.use.datanode.hostname</name> <value>true</value> </property> |
Symptom
Error message: “SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]”.
Resolution
You are trying to connect to a Kerberos-enabled Hadoop cluster. You should configure the custom wrapper accordingly. See Secure cluster with Kerberos section for configuring Kerberos on this custom wrapper.
Symptom
Error message: “Cannot get Kerberos service ticket: KrbException: Server not found in Kerberos database (7) ”.
Resolution
Check that nslookup is returning the fully qualified hostname of the KDC. If not, modify the /etc/hosts file of the client machine so that the KDC entry is of the form "IP address fully.qualified.hostname alias".
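For example (hypothetical IP address and hostname):
192.168.1.10 kdc.example.com kdc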
Symptom
Error message: “This authentication mechanism is no longer supported".
Resolution
The method of placing credentials in the URL, s3a://<id>:<secret>@<bucket>, is discouraged. Configure the credentials in the Custom core-site.xml instead (see the S3 section).
Symptom
Error message: “Could not initialize class org.xerial.snappy.Snappy”
Resolution
On Linux platforms, an error may occur when Snappy compression/decompression is enabled, even though its native library is available on the classpath.
The native library snappy-<version>-libsnappyjava.so for Snappy compression is included in the snappy-java-<version>.jar file. When the JVM initializes the JAR, the library is added to the default temp directory. If the default temp directory is mounted with the noexec option, it results in the above exception.
To solve it, you have to specify, as a JVM option of the Denodo server, a different temp directory that is mounted without the noexec option:
-Dorg.xerial.snappy.tempdir=/path/to/newtmp |
Symptom
Error message: “Unable to find a region via the region provider chain”
Resolution
Some Amazon S3-compatible storage systems, as well as some endpoint URL formats, prevent the client libraries used by the wrapper from adequately determining the region that the bucket to be accessed lives in.
In such cases, automatic region resolution can be overridden by means of specifying the fs.s3a.endpoint.region property in the core-site.xml configuration file:
<property> <name>fs.s3a.endpoint.region</name> <value>eu-west-1</value> </property> |
Symptom
Error message: “The authorization header is malformed; the region 'vpce' is wrong; expecting '<region>' (Service: Amazon S3; Status Code: 400; Error Code: AuthorizationHeaderMalformed)”
Resolution
When using an Interface VPC Endpoint from Amazon AWS PrivateLink, the endpoint URLs to be used have a different format than the standard Amazon S3 endpoint URLs (like for example https://bucket.vpce-xxxxx.s3.eu-west-1.vpce.amazonaws.com), and the client libraries used by the wrapper will not be able to correctly determine the region to connect to.
In such cases, the region needs to be explicitly specified in the wrapper’s configuration by means of the fs.s3a.endpoint.region property in the core-site.xml file:
<property> <name>fs.s3a.endpoint.region</name> <value>eu-west-1</value> </property> |
In some cases, it is advisable to use the libraries of the Hadoop vendor you are connecting to (Cloudera, Hortonworks, …), instead of the Apache Hadoop libraries distributed in this custom wrapper.
In order to use the Hadoop vendor libraries, there is no need to import the Distributed File System Custom Wrapper as an extension as explained in the Importing the custom wrapper into VDP section.
You have to create the custom data sources using the ‘Classpath’ parameter instead of the ‘Select Jars’ option.
Click Browse to select the directory containing the required dependencies for this custom wrapper, that is:
Here you can find the libraries for Cloudera and Hortonworks Hadoop distributions:
http://repo.hortonworks.com/content/repositories/releases/org/apache/hadoop/
C:\Work\denodo-hdfs-libs directory
Distributed File System Data Source
! Note |
When clicking Browse, you will browse the file system of the host where the Server is running and not where the Administration Tool is running. |
From MapR documentation: “MapR XD Distributed File and Object Store manages both structured and unstructured data. It is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform.”
As MapR XD supports an HDFS-compatible API, you can use the DFS Custom Wrapper to connect to the MapR FileSystem. This section explains how to do that.
! Important |
Please check MapR’s Java Compatibility matrix to verify that your version of MapR supports working with the version of the JVM used by your installation of the Denodo Platform. Note that MapR 6.1.x and previous versions do not support Java 11, which is required by Denodo 8. |
To connect to the MapR cluster you need to install the MapR Client on your client machine (where the VDP server is running):
Set the $MAPR_HOME environment variable to the directory where the MapR client was installed. If the MAPR_HOME environment variable is not defined, /opt/mapr is the default path.
Copy mapr-clusters.conf from the MapR cluster to the $MAPR_HOME/conf folder in the VDP machine.
demo.mapr.com secure=true maprdemo:7222 |
Every user who wants to access a secure cluster must have a MapR ticket (maprticket_<username>) in the temporary directory (the default location).
Use the $MAPR_HOME/maprlogin command line tool to generate one:
C:\opt\mapr\bin>maprlogin.bat password -user mapr [Password for user 'mapr' at cluster 'demo.mapr.com': ] MapR credentials of user 'mapr' for cluster 'demo.mapr.com' are written to 'C:\Users\<username>\AppData\Local\Temp/maprticket_<username>' |
Add -Dmapr.library.flatclass to the VDP Server JVM options.
VDP Server JVM options
Otherwise, VDP will throw the exception java.lang.UnsatisfiedLinkError from JNISecurity.SetParsingDone() while executing the custom wrapper.
In order to use the MapR vendor libraries you should not import the DFS Custom Wrapper into Denodo.
You have to create the custom data source using the ‘Classpath’ parameter instead of the ‘Select Jars’ option. Click Browse to select the directory containing the required dependencies for this custom wrapper:
The MapR Maven repository is located at http://repository.mapr.com/maven/. The name of the JAR files that you must use contains the version of Hadoop, Kafka, Zookeeper and MapR that you are using:
As the MapRClient native library is bundled in the maprfs-<mapr_version> jar, you should use the maprfs jar that comes with the previously installed MapR Client, as the library is dependent on the operating system.
! Important |
MapR native library is included in these Custom Wrapper dependencies and can be loaded only once. Therefore, if you plan to access other MapR sources with Denodo, like:
you have to use the same classpath to configure all the custom wrappers and the JDBC driver; see 'C:\Work\MapR Certification\mapr-lib' in the image above. With this configuration Denodo can reuse the same classloader and load the native library only once. |
Configure the DFS wrapper parameters as usual:
MapR data source edition
MapR base view edition