Warning |
Although this wrapper is capable of reading files stored in HDFS, Amazon S3, Azure Blob Storage, Azure Data Lake Storage and Google Cloud Storage, most of the technical artifacts of this wrapper include HDFS in their names for legacy compatibility.
|
The Distributed File System Custom Wrapper distribution contains five Virtual DataPort custom wrappers capable of reading several file formats stored in HDFS, Amazon S3, Azure Data Lake Storage, Azure Blob Storage, Azure Data Lake Storage Gen 2 and Google Cloud Storage.
Supported formats are:
Also, there is a custom wrapper to retrieve information from the distributed file system and display it in a relational way:
This wrapper allows inspecting distributed folders, retrieving lists of files (from a single folder or recursively) and filtering files using any part of their metadata (file name, file size, last modification date, etc.).
Delimited text files store plain text and each line has values separated by a delimiter, such as tab, space, comma, etc.
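For instance, a comma-delimited file could look like the following (hypothetical sample data):

id,name,amount
1,Smith,23.50
2,Jones,17.80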
Sequence files are binary record-oriented files, where each record has a serialized key and a serialized value.
A map is a directory containing two sequence files. The data file (/data) is identical to the sequence file and contains the data stored as binary key/value pairs. The index file (/index) contains a key/value map with seek positions inside the data file to quickly access the data.
Map file format
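As an illustration (hypothetical map name), the on-disk layout of a map file is simply a directory with these two sequence files:

mymap/data (the key/value records, sorted by key)
mymap/index (a fraction of the keys and their seek positions inside /data)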
Avro data files are self-describing, containing the full schema for the data in the file. An Avro schema is defined using JSON. The schema allows you to define two types of data:
Avro schema:
{ "namespace": "example.avro", "type": "record", "name": "User", "fields": [ {"name": "name", "type": "string"}, {"name": "favorite_number", "type": ["int", "null"]}, {"name": "favorite_color", "type": ["string", "null"]} ] } |
Parquet is a column-oriented data store of the Hadoop ecosystem. It provides data compression on a per-column level and encoding schemas.
The data are described by a schema that starts with the keyword message and contains a group of fields. Each field is defined by a repetition (required, optional, or repeated), a type and a name.
Parquet schema:
message Customer { required int32 id; required binary firstname (UTF8); required binary lastname (UTF8); } |
Primitive types in Parquet are boolean, int32, int64, int96, float, double, binary and fixed_len_byte_array. There is no String type, but there are logical types that allow interpreting binary values as a String, JSON or other types.
Complex types are defined by a group type, which adds a layer of nesting.
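For example, a hypothetical schema where the field address is a group nesting two primitive fields could be declared as follows:

message Customer {
  required int32 id;
  required binary firstname (UTF8);
  optional group address {
    required binary street (UTF8);
    required binary city (UTF8);
  }
}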
The Distributed File System Custom Wrapper distribution consists of:
In order to use the Distributed File System Custom Wrapper in VDP, we must configure the Admin Tool to import the extension.
From the Distributed File System Custom Wrapper distribution, we will select the denodo-hdfs-customwrapper-${version}-jar-with-dependencies.jar file and upload it to VDP.
Important |
As this wrapper (the jar-with-dependencies version) contains the Hadoop client libraries themselves, increasing the JVM heap space of the VDP Admin Tool is required to avoid a Java heap space error when uploading the jar to VDP. |
No other jars are required as this one will already contain all the required dependencies.
Distributed File System extension in VDP
Once the custom wrapper jar file has been uploaded to VDP using the Admin Tool, we can create new data sources for this custom wrapper --and their corresponding base views-- as usual.
Go to New → Data Source → Custom and specify one of the possible wrappers:
Also check ‘Select Jars’ and choose the jar file of the custom wrapper.
Distributed File System Data Source
Depending on the selected wrapper, different input parameters are available. To update the parameters, press the refresh button.
Custom wrapper for reading delimited text files. This custom wrapper data source needs the following parameters:
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb://<container>\@<account>.blob.core.windows.net
or wasbs:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Blob Storage section.
abfs://<filesystem>\@<account>.dfs.core.windows.net
or abfss:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @ you have to enter \@. |
HDFSDelimitedTextFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. invoice_jan.csv, invoice_feb.csv, … set the File name pattern to (.*)invoice_(.*)\\.csv (notice that the regular expression is escaped as explained in the note below). The files in these directories that match the pattern will be processed by the wrapper:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \ you have to enter \\. |
Some “invisible” characters have to be entered in a special way:
Character | Meaning |
\t | Tab |
\f | Formfeed |
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the separator is the tab character \t you have to enter \\t. |
quote (“).
disabled by default.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the null value is \N you have to enter \\N. |
If you clear this check box, the wrapper will return an error if there is a row that does not have the expected structure. When you select this check box, you can check in the execution trace of a query whether the wrapper has ignored any rows, in the attribute “Number of invalid rows”.
HDFSDelimitedTextFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file or group of files, if the Path input parameter denotes a directory.
View results
Custom wrapper for reading sequence files. This custom wrapper data source needs the following parameters:
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb://<container>\@<account>.blob.core.windows.net
or wasbs:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Blob Storage section.
abfs://<filesystem>\@<account>.dfs.core.windows.net
or abfss:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \.
E.g. if the URI contains @ you have to enter \@. |
HDFSSequenceFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. file_1555297166.seq, file_1555300766.seq, … set the File name pattern to (.*)file_(.*)\\.seq (notice that the regular expression is escaped as explained in the note below). The files in these directories that match the pattern will be processed by the wrapper:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \ you have to enter \\. |
HDFSSequenceFileWrapper base view edition
View schema
The execution of the wrapper returns the key/value pairs contained in the file or group of files, if the Path input parameter denotes a directory.
View results
Custom wrapper for reading map files. This custom wrapper data source needs the following parameters:
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb://<container>\@<account>.blob.core.windows.net
or wasbs:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Blob Storage section.
abfs://<filesystem>\@<account>.dfs.core.windows.net
or abfss:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @ you have to enter \@. |
HDFSMapFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. invoice_jan.whatever, invoice_feb.whatever, … set the File name pattern to (.*)invoice_(.*)\\.whatever (notice that the regular expression is escaped as explained in the note below). The files in these directories that match the pattern will be processed by the wrapper:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \ you have to enter \\. |
org.apache.hadoop.io.WritableComparable interface. WritableComparable is used because records are sorted in key order.
org.apache.hadoop.io.Writable interface.
HDFSMapFileWrapper base view edition
View schema
The execution of the wrapper returns the key/value pairs contained in the file or group of files, if the Path input parameter denotes a directory.
View results
Custom wrapper for reading Avro files.
Important |
We recommend not using the HDFSAvroFileWrapper to directly access Avro files, as Avro is an internal serialization system mainly meant for use by applications running on the Hadoop cluster. Instead, we recommend using an abstraction layer on top of those files, e.g. Hive, Impala, Spark... |
This custom wrapper data source needs the following parameters:
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb://<container>\@<account>.blob.core.windows.net
or wasbs:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Blob Storage section.
abfs://<filesystem>\@<account>.dfs.core.windows.net
or abfss:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @ you have to enter \@. |
HDFSAvroFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. employees_jan.avro, employees_feb.avro, … set the File name pattern to (.*)employees_(.*)\\.avro (notice that the regular expression is escaped as explained in the note below). The files in these directories that match the pattern will be processed by the wrapper:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \ you have to enter \\. |
There are also two parameters that are mutually exclusive:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, } in the Avro schema JSON parameter, you have to escape these characters with \. For example: \{ "type": "map", "values": \{ "type": "record", "name": "ATM", "fields": [ \{ "name": "serial_no", "type": "string" \}, \{ "name": "location", "type": "string" \} ] \} \} |
HDFSAvroFileWrapper base view edition
Content of the /user/cloudera/schema.avsc file:
{"type" : "record", "name" : "Doc", "doc" : "adoc", "fields" : [ { "name" : "id", "type" : "string" }, { "name" : "user_friends_count", "type" : [ "int", "null" ] }, { "name" : "user_location", "type" : [ "string", "null" ] }, { "name" : "user_description", "type" : [ "string", "null" ] }, { "name" : "user_statuses_count", "type" : [ "int", "null" ] }, { "name" : "user_followers_count", "type" : [ "int", "null" ] }, { "name" : "user_name", "type" : [ "string", "null" ] }, { "name" : "user_screen_name", "type" : [ "string", "null" ] }, { "name" : "created_at", "type" : [ "string", "null" ] }, { "name" : "text", "type" : [ "string", "null" ] }, { "name" : "retweet_count", "type" : [ "int", "null" ] }, { "name" : "retweeted", "type" : [ "boolean", "null" ] }, { "name" : "in_reply_to_user_id", "type" : [ "long", "null" ] }, { "name" : "source", "type" : [ "string", "null" ] }, { "name" : "in_reply_to_status_id", "type" : [ "long", "null" ] }, { "name" : "media_url_https", "type" : [ "string", "null" ] }, { "name" : "expanded_url", "type" : [ "string", "null" ] } ] } |
View schema
The execution of the view returns the values contained in the Avro file specified in
the WHERE clause of the VQL sentence:
SELECT * FROM avro_ds_file WHERE avrofilepath = '/user/cloudera/file.avro' |
View results
After applying a flattening operation, the results are as follows.
Flattened results
The recommended way for dealing with projections in HDFSAvroFileWrapper is by means of the JSON schema parameters:
By giving the wrapper a JSON schema containing only the fields we are interested in, the reader used by the HDFSAvroFileWrapper will return only these fields to VDP, making the select operation faster.
If we configure the parameter Avro schema JSON with only some of the fields of the /user/cloudera/schema.avsc file used in the previous example, like in the example below (notice the escaped characters):
Schema with the selected fields:
\{ "type" : "record", "name" : "Doc", "doc" : "adoc", "fields" : [ \{ "name" : "id", "type" : "string" \}, \{ "name" : "user_friends_count", "type" : [ "int", "null" ] \}, \{ "name" : "user_location", "type" : [ "string", "null" ] \}, \{ "name" : "user_followers_count", "type" : [ "int", "null" ] \}, \{ "name" : "user_name", "type" : [ "string", "null" ] \}, \{ "name" : "created_at", "type" : [ "string", "null" ] \} ] \} |
the base view in VDP will contain a subset of the fields of the previous base view of the example: the ones matching the new JSON schema provided to the wrapper.
Base view with the selected fields
View results with the selected fields
Warning |
WebHDFSFileWrapper is deprecated.
Instead, use HDFSAvroFileWrapper, HDFSSequenceFileWrapper, HDFSMapFileWrapper or HDFSParquetFileWrapper with the webhdfs scheme in their File system URI parameter, placing the credentials in the XML configuration files. |
Custom wrapper for reading delimited text files using WebHDFS.
WebHDFS provides HTTP REST access to HDFS. It supports all HDFS user operations including reading files, writing to files, making directories, changing permissions and renaming.
The advantages of WebHDFS are:
The only difference between using the proxy or not lies in the host:port pair to which the HTTP requests are issued:
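For example (hypothetical host names; the ports shown are common defaults and depend on the cluster configuration), a direct connection to the NameNode and a connection through an HttpFS proxy would only differ in the authority part of the URI:

webhdfs://namenode.example.com:50070/user/cloudera/data
webhdfs://httpfs.example.com:14000/user/cloudera/data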
This custom wrapper data source needs the following parameters:
When using Amazon S3, <id>:<secret> should be indicated.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the null value is \N you have to enter \\N. |
WebHDFSFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
quote.
disabled by default.
WebHDFSFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file.
View results
Custom wrapper for reading Parquet files in Amazon S3. This custom wrapper data source needs the following parameters:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @ you have to enter \@. |
S3ParquetFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \ you have to enter \\. |
S3ParquetFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file.
View results
This wrapper implements Predicate Pushdown. Using Predicate Pushdown, the Parquet API can apply filters without reading all the data from the file, which reduces the execution time of each query. We can delegate:
And operators:
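As a hypothetical illustration (the view name and columns are invented), a query like the following over a base view created with this wrapper contains conditions that the Parquet reader can evaluate by itself, so only the matching data is read from the file:

SELECT * FROM s3_parquet_sales
WHERE country = 'DE' AND amount > 1000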
Custom wrapper for reading Parquet files.
Important |
We recommend not using the HDFSParquetFileWrapper to directly access Parquet files, as this is an internal columnar data representation mainly meant for use by applications running on the Hadoop cluster. Instead, we recommend using an abstraction layer on top of those files, e.g. Hive, Impala, Spark... |
This custom wrapper data source needs the following parameters:
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb://<container>\@<account>.blob.core.windows.net
or wasbs:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Blob Storage section.
abfs://<filesystem>\@<account>.dfs.core.windows.net
or abfss:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Data Lake Storage Gen 2 section.
gs://<bucket>
For configuring the credentials see Google Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @ you have to enter \@. |
HDFSParquetFileWrapper data source edition
Once the custom wrapper has been registered, we will be asked by VDP to create a base view for it. Its base views need the following parameters:
For example, if you want the base view to return the data of all the files that follow a pattern in their names, e.g. flights_jan.parquet, flights_feb.parquet, … set the File name pattern to (.*)flights_(.*)\\.parquet (notice that the regular expression is escaped as explained in the note below). The files in these directories that match the pattern will be processed by the wrapper:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the File name pattern contains \ you have to enter \\. |
HDFSParquetFileWrapper base view edition
View schema
The execution of the wrapper returns the values contained in the file.
View results
This wrapper implements Predicate Pushdown. Using Predicate Pushdown, the Parquet API can apply filters without reading all the data from the file, which reduces the execution time of each query.
Custom wrapper to retrieve file information from a distributed file system.
This custom wrapper data source needs the following parameters:
adl://<account name>.azuredatalakestore.net/
For configuring the credentials see Azure Data Lake Storage section.
wasb://<container>\@<account>.blob.core.windows.net
or wasbs:// for SSL encrypted HTTPS access
For configuring the credentials see Azure Blob Storage section.
gs://<bucket>
For configuring the credentials see Google Cloud Storage section.
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the URI contains @ you have to enter \@. |
DFSListFilesWrapper data source edition
The entry point for querying the wrapper is the parameter parentfolder. The wrapper will list the files that are located in this supplied directory. It is possible to do this in a recursive way, also retrieving the contents of the subfolders, by setting the parameter recursive to true.
Execution panel
The schema of the custom wrapper contains the following columns:
View schema
The following VQL sentence returns the files in the '/user/cloudera' HDFS directory, recursively:
SELECT * FROM listing_dfs
WHERE parentfolder = '/user/cloudera' AND recursive = true
View results
We can filter our query a bit more and retrieve only those files that were modified after '2018-09-01':
SELECT * FROM listing_dfs
WHERE parentfolder = '/user/cloudera' AND recursive = true
AND datemodified > DATE '2018-09-01'
View results
The wrappers of this distribution that read file formats like Parquet, Avro, delimited text, sequence or map files can increase their capabilities when combined with the DFSListFilesWrapper.
As all of these wrappers need an input path for the file or the directory that is going to be read, we can use the DFSListFilesWrapper for retrieving the file paths that we are interested in, according to some attribute value of their metadata, e.g. modification time.
For example, suppose that we want to retrieve the files in the /user/cloudera/df/awards directory that were modified in November.
The following steps explain how to configure this scenario:
Parameterize the Path of the base view by adding an interpolation variable to its value, e.g. @path (@ is the prefix that identifies a value parameter as an interpolation variable).
By using the variable @path, you do not have to provide the final path value when creating the base view. Instead, the values of the Path parameter will be provided at runtime by the DFSListFilesWrapper view through the join operation (configured in the next step).
DFSListFilesWrapper.pathwithoutscheme = HDFSDelimitedTextFileWrapper.path
SELECT * FROM join:view
WHERE recursive = true
AND parentfolder = '/user/cloudera/df/awards'
AND datemodified > DATE '2018-11-1'
we obtain data only from the delimited files that were modified in November.
The Distributed File System Custom Wrapper can access data stored in Amazon S3 with the following Hadoop FileSystem clients:
It is deprecated and not supported by the new version of this custom wrapper, version 7.0, as it was removed in Hadoop 3.x.
Use S3A instead, as the S3A client can read all files created by S3N.
S3N is not supported by the new version of this custom wrapper, version 7.0, as it was removed in Hadoop 3.x.
The S3A client can read all files created by S3N. It should be used wherever possible.
If we are reading Parquet files in Amazon S3, we should use the S3ParquetFileWrapper. This wrapper uses s3a by default. In this case you should use the wrapper parameters discussed in the S3ParquetFileWrapper section.
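For reference, an S3A path used in the wrapper parameters has the form s3a://<bucket>/<path>, e.g. s3a://mybucket/data/invoices/ (hypothetical bucket and folder).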
We have different options to connect with an S3 bucket:
If we need to change our connection configuration we can use a Custom core-site.xml file as explained in the following points.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3.awsAccessKeyId</name> <description>AWS access key ID</description> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3.awsSecretAccessKey</name> <description>AWS secret key</description> <value>YOUR SECRET ACCESS KEY</value> </property> </configuration> |
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3n.awsAccessKeyId</name> <description>AWS access key ID</description> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3n.awsSecretAccessKey</name> <description>AWS secret key</description> <value>YOUR SECRET ACCESS KEY</value> </property> </configuration> |
S3A supports several authentication mechanisms. By default the custom wrapper will search for credentials in the following order:
For using this authentication method, declare the credentials (access and secret keys) in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.access.key</name> <description>AWS access key ID.</description> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <description>AWS secret key.</description> <value>YOUR SECRET ACCESS KEY</value> </property> </configuration> |
The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are looked for.
An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs. This mechanism is available only when running your application on an Amazon EC2 instance and there is an IAM role associated with the instance, but provides the greatest ease of use and best security when working with Amazon EC2 instances.
Temporary Security Credentials can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.
To authenticate with these:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.aws.credentials.provider</name> <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider </value> </property> <property> <name>fs.s3a.access.key</name> <description>AWS access key ID.</description> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <description>AWS secret key.</description> <value>YOUR SECRET ACCESS KEY</value> </property> <property> <name>fs.s3a.session.token </name> <value>SECRET-SESSION-TOKEN</value> </property> </configuration> |
The lifetime of session credentials is fixed when the credentials are issued; once they expire the application will no longer be able to authenticate to AWS, so you must get a new set of credentials.
To use assumed roles, the wrapper must be configured to use the Assumed Role Credential Provider, org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider, in
the configuration option fs.s3a.aws.credentials.provider in the wrapper configuration file Custom core-site.xml.
This Assumed Role Credential provider will read in the fs.s3a.assumed.role.* options needed to connect to the Session Token Service Assumed Role API:
The credentials used to authenticate against the STS can come from the fs.s3a.access.key and fs.s3a.secret.key pair, environment variables, or some other supplier of long-lived secrets.
If you wish to use a different authentication mechanism, other than
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider, set it in the property fs.s3a.assumed.role.credentials.provider.
Below you can see the properties required for configuring IAM Assumed Roles in this custom wrapper, using its configuration file, Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.aws.credentials.provider</name> <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value> </property> <property> <name>fs.s3a.assumed.role.arn</name> <description> AWS ARN for the role to be assumed. Required if the fs.s3a.aws.credentials.provider contains org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider </description> <value>YOUR AWS ROLE</value> </property> <property> <name>fs.s3a.assumed.role.credentials.provider</name> <description> List of credential providers to authenticate with the STS endpoint and retrieve short-lived role credentials. Only used if AssumedRoleCredentialProvider is the AWS credential Provider. If unset, uses "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider". </description> <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value> </property> <property> <name>fs.s3a.assumed.role.session.duration</name> <value>30m</value> <description> Duration of assumed roles before a refresh is attempted. Only used if AssumedRoleCredentialProvider is the AWS credential Provider. Range: 15m to 1h </description> </property> <property> <name>fs.s3a.access.key</name> <description>AWS access key ID.</description> <value>YOUR ACCESS KEY ID</value> </property> <property> <name>fs.s3a.secret.key</name> <description>AWS secret key.</description> <value>YOUR SECRET ACCESS KEY</value> </property> </configuration> |
When the V4 signing protocol is used, AWS requires the explicit region endpoint to be used —hence S3A must be configured to use the specific endpoint. This is done in the configuration option fs.s3a.endpoint in the Custom core-site.xml of the wrapper. You can use the core-site.xml, located in the conf folder of the distribution, as a guide. Otherwise a Bad Request exception could be thrown.
As an example of configuration, the endpoint for S3 Frankfurt is
s3.eu-central-1.amazonaws.com:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.s3a.endpoint</name> <description>AWS S3 endpoint to connect to. An up-to-date list is provided in the AWS documentation. Without this property, the standard region (s3.amazonaws.com) is assumed.</description> <value>s3.eu-central-1.amazonaws.com</value> </property> </configuration> |
You can find the full list of AWS Regions and their endpoints on the AWS website: Amazon Simple Storage Service (Amazon S3).
The Distributed File System Custom Wrapper can access data stored in Azure Data Lake Storage.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.adl.oauth2.access.token.provider.type</name> <value>ClientCredential</value> </property> <property> <name>fs.adl.oauth2.refresh.url</name> <value>YOUR TOKEN ENDPOINT</value> </property> <property> <name>fs.adl.oauth2.client.id</name> <value>YOUR CLIENT ID</value> </property> <property> <name>fs.adl.oauth2.credential</name> <value>YOUR CLIENT SECRET</value> </property> </configuration> |
The Distributed File System Custom Wrapper can access data stored in Azure Blob Storage.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.azure.account.key.<account>.blob.core.windows.net</name> <value>YOUR ACCESS KEY</value> </property> </configuration> |
Since the Distributed File System Custom Wrapper for Denodo 7.0 (as this functionality requires Java 8), this wrapper can access data stored in Azure Data Lake Storage Gen 2.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.azure.account.key.<account>.dfs.core.windows.net</name> <value>YOUR ACCOUNT KEY</value> </property> </configuration> |
Since the Distributed File System Custom Wrapper for Denodo 7.0 (as this functionality requires Java 8), this wrapper can access data stored in Google Cloud Storage.
Place the credentials in the wrapper configuration file Custom core-site.xml. You can use the core-site.xml, located in the conf folder of the distribution, as a guide.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>
<property> <name>google.cloud.auth.service.account.enable</name> <value>true</value> <description>Whether to use a service account for GCS authorization. If an email and keyfile are provided then that service account will be used. Otherwise the connector will look to see if it is running on a GCE VM with some level of GCS access in its service account scope, and use that service account.</description> </property> <property> <name>google.cloud.auth.service.account.json.keyfile</name> <value>/PATH/TO/KEYFILE</value> <description>The JSON key file of the service account used for GCS access when google.cloud.auth.service.account.enable is true.</description> </property> </configuration> |
Wrappers that read file contents from Google Cloud Storage, like HDFSDelimitedTextFileWrapper, HDFSAvroFileWrapper, etc., require access with a member that has the storage.objects.get permission.
The DFSListFilesWrapper, as it lists files from buckets, requires access with a member that has the storage.buckets.get permission.
For more information on roles and permissions see https://cloud.google.com/storage/docs/access-control/iam-roles.
The Distributed File System Custom Wrapper transparently reads compressed files in any of these compression formats:
The configuration required for accessing a Hadoop cluster with Kerberos enabled is the same as the one needed to access the distributed file system and, additionally, the user must supply the Kerberos credentials.
The Kerberos parameters are:
! Note |
If you enter a literal that contains one of the special characters used to indicate interpolation variables @, \, ^, {, }, you have to escape these characters with \. E.g. if the Kerberos principal name contains @ you have to enter \@. |
The Distributed File System Custom Wrapper provides three ways for accessing a kerberized Hadoop cluster:
In this case only the Kerberos enabled parameter should be checked. The wrapper would use the Kerberos ticket to authenticate itself against the Hadoop cluster.
In all three scenarios the krb5.conf file should be present in the file system. Below there is an example of the Kerberos configuration file:
[libdefaults] renew_lifetime = 7d forwardable = true default_realm = EXAMPLE.COM ticket_lifetime = 24h dns_lookup_realm = false dns_lookup_kdc = false [domain_realm] sandbox.hortonworks.com = EXAMPLE.COM cloudera = CLOUDERA [realms] EXAMPLE.COM = { admin_server = sandbox.hortonworks.com kdc = sandbox.hortonworks.com } CLOUDERA = { kdc = quickstart.cloudera admin_server = quickstart.cloudera max_renewable_life = 7d 0h 0m 0s default_principal_flags = +renewable } [logging] default = FILE:/var/log/krb5kdc.log admin_server = FILE:/var/log/kadmind.log kdc = FILE:/var/log/krb5kdc.log |
The algorithm to locate the krb5.conf file is the following:
There is an exception: if you are planning to create VDP views that use the same Key Distribution Center and the same realm, the Kerberos Distribution Center parameter can be provided instead of having the krb5.conf file in the file system.
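If the krb5.conf file is not picked up automatically, a common alternative (standard JVM behavior, not specific to this wrapper) is to point the VDP Server JVM to it explicitly through the java.security.krb5.conf system property:

-Djava.security.krb5.conf=/path/to/krb5.conf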
Data source edition
Symptom
Error message: “SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]”.
Resolution
You are trying to connect to a Kerberos-enabled Hadoop cluster. You should configure the custom wrapper accordingly. See Secure cluster with Kerberos section for configuring Kerberos on this custom wrapper.
Symptom
Error message: “Cannot get Kerberos service ticket: KrbException: Server not found in Kerberos database (7) ”.
Resolution
Check that nslookup is returning the fully qualified hostname of the KDC. If not, modify the /etc/hosts of the client machine for the KDC entry to be of the form "IP address fully.qualified.hostname alias".
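For example, with hypothetical values, the entry would look like:

192.168.1.10 kdc.example.com kdc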
Symptom
Error message: “Invalid hostname in URI s3n://<id>:<secret>@<bucket>”.
Resolution
This method of placing credentials in the URL is discouraged. Configure the credentials on the core-site.xml instead (see Amazon S3 support section).
Symptom
Error message: "Error accessing Parquet file: Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path= hdfs://serverhdfs/apps/hive/warehouse/parquet/.hive-staging_hive_2017-03-06_08-/-ext-10000; isDirectory=true; modification_time=1488790684826; access_time=0; owner=hive; group=hdfs; permission=rwxr-xr-x; isSymlink=false}"
Resolution
Hive can store metadata inside a Parquet file folder. You can check in the error message whether the custom wrapper is trying to access any metadata. In the error of the example you can see that it is accessing a folder called .hive-staging*. The solution is to configure Hive to store its metadata in another location.
Symptom
Error message: “Could not initialize class org.xerial.snappy.Snappy”
Resolution
On Linux platforms, an error may occur when Snappy compression/decompression is enabled although its native library is available from the classpath.
The native library snappy-<version>-libsnappyjava.so for Snappy compression is included in the snappy-java-<version>.jar file. When the JVM initializes the JAR, the library is added to the default temp directory. If the default temp directory is mounted with the noexec option, it results in the above exception.
One solution is to specify a different temp directory that has already been mounted without the noexec option, as follows:
-Dorg.xerial.snappy.tempdir=/path/to/newtmp |
In some cases, it is advisable to use the libraries of the Hadoop vendor you are connecting to (Cloudera, Hortonworks, …), instead of the Apache Hadoop libraries distributed in this custom wrapper.
In order to use the Hadoop vendor libraries there is no need to import the Distributed File System Custom Wrapper as an extension as explained in the Importing the custom wrapper into VDP section.
You have to create the custom data sources using the ‘Classpath’ parameter instead of the ‘Select Jars’ option.
Click Browse to select the directory containing the required dependencies for this custom wrapper, that is:
Here you can find the libraries for Cloudera and Hortonworks Hadoop distributions:
http://repo.hortonworks.com/content/repositories/releases/org/apache/hadoop/
C:\Work\denodo-hdfs-libs directory
Distributed File System Data Source
! Note |
When clicking Browse, you will browse the file system of the host where the Server is running and not where the Administration Tool is running. |
From MapR documentation: “MapR XD Distributed File and Object Store manages both structured and unstructured data. It is designed to store data at exabyte scale, support trillions of files, and combine analytics and operations into a single platform.”
As MapR XD supports an HDFS-compatible API, you can use the DFS Custom Wrapper to connect to the MapR file system. This section explains how to do that.
To connect to the MapR cluster you need to install the MapR Client on your client machine (where the VDP server is running):
Set the $MAPR_HOME environment variable to the directory where the MapR Client was installed. If the MAPR_HOME environment variable is not defined, /opt/mapr is the default path.
Copy mapr-clusters.conf from the MapR cluster to the $MAPR_HOME/conf folder in the VDP machine.
demo.mapr.com secure=true maprdemo:7222 |
Every user who wants to access a secure cluster must have a MapR ticket (maprticket_<username>) in the temporary directory (the default location).
Use the $MAPR_HOME/maprlogin command line tool to generate one:
C:\opt\mapr\bin>maprlogin.bat password -user mapr [Password for user 'mapr' at cluster 'demo.mapr.com': ] MapR credentials of user 'mapr' for cluster 'demo.mapr.com' are written to 'C:\Users\<username>\AppData\Local\Temp/maprticket_<username>' |
Add -Dmapr.library.flatclass to the VDP Server JVM options.
VDP Server JVM options
Otherwise, VDP will throw the exception java.lang.UnsatisfiedLinkError from JNISecurity.SetParsingDone() while executing the custom wrapper.
In order to use the MapR vendor libraries you should not import the DFS Custom Wrapper into Denodo.
You have to create the custom data source using the ‘Classpath’ parameter instead of the ‘Select Jars’ option. Click Browse to select the directory containing the required dependencies for this custom wrapper:
The MapR Maven repository is located at http://repository.mapr.com/maven/. The name of the JAR files that you must use contains the version of Hadoop, Kafka, Zookeeper and MapR that you are using:
As the MapRClient native library is bundled in the maprfs-<mapr_version> jar, you should use the maprfs jar that comes with the previously installed MapR Client, as the library is dependent on the operating system.
! Important |
The MapR native library is included in these Custom Wrapper dependencies and can be loaded only once. Therefore, if you plan to access other MapR sources with Denodo, like:
you have to use the same classpath to configure all the custom wrappers and the JDBC driver; see 'C:\Work\MapR Certification\mapr-lib' in the image above. With this configuration Denodo can reuse the same classloader and load the native library only once. |
Configure the DFS wrapper parameters as usual:
MapR data source edition
MapR base view edition