
HDFS Path

Use this type of path to obtain data from a file or a set of files located in an HDFS file system. You can also use this path type with other routes if there is a Hadoop connector for them.

You can find information about the Filters tab in the section Compressed or Encrypted Data Sources; filters work the same way for any type of path (local, HTTP, FTP…).

Configuration

In URI, enter the path you want to obtain the data from. It can point to a file or a directory and you can use interpolation variables (see section Paths and Other Values with Interpolation Variables).
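For example, assuming a name node listening on port 8020 and an interpolation variable named REGION (both of which are only illustrative), the URI could look like one of these:

  hdfs://<name_node_host>:8020/user/denodo/sales/
  hdfs://<name_node_host>:8020/user/denodo/@REGION/sales/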

In Hadoop properties, you can set the same Hadoop properties that you would put in Hadoop configuration files such as core-site.xml. This also allows you to use other routes if there is a Hadoop connector, as explained in the section Support for Hadoop-Compatible Routes.
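For instance, a property that you would otherwise declare in core-site.xml as the fragment below (hadoop.security.authentication is used here only as an illustration) is entered in the Hadoop properties table as the pair Name = hadoop.security.authentication, Value = kerberos:

  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>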

Paths Pointing to a Directory

When you create a base view over a data source that points to a directory, Virtual DataPort infers the schema of the new view from the first file in the directory and it assumes that all the other files have the same schema.

Only for delimited-file data sources: if the path points to a directory and you enter a value in File name pattern, the data source will only process the files whose name matches the regular expression entered in this box. For example, if you only want to process the files with the extension log, enter (.*)\.log.
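For instance, with the pattern (.*)\.log, a hypothetical directory with these files would be processed as follows:

  access_2023.log    processed (matches the pattern)
  errors.log         processed (matches the pattern)
  readme.txt         skipped (does not match the pattern)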

Note

For XML data sources, if a Validation file has been provided, all files in the directory have to match that Schema or DTD.

Support for Hadoop-Compatible Routes

You can use the HDFS option with other routes if there is a Hadoop connector for them. For example, you can work with Azure file systems. These are the steps to use these routes:

  1. Download the jars hadoop-azure.jar, jetty-util and jetty-util-ajax.

  2. Import the jars in the platform:

    1. Click the menu File > Extension Management. Then, in the tab Libraries, click Import.

    2. Select jar as the resource type and click Add to add the jar files.

    3. Restart Virtual DataPort.

  3. Configure the connection according to the Azure file system type:

    1. Azure Data Lake Gen 1

      1. In the URI, enter the path. It has the following format: adl://<account_name>.azuredatalakestore.net/<path>/<file_name>.

      2. Configure the Hadoop properties required for authentication. Check the Hadoop documentation for the available authentication methods and the properties that configure each of them. Here is an example with OAuth 2.0 client keys authentication:

        Name                                        Value
        fs.adl.oauth2.refresh.url                   <URL of OAuth endpoint>
        fs.adl.oauth2.credential                    <Credential value>
        fs.adl.oauth2.client.id                     <Client identifier>
        fs.adl.oauth2.access.token.provider.type    ClientCredential

    2. Azure Data Lake Storage Gen 2 with shared key authentication

      1. In the URI, enter the path. It has the following format: abfs://<container>\@<account_name>.dfs.core.windows.net/<path>/<file_name>.

      Note

      The “@” character in the URI must be escaped as shown in the example above to avoid confusing it with an environment variable. This does not apply if the configuration is for Bulk Data Load or Object Storage data in Parquet Format.

      2. Configure the following Hadoop properties. Check the Hadoop documentation for the available authentication methods and the properties that configure each of them. Here is an example with shared key authentication:

        Name                                                         Value
        fs.azure.account.key.<account_name>.dfs.core.windows.net    <Access key>
        fs.azure.always.use.ssl                                     false

        Note

        You can enable SSL either by setting the property fs.azure.always.use.ssl to true or by accessing the resource through a route like abfss://<container>\@<account_name>.dfs.core.windows.net/<path>/<file_name> (if you use this alternative, remove the property).

    3. Azure Data Lake Storage Gen 2 with OAuth2 client credentials

      1. In the URI, enter the path. It has the following format: abfs://<container>\@<account_name>.dfs.core.windows.net/<path>/<file_name>.

      Note

      The “@” character in the URI must be escaped as shown in the example above to avoid confusing it with an environment variable. This does not apply if the configuration is for Bulk Data Load or Object Storage data in Parquet Format.

      2. Configure the following Hadoop properties. Check the Hadoop documentation for the available authentication methods and the properties that configure each of them. Here is an example with OAuth 2.0 client credentials (see the configuration file sketch after these steps for the equivalent core-site.xml fragment):

        Name                                       Value
        fs.azure.account.auth.type                 OAuth
        fs.azure.account.oauth.provider.type       org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
        fs.azure.account.oauth2.client.endpoint    https://login.microsoftonline.com/<directory (tenant) ID>/oauth2/token
        fs.azure.account.oauth2.client.id          <Application (client) ID>
        fs.azure.account.oauth2.client.secret      <Application (client) secret>

        Note

        You can enable SSL by accessing the resource through a route like abfss://<container>\@<account_name>.dfs.core.windows.net/<path>/<file_name>.

    4. Azure Blob File System

      1. In the URI, enter the path. It has the following format: wasb://<container>\@<account_name>.blob.core.windows.net/<path>/<file_name>.

      Note

      The “@” character in the URI must be escaped as shown in the example above to avoid confusing it with an environment variable. This does not apply if the configuration is for Bulk Data Load or Object Storage data in Parquet Format.

      2. Configure the following Hadoop properties. Check the Hadoop documentation for the available authentication methods and the properties that configure each of them. Here is an example with shared key authentication:

        Name                                                          Value
        fs.azure.account.key.<account_name>.blob.core.windows.net    <Access key>
        fs.azure.always.use.ssl                                      false

        Note

        You can enable SSL either by setting the property fs.azure.always.use.ssl to true or by accessing the resource through a route like wasbs://<container>\@<account_name>.blob.core.windows.net/<path>/<file_name> (if you use this alternative, remove the property).

  4. In the Authentication section, select None.
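As a reference only, the OAuth 2.0 client credentials configuration shown above for Azure Data Lake Storage Gen 2 would correspond to a core-site.xml fragment like the following sketch; the tenant ID, application ID and secret are placeholders, and in Virtual DataPort you enter each name/value pair in the Hadoop properties table instead of editing this file:

  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>https://login.microsoftonline.com/<directory (tenant) ID>/oauth2/token</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value><Application (client) ID></value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value><Application (client) secret></value>
  </property>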

Authentication

These are the authentication methods available:

  • None: use this option if the HDFS server does not require authentication.

  • Simple: you have to configure the user name. This authentication mode is equivalent to using the HADOOP_USER_NAME variable when you execute Hadoop commands in a terminal (see the example after this list).

  • Kerberos with user and password: you have to configure the user name and the password.

  • Kerberos with keytab: you have to configure the user name and upload the keytab file.
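For illustration, the Simple mode is comparable to running a Hadoop command with the HADOOP_USER_NAME environment variable set, as in this terminal sketch; the user name, host and path are placeholders:

  HADOOP_USER_NAME=<user_name> hadoop fs -ls hdfs://<name_node_host>:8020/<path>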
