Bulk Data Load

There are three scenarios where Virtual DataPort stores data in a database:

  1. When caching data obtained from the source.

  2. When performing a data movement.

  3. When creating or updating a remote table.

By default, to store data in a database, Virtual DataPort executes an SQL INSERT statement for each row. It uses techniques that speed up this process (e.g. it uses prepared statements and executes them in batches).

Most databases provide a proprietary interface for loading large amounts of data. These interfaces insert data much faster than INSERT statements, and Virtual DataPort is capable of using them for the majority of databases.

When Virtual DataPort executes a query that involves using one of these interfaces, it writes the data to a temporary text file, uses the interface to transfer that file to the database, and then deletes the file.
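
As an illustration of the kind of interface involved (not of the exact commands Virtual DataPort runs internally), PostgreSQL exposes the COPY command, which loads an entire file in a single operation. The host, database, table and file names below are hypothetical placeholders:

  # Load a whole CSV file into a table in one operation, instead of one INSERT per row
  psql -h postgres.example.com -U load_user -d sales -c "\copy customer FROM '/tmp/denodo_bulk_1234.csv' WITH (FORMAT csv)"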

Before enabling this feature on a data source or the cache engine, take into account that these interfaces are only worth using when the number of rows to insert is in the tens of thousands or higher. With fewer rows there is no performance gain, and there may even be a performance loss.

JDBC data sources can be configured to use the bulk load APIs of the following databases:

  • AlloyDB for PostgreSQL

  • Amazon Athena

  • Amazon Aurora MySQL

  • Amazon Aurora PostgreSQL

  • Amazon Redshift

  • Azure Synapse SQL (previously known as Azure SQL Data Warehouse)

  • Databricks

  • Google BigQuery

  • Hive 2.0.0 and higher

  • IBM DB2

  • Impala

  • Kudu (using Impala 3.x Kudu adapter)

  • Microsoft SQL Server

  • MySQL

  • Netezza

  • Oracle

  • PostgreSQL

  • PrestoDB/Trino

  • Snowflake

  • Spark

  • Teradata

  • Vertica

  • Yellowbrick

Almost all of these databases have something in common: to use their bulk data load interfaces, the client application (in this case, the Virtual DataPort server) has to write the data to an external file with a specific format. This file is then transferred to the database.

By default, the Server creates these temporary files inside the Denodo installation. If you want these files to be written to a different directory, enter its path in the Work path field of the Read & Write tab of the data source. The temporary files are deleted once the data is loaded into the database.

Some databases require installing an auxiliary application, provided by the database vendor, that transfers the data file to the database. For example, for Oracle you need to install sqlldr and for SQL Server, bcp. For other databases this is not necessary because the functionality is included in their JDBC driver.
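
You do not normally run these utilities by hand (Virtual DataPort builds and executes the command itself), but a rough sketch of what an invocation looks like can help when checking that the utility is installed and reachable on the path. The server, table, control file and data file names below are hypothetical:

  # SQL Server: load a delimited file into a table with bcp
  bcp sales.dbo.customer in /tmp/denodo_bulk.txt -S sqlserver.example.com -U load_user -P <password> -c -t ,

  # Oracle: load a data file with SQL*Loader (sqlldr), driven by a control file
  sqlldr load_user/<password>@ORCLPDB control=customer.ctl data=/tmp/denodo_bulk.dat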

The following subsections list the databases for which you need to install an external application or change its default settings:

Bulk Data Load on a Distributed Object Storage like HDFS or S3

Databases that use Hadoop-compatible storage (Hive, Impala, Presto, Spark and Databricks) require configuring the option “Use Bulk Data Load APIs” in order to insert data into their tables.

In these cases, Denodo:

  • First, it generates temporary files, in Parquet format, containing the data to insert.

  • Then, it uploads those files to the path configured in this section.

  • Finally, it performs the operations needed to make sure the database table takes its data from the provided path.

The object storages supported for uploading this data are S3 and HDFS. It is also possible to configure routes that are compatible with S3 or with the Hadoop API, such as Azure ADLS. See the section Support for Hadoop-compatible storage for more details on this topic. For the settings that control the generation of these Parquet files, see the section Generation of Temporary Files for a Hadoop-compatible Distributed Object Storage.

Virtual DataPort implements two mechanisms to perform this bulk data load:

  • Built-in libraries (recommended): this is the new implementation in Denodo 8.0.

  • External Hadoop installation: used in previous versions of Denodo. It requires downloading and configuring Hadoop.

When you create a data source, only the new implementation is available. However, if you open an existing data source that was configured with the old implementation, you can keep using it or switch to the built-in implementation. The following subsections describe each mechanism in detail.

Built-in Implementation

In Design Studio or the Administration Tool, edit the JDBC data source, click the Read & Write tab and select Use bulk data load APIs. In the File System drop-down, select the file system that will be used to upload the data files. The available options are HDFS and S3. You can also use other systems that are compatible with the Hadoop API, such as Azure Data Lake Storage: to do this, select HDFS and provide the necessary Hadoop properties (see the section Support for Hadoop-compatible storage).

HDFS

In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/
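
If a Hadoop client is available on the host of Virtual DataPort, an optional way to confirm that this URI is reachable and writable before running a query is to list the directory and create and delete a small test file. The URI below is the example value above:

  # List the target directory
  hdfs dfs -ls hdfs://acme-node1.denodo.com/user/admin/data/

  # Create and remove an empty file to confirm write permissions
  hdfs dfs -touchz hdfs://acme-node1.denodo.com/user/admin/data/denodo_write_test
  hdfs dfs -rm hdfs://acme-node1.denodo.com/user/admin/data/denodo_write_test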

If access to HDFS is restricted by the Apache Knox gateway, do this in the Apache Knox configuration section:

  • Authentication: select Basic authentication (this is the only authentication method currently supported).

  • Base URI: enter the base URI of Apache Knox. For example: https://acme.com:8443/gateway/default.

  • User: specify the user name.

  • Password: enter the password.

S3

  1. In the S3 URI field, enter the URI of the S3 bucket to which Virtual DataPort will upload the data file. For example: s3://my-bucket/

  2. Configure the authentication. There are two ways:

    1. Using AWS access keys:

      • Enter the AWS access key ID and the AWS secret access key.

      • Or, to obtain the credentials from the credentials vault, select Obtain credentials from password vault.

      Optionally, you can enter the AWS IAM role ARN. The AWS user will assume this role to get the privileges needed to interact with the bucket. The role ARN is the Amazon Resource Name of the IAM role you want to assume. Specify it in the form arn:aws:iam::<awsAccountId>:role/<roleName>.

    2. Or automatically obtain the credentials from the AWS instance where this Virtual DataPort server is running. To do this, select Use Denodo AWS instance credentials. Optionally, you can also specify an AWS IAM role ARN to get the necessary privileges (an optional command-line check of the S3 access is sketched below).
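
Whichever option you choose, it can be useful to verify the S3 access from the command line before running queries. This is an optional sanity check, not something Virtual DataPort runs; it assumes the AWS CLI is installed and reuses the example bucket and the placeholders above:

  # List the bucket with the same access keys the data source will use
  AWS_ACCESS_KEY_ID="<AWS access key ID>" AWS_SECRET_ACCESS_KEY="<AWS secret access key>" aws s3 ls s3://my-bucket/

  # If an IAM role ARN is configured, check that it can be assumed
  aws sts assume-role --role-arn arn:aws:iam::<awsAccountId>:role/<roleName> --role-session-name bulk-load-test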

In the Hadoop properties section, you can set the Hadoop properties used in Hadoop configuration files such as core-site.xml.

External Hadoop Installation

Note

This option is only available for existing data sources that were created with the Hadoop executable option using VQL.

To be able to use this approach to perform bulk data loads, follow these steps:

  1. Install the Java Development Kit version 8 (JDK) on the host of Virtual DataPort. Hadoop requires it. A Java Runtime Environment (JRE) is not valid.

  2. Set the JAVA_HOME environment variable to point to the path of this JDK. On Windows, it is quite common for this path to contain spaces, but Hadoop does not support them. If the JDK is installed on a path with spaces, set the environment variable JAVA_HOME to C:\Java and create a symbolic link:

    mklink /d "C:\Java" "C:\Program Files\Java\jdk1.8.0_152"
    

    If you get a privileges error executing this command, click the Windows menu, search for Command Prompt, right-click on it and click Run as administrator.

  3. Find out the specific version of Hadoop you are connecting to, go to the Apache Hadoop site and download the binary file (not the source file) corresponding to that version. For example, hadoop-2.9.0.tar.gz.

    Hadoop provides a single file that contains the Hadoop server and the client libraries to access it; there is no package that contains just the client libraries.
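
    For example, for version 2.9.0 the binary archive can typically be downloaded from the Apache archive (the exact URL depends on the version and mirror you use):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz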

  4. Decompress this file in the host where the Virtual DataPort server runs.

    On Linux, run this to decompress it:

    tar -xzf <archive_name>.tar.gz
    
  5. On Windows, copy the files inside the directory <DENODO_HOME>/dll/vdp/winutils to <HADOOP_HOME>/bin. Otherwise, the generation of Parquet files and the connection to Hadoop will fail.
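
    For example, from a Command Prompt (the placeholders stand for your actual Denodo and Hadoop directories):

    copy "<DENODO_HOME>\dll\vdp\winutils\*" "<HADOOP_HOME>\bin"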

  6. Set the following environment variable in the system:

    HADOOP_HOME=<directory where you decompressed Hadoop>
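
    A hypothetical example, assuming Hadoop was decompressed to /opt/hadoop-2.9.0 on Linux or to C:\hadoop-2.9.0 on Windows (adjust the paths to your installation). On Linux, for example in the shell profile of the user that starts the Virtual DataPort server:

    export HADOOP_HOME=/opt/hadoop-2.9.0

    On Windows, the variable can be set persistently from a Command Prompt:

    setx HADOOP_HOME "C:\hadoop-2.9.0"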
    
  7. Depending on the authentication method you want to use to connect to Hadoop when uploading the files for bulk data load, choose one of these options:

    1. To use standard authentication – not Kerberos – to connect to Hadoop, set this environment variable in the host where the Virtual DataPort server runs:

      HADOOP_USER_NAME=<username in Hadoop>
      

      This user name must have read and write privileges over the HDFS folder to which the data will be uploaded.

      Note

      If several applications on the same host connect to Hadoop, instead of setting environment variables globally, create a script that sets the variables HADOOP_USER_NAME and HADOOP_HOME and then, depending on the platform, invokes hadoop.cmd or hadoop, as in the sketch below. Later, in the configuration of bulk data load, point to this script.
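
      A minimal sketch of such a wrapper script on Linux, assuming hypothetical values for the Hadoop directory and the Hadoop user name:

        #!/bin/bash
        # Set the variables only for this invocation instead of globally
        export HADOOP_HOME=/opt/hadoop-2.9.0
        export HADOOP_USER_NAME=denodo_loader
        # Forward all the arguments to the real Hadoop client
        $HADOOP_HOME/bin/hadoop "$@"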

    2. To use Kerberos authentication to connect to Hadoop, make the following changes:

      1. Edit the file <HADOOP_HOME>/etc/hadoop/core-site.xml and add the following properties:

        <property>
            <name>hadoop.security.authentication</name>
            <value>kerberos</value>
        </property>
        <property>
            <name>hadoop.security.authorization</name>
            <value>true</value>
        </property>
        
      2. Create a script:

        • On Linux, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh with this content:

          #!/bin/bash
          kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
          $HADOOP_HOME/bin/hadoop "$@"
          

          After creating the file, execute this:

          chmod +x <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh
          
        • On Windows, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.bat with this content:

          @echo off
          kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
          %HADOOP_HOME%\bin\hadoop.cmd %*
          

        Replace <path to the keytab file> with the path to the keytab file that contains the keys to connect to the Hadoop server.

        By invoking kinit before invoking hadoop, we make sure the system has a valid Kerberos ticket to be able to connect to Hadoop.

  8. In the administration tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and:

    • In the Hadoop executable location box, enter the path to the file hadoop.cmd on Windows, or hadoop on Linux. If in the previous step you created the script renew_kerberos_ticket_for_bulk_data_load, enter the path to that script instead.

    • In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/

