Bulk Data Load¶
There are three scenarios where Virtual DataPort stores data in a database:
When caching data obtained from the source.
When performing a data movement.
When creating or updating a remote table or a summary view.
By default, to store data in a database, Virtual DataPort executes an SQL INSERT statement for each row. It uses techniques that speed up the execution of these statements (e.g. it uses prepared statements and executes them in batches).
Most databases provide a proprietary interface to load large amounts of data. These interfaces insert data much faster than executing INSERT statements, and Virtual DataPort can use them for the majority of databases.
When Virtual DataPort executes a query that involves one of these interfaces, it writes the data to a temporary file, uses the interface to transfer that file to the database, and then deletes the file.
Before enabling this feature on a data source or the cache engine, take into account that these interfaces are only worth using when the number of rows to insert is in the tens of thousands or higher. With fewer rows there is no performance increase, and there may even be a performance decrease.
JDBC data sources can be configured to use the bulk load APIs of the following databases:
AlloyDB for PostgreSQL
Amazon Athena
Amazon Aurora MySQL
Amazon Aurora PostgreSQL
Amazon Redshift
Azure Synapse SQL (previously known as Azure SQL Data Warehouse)
Databricks
Denodo Embedded MPP *
Google BigQuery
Hive 2.0.0 and higher
IBM DB2 *
Impala
Kudu (using the Impala 3.x Kudu adapter)
Microsoft SQL Server *
MySQL
Netezza
Oracle *
PostgreSQL
PrestoDB/Trino
Snowflake
Spark
Teradata
Vertica
Yellowbrick *
Note
Bulk data load is not available in Agora for the JDBC data sources marked with *.
Almost all of these databases have something in common: to use their bulk data load interfaces, the client application (in this case, the Virtual DataPort server) has to write the data to an external file with a specific format. After this, the data is transferred to the database.
By default, the Server creates these temporary files inside the Denodo installation. If you want these files to be written to a different directory, enter its path in the Work path field of the Read & Write tab of the data source. The temporary files are deleted once the data is loaded into the database.
Some databases require installing an auxiliary application provided by the database vendor; this application transfers the data file to the database. For example, for Oracle you need to install sqlldr, for SQL Server, bcp, etc. For other databases this is not necessary because the functionality is included in their JDBC driver.
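As an illustration only, these are the kinds of commands such tools accept when run by hand; the server names, credentials and file names below are placeholders, and Virtual DataPort invokes the tool itself during bulk loads once it is installed and configured in the data source:
# Illustrative manual invocations (all values are placeholders):
bcp mydb.dbo.customer in customer.dat -S sqlserver-host -U load_user -P load_password -c
sqlldr userid=load_user/load_password@ORCL control=customer.ctl data=customer.dat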
The following subsections list the databases for which you need to install an external application or change the default settings:
- AlloyDB for PostgreSQL
- Amazon Athena
- Amazon Aurora MySQL
- Amazon Aurora PostgreSQL
- Amazon Redshift
- DB2
- Databricks
- Denodo Embedded MPP
- BigQuery
- Hive
- Impala
- Kudu
- MySQL
- Oracle
- PostgreSQL
- PrestoDB/Trino
- Snowflake
- Spark
- Microsoft SQL Server and Azure Synapse SQL
- Teradata
- Yellowbrick
- Settings of the Generation of the Temporary Files
Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS¶
Databases using Hadoop-compatible storage (Hive, Impala, Presto, Spark, Databricks and Denodo Embedded MPP) require configuring the option “Use Bulk Data Load APIs” in order to insert data into their tables.
In these cases, Denodo:
First generates temporary files, in Parquet format, containing the data to insert,
then uploads those files to the specific path configured in this section,
and finally performs the necessary operations to make sure the database table takes its data from the provided path.
The object storages supported for uploading this data are S3, ADLS and HDFS. It is also possible to configure paths that are compatible with S3 or with the Hadoop API. See section Support for Hadoop-compatible storage for more details on this topic. For the settings that control the generation of these Parquet files, see section Generation of Temporary Files for a Hadoop-compatible Distributed Object Storage.
Virtual DataPort implements two mechanisms to perform this bulk data load:
Built-in libraries (recommended).
External Hadoop installation: used in Denodo 7.0 and earlier. It requires downloading and configuring Hadoop.
When you create a data source, only the new implementation is available. However, if you open an existing data source that was configured with the old implementation, you can keep using it or switch to the built-in implementation. The following subsections describe each mechanism in detail.
Built-in Implementation¶
In Design Studio or the Administration Tool, edit the JDBC data source, click the Read & Write tab and select Use bulk data load APIs. In the File System drop-down, select the file system that will be used to upload the data files. The available options are HDFS, S3 and ADLS. You can also use other systems that are compatible with the Hadoop API, like Google Cloud Storage. To do this, select HDFS and provide the necessary Hadoop properties (see section Support for Hadoop-compatible storage).
HDFS
In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/
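As an optional sanity check, you can verify from the host of the Virtual DataPort server that this path exists and is writable, assuming a Hadoop client is available there (the built-in implementation itself does not require one; the URI is the example above):
# List the target directory and test write access (example URI):
hadoop fs -ls hdfs://acme-node1.denodo.com/user/admin/data/
hadoop fs -touchz hdfs://acme-node1.denodo.com/user/admin/data/.denodo_write_test
hadoop fs -rm hdfs://acme-node1.denodo.com/user/admin/data/.denodo_write_test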
If access to HDFS is restricted by the Apache Knox gateway, configure the following in the section Apache Knox configuration (you can verify these settings with the check shown after this list):
Authentication: select Basic authentication (the authentication method currently supported).
Base URI: enter the base URI of Apache Knox. For example: https://acme.com:8443/gateway/default
User: enter the user name.
Password: enter the password.
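To verify the Apache Knox settings, assuming the Knox topology exposes WebHDFS, you can issue a WebHDFS request through the gateway using the same base URI, credentials and HDFS path as above (all values are the examples/placeholders used in this section):
# -k skips TLS certificate validation and is only appropriate for testing:
curl -k -u <user>:<password> "https://acme.com:8443/gateway/default/webhdfs/v1/user/admin/data?op=LISTSTATUS"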
S3
In the S3 URI field, enter the URI of the S3 bucket to which Virtual DataPort will upload the data file. For example:
s3://my-bucket/
Configure the authentication. There are two ways (a quick verification example is shown after this list):
Using AWS access keys:
Enter the AWS access key ID and the AWS secret access key.
Or, to obtain the credentials from the credentials vault, select Obtain credentials from password vault.
Optionally, you can enter the AWS IAM role ARN. The AWS user will assume this role to obtain the privileges needed to interact with the bucket. The role ARN is the Amazon Resource Name of the IAM role you want to assume. Specify it like arn:aws:iam::<awsAccountId>:role/<roleName>.
Or, automatically obtain the AWS S3 credentials from the AWS instance where this Virtual DataPort server is running. To do this, select Use Denodo AWS instance credentials. Optionally, you can also specify an AWS IAM role ARN to obtain the necessary privileges.
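Optionally, you can confirm the bucket and the role outside of Denodo with the AWS CLI, assuming it is installed and configured with the same credentials (the bucket name and role ARN are the examples/placeholders above):
# List the target bucket:
aws s3 ls s3://my-bucket/
# If an IAM role is configured, verify that it can be assumed:
aws sts assume-role --role-arn arn:aws:iam::<awsAccountId>:role/<roleName> --role-session-name bulk-load-test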
ADLS Gen2
In the ABFS URI field, enter the URI of the ADLS Gen2 storage account and container to which Virtual DataPort will upload the data file. For example:
abfs://<container>@<storage account>.dfs.core.windows.net/
Configure the authentication. There are three ways (a quick verification example is shown after this list):
Using Azure shared key.
Using OAuth 2.0 client credentials:
Enter the Token endpoint, the Client identifier and the Client secret.
Or, to obtain the credentials from the credentials vault, select Obtain credentials from password vault.
Using Azure managed identity: automatically obtain the ADLS Gen2 credentials from the Azure virtual machine where this Virtual DataPort server is running. Optionally, you can also specify the Token endpoint, the Client identifier and the Tenant identifier.
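As a rough check outside of Denodo, you can list the container with the Azure CLI, assuming it is installed and logged in with an identity that has access to the storage account (the container and storage account are the placeholders above; exact command availability may vary with the CLI version):
# List the files in the ADLS Gen2 container (placeholders as above):
az storage fs file list --file-system <container> --account-name <storage account> --auth-mode login --output table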
In the Hadoop properties section, you can set the Hadoop properties used in Hadoop configuration files like core-site.xml.
External Hadoop Installation¶
Note
This option is only available for existing data sources that were created with Hadoop executable using VQL.
To be able to use this approach to perform bulk data loads, follow these steps:
Install the Java Development Kit version 8 (JDK) on the host of Virtual DataPort. Hadoop requires it. A Java Runtime Environment (JRE) is not valid.
Set the JAVA_HOME environment variable to point to the path of this JDK. On Windows, it is quite common for this path to contain spaces, but Hadoop does not support them. If the JDK is installed on a path with spaces, set the environment variable JAVA_HOME to C:\Java and create a symbolic link:
mklink /d "C:\Java" "C:\Program Files\Java\jdk1.8.0_152"
If you get a privileges error executing this command, click the Windows menu, search for Command Prompt, right-click on it and click Run as administrator.
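On Linux, it is usually enough to export the variable; a minimal example, assuming an illustrative JDK path:
# Point JAVA_HOME at the JDK 8 installation (path is illustrative):
export JAVA_HOME=/opt/java/jdk1.8.0_152
export PATH="$JAVA_HOME/bin:$PATH"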
Find out the specific version of Hadoop you are connecting to, go to the Apache Hadoop site, and download the binary file (not the source file) corresponding to that version. For example, hadoop-2.9.0.tar.gz.
Hadoop provides a single file that contains the Hadoop server and the client libraries to access it; there is no package that contains just the client libraries.
Decompress this file in the host where the Virtual DataPort server runs.
On Linux, run this to decompress it:
tar -xzf <archive_name>.tar.gz
On Windows, copy the files inside the directory <DENODO_HOME>/dll/vdp/winutils to <HADOOP_HOME>/bin. Otherwise, the generation of Parquet files and the connection to Hadoop will fail.
Set the following environment variable in the system:
HADOOP_HOME={<directory where you decompressed Hadoop>}
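For example, on Linux, assuming the archive was decompressed under /opt (the path is illustrative):
# Make the Hadoop client available to the Virtual DataPort server:
export HADOOP_HOME=/opt/hadoop-2.9.0
export PATH="$HADOOP_HOME/bin:$PATH"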
Depending on the authentication method you want to use to connect to Hadoop when uploading the files for bulk data load, choose one of these options:
To use standard authentication – not Kerberos – to connect to Hadoop, set this environment variable in the host where the Virtual DataPort server runs:
HADOOP_USER_NAME=<username in Hadoop>
This user name must have read and write privileges over the HDFS folder to which the data will be uploaded.
Note
If several applications on the same host connect to Hadoop, instead of setting environment variables, create a script that sets the variables HADOOP_USER_NAME and HADOOP_HOME and then, depending on the platform, invokes hadoop.cmd or hadoop. Later, in the configuration of bulk data load, point to this script (a minimal sketch of such a script is shown below).
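A minimal sketch of such a wrapper script for Linux, assuming an illustrative user name and Hadoop path, could look like this:
#!/bin/bash
# Set the variables for this invocation only (values are illustrative):
export HADOOP_USER_NAME=denodo_user
export HADOOP_HOME=/opt/hadoop-2.9.0
# Forward all arguments to the Hadoop client of the standard distribution:
exec "$HADOOP_HOME/bin/hadoop" "$@"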
To use Kerberos authentication to connect to Hadoop, make the following changes:
Edit the file <HADOOP_HOME>/etc/hadoop/core-site.xml and add the following properties:

<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
<property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
</property>
Create a script:
On Linux, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh with this content:

#!/bin/bash
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
$HADOOP_HOME/bin/hadoop "$@"
After creating the file, execute this:
chmod +x <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh
On Windows, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.bat with this content:

@echo off
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
%HADOOP_HOME%\bin\hadoop.cmd %*
Replace <path to the keytab file> with the path to the keytab file that contains the keys to connect to the Hadoop server.
By invoking kinit before invoking hadoop, we make sure the system has a valid Kerberos ticket to be able to connect to Hadoop.
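Before configuring the data source, you can run the same kinit manually to confirm that the keytab and principal are valid (placeholders as above):
# Obtain a ticket from the keytab, then list the credential cache:
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
klist
If klist shows a ticket for the principal, the keytab is usable by the script.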
In the administration tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and:
In the Hadoop executable location box, enter the path to the file hadoop.cmd on Windows, or hadoop on Linux. If in the previous step you created the script renew_kerberos_ticket_for_bulk_data_load, enter the path to that file instead.
In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/