Impala¶
Virtual DataPort implements two mechanisms to perform bulk data loads to Impala:
Built-in libraries: this is the new implementation in Denodo 8.0.
External Hadoop installation: used in previous versions of Denodo. It requires downloading and configuring Hadoop.
When you create a data source, only the new implementation is available. However, if you open an existing data source that was configured with the old implementation, you can keep using it or switch to the built-in implementation.
Built-in Implementation¶
In the Administration Tool, edit the JDBC data source, click the Read & Write tab and select Use bulk data load APIs. In the File System drop-down, select the file system that will be used to upload the data files. The available options are HDFS and S3. You can also use other systems compatible with the Hadoop API, such as Azure Data Lake Storage. To do this, select HDFS and provide the necessary Hadoop properties (see section Support for Hadoop-compatible routes).
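For example, to upload the files to Azure Data Lake Storage Gen2 through the Hadoop-compatible ABFS connector, a configuration along these lines could be used. The container, storage account and path are hypothetical, and the exact Hadoop properties depend on your authentication method (see the section mentioned above):
HDFS URI: abfss://<container>@<storage account>.dfs.core.windows.net/data/
Hadoop property: fs.azure.account.key.<storage account>.dfs.core.windows.net = <storage account access key>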
HDFS
In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/
In some cases, access to HDFS is restricted via an Apache Knox gateway. If this is your scenario, provide this configuration in the Apache Knox configuration section:
Authentication: only basic authentication is supported. In the drop-down, select Basic authentication.
Base URI: enter the base URI of Apache Knox. For example:
https://acme.com:8443/gateway/default
User: specify the user name.
Password: enter the password.
S3
In the S3 URI field, enter the URI of the S3 bucket to which Virtual DataPort will upload the data file. For example:
s3://my-bucket/
Configure the authentication. There are two ways:
Using AWS access keys:
Enter the AWS access key ID and the AWS secret access key.
Or, to obtain the credentials from the credentials vault, select Obtain credentials from password vault.
Optionally, you can enter the AWS IAM role ARN. The AWS user will assume the role to get the necessary privileges to interact with the bucket. The role ARN is the Amazon Resource Name of the IAM role you want to assume. Specify the role ARN like
arn:aws:iam::<awsAccountId>:role/<roleName>
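For instance, a role ARN could look like this (the account ID and role name are hypothetical):
arn:aws:iam::123456789012:role/denodo-bulk-load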
Or, automatically obtain the credentials from the AWS instance where this Virtual DataPort server is running. To do this, select Use Denodo AWS instance credentials. Optionally, you can also specify an AWS IAM role ARN to obtain the necessary privileges.
In the Hadoop properties section, you can set the Hadoop properties used in Hadoop configuration files such as core-site.xml.
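For example, a Hadoop client property that core-site.xml would declare like this can be entered in this section as the equivalent name/value pair (dfs.client.use.datanode.hostname is used here purely as an illustration; set the properties that your environment actually requires):
<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>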
At runtime, when a query involves a bulk load to Impala, Virtual DataPort uploads the data in chunks and in parallel, instead of generating a single file with all the data and, once it is completely written to disk, transferring it to Impala.
Let us say that you execute a query to load the cache of a view and that the cache database is Impala. The process of loading the cache is the following:
The VDP server connects to Impala using the credentials of the JDBC data source and executes a CREATE TABLE statement (a hypothetical sketch of this kind of statement is shown after this process description). The statement specifies the URI configured in the Read & Write tab as the location of the data. Notice that the user account of the data source needs access to that location.
As soon as it obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.
Once the Server writes 5,000,000 rows into the file, it closes the file and begins sending it to the location defined in the Impala data source. Simultaneously, it writes the next rows into another file.
5,000,000 is the default value of the field Batch insert size (rows) of the Read & Write tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.
When the Server finishes writing another data file (i.e., the file reaches the batch insert size), it begins to transfer it to Impala even if the files generated previously have not been completely transferred. By default, the Server transfers up to 10 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '<new limit per query>';
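For example, to allow up to 20 concurrent uploads per query (the value 20 is only illustrative):
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '20';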
Once all the files are uploaded, the cache table will contain the cached data because the location indicated in the CREATE TABLE statement contains the uploaded data files.
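As a rough sketch, and only to illustrate the idea, the statement issued in the first step has this general form; the table name, columns and URI are hypothetical, and the exact statement generated by Virtual DataPort may differ:
CREATE TABLE cache_view_example (
    customer_id INT,
    customer_name STRING
)
STORED AS PARQUET
LOCATION 'hdfs://acme-node1.denodo.com/user/admin/data/';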
External Hadoop Installation¶
Note
This option is only available for existing data sources that were created with the Hadoop executable option using VQL.
To be able to use this approach to perform bulk data loads, follow these steps:
Install the Java Development Kit version 8 (JDK) on the host of Virtual DataPort. Hadoop requires it. A Java Runtime Environment (JRE) is not valid.
Set the JAVA_HOME environment variable to point to the path of this JDK. On Windows, it is quite common for this path to contain spaces, but Hadoop does not support them. If the JDK is installed on a path with spaces, set the environment variable JAVA_HOME to C:\Java and create a symbolic link:
mklink /d "C:\Java" "C:\Program Files\Java\jdk1.8.0_152"
If you get a privileges error executing this command, click the Windows menu, search for Command Prompt, right-click on it and click Run as administrator.
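For example, you could set the variable system-wide from the same elevated Command Prompt; the value matches the symbolic link created above:
setx /M JAVA_HOME "C:\Java"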
Find out the specific version of Hadoop you are connecting to, go to the Apache Hadoop site and download the binary distribution (not the source) corresponding to that version. For example, hadoop-2.9.0.tar.gz.
Hadoop provides a single file that contains the Hadoop server and the client libraries to access it; there is no package that contains only the client libraries.
Decompress this file in the host where the Virtual DataPort server runs.
On Linux, run this to decompress it:
tar -xzf <archive_name>.tar.gz
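For example, with the archive mentioned above:
tar -xzf hadoop-2.9.0.tar.gz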
On Windows, copy the files inside the directory <DENODO_HOME>/dll/vdp/winutils to <HADOOP_HOME>/bin. Otherwise, the generation of Parquet files and the connection to Hadoop will fail.
Set the following environment variable in the system:
HADOOP_HOME={<directory where you decompressed Hadoop>}
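On Linux, for example:
export HADOOP_HOME=/opt/hadoop-2.9.0
On Windows, from an elevated Command Prompt:
setx /M HADOOP_HOME "C:\hadoop-2.9.0"
The directories above are illustrative; use the directory where you actually decompressed Hadoop.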
Depending on the authentication method you want to use to connect to Hadoop when uploading the files for bulk data load, choose one of these options:
To use standard authentication – not Kerberos – to connect to Hadoop, set this environment variable in the host where the Virtual DataPort server runs:
HADOOP_USER_NAME=<username in Hadoop>
This user name must have read and write privileges over the HDFS folder to which the data will be uploaded.
Note
If several applications on the same host connect to Hadoop, instead of setting environment variables, create a script that sets the variables HADOOP_USER_NAME and HADOOP_HOME and then, depending on the platform, invokes hadoop.cmd or hadoop. Later, in the configuration of bulk data load, point to this script. A minimal sketch of such a script is shown below.
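For instance, on Linux the wrapper script could look like the following; the Hadoop directory is illustrative and, as in the Kerberos scripts further below, it assumes the hadoop executable is reachable under HADOOP_HOME (on Windows, the equivalent would be a .bat file that invokes hadoop.cmd):
#!/bin/bash
# Hypothetical wrapper for bulk data load: set the required variables
# and delegate all arguments to the Hadoop client.
export HADOOP_USER_NAME=<username in Hadoop>
export HADOOP_HOME=/opt/hadoop-2.9.0
$HADOOP_HOME/hadoop "$@"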
To use Kerberos authentication to connect to Hadoop, make the following changes:
Edit the file <HADOOP_HOME>/etc/hadoop/core-site.xml and add the following properties:
<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
<property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
</property>
Create a script:
On Linux, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh with this content:
#!/bin/bash
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
$HADOOP_HOME/hadoop "$@"
After creating the file, execute this:
chmod +x <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh
On Windows, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.bat with this content:
@echo off
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
%HADOOP_HOME%\hadoop.cmd %*
Replace <path to the keytab file> with the path to the keytab file that contains the keys to connect to the Hadoop server. By invoking kinit before invoking hadoop, we make sure the system has a valid Kerberos ticket to be able to connect to Hadoop.
In the Administration Tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and:
In the Hadoop executable location box, enter the path to the file hadoop.cmd on Windows, or hadoop on Linux. If in the previous step you created the script renew_kerberos_ticket_for_bulk_data_load, enter the path to that file instead.
In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example:
hdfs://acme-node1.denodo.com/user/admin/data/
At runtime, when a query involves a bulk load to Impala, Virtual DataPort uploads the data in chunks and in parallel, instead of generating a single file with all the data and, once it is completely written to disk, transferring it to Impala.
Let us say that you execute a query to load the cache of a view and that the cache database is Impala. The process of loading the cache is the following:
The VDP server connects to Impala using the credentials of the JDBC data source and executes a CREATE TABLE statement. It specifies the URI configured in the Read & Write tab as the location of the data. Notice that the user account of the data source needs access to that location.
As soon as it obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.
Once the Server writes 5,000,000 rows into the file, it closes the file and begins sending it to the location defined in the Impala data source. Simultaneously, it writes the next rows into another file.
To do this, it executes the command
hadoop.cmd fs put...
locally, on the host where the Virtual DataPort server runs.
Depending on the configuration of Impala, sometimes you need to export the environment variable HADOOP_USER_NAME with the user name of the data source so that the next step of the process can be completed.
To execute this command, Virtual DataPort does not use the credentials of the data source; it just executes the command. Therefore, the HDFS system has to be configured to allow connections from this host, for example by setting up an SSH key on the host where Virtual DataPort runs that allows this connection, or by allowing connections through Kerberos.
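Purely as an illustration of the command mentioned in this step, it has this general form; the local file name is hypothetical and the target URI is the one configured in the data source (on Windows, the executable is hadoop.cmd instead of hadoop):
hadoop fs -put /tmp/<generated data file>.parquet hdfs://acme-node1.denodo.com/user/admin/data/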
5,000,000 is the default value of the field Batch insert size (rows) of the Read & Write tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.
When the Server finishes writing another data file (i.e., the file reaches the batch insert size), it begins to transfer it to Impala even if the files generated previously have not been completely transferred. By default, the Server transfers up to 10 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '<new limit per query>';
Once all the files are uploaded, the cache table will contain the cached data because the location indicated in the CREATE TABLE statement contains the uploaded data files.