To use the bulk load mechanism of Impala, Virtual DataPort generates Apache Parquet files and then transfers them to Impala. To generate these files, Virtual DataPort uses the Apache Hadoop libraries, which are part of the Apache Hadoop distribution and must be present on the computer where Virtual DataPort runs. You do not need to install Hadoop; the only requirement is that these libraries are present.
Follow these steps:
Install the Java Development Kit version 8 (JDK) on the host of Virtual DataPort.
This is necessary because the Hadoop libraries require a JDK. A Java Runtime Environment (JRE) is not sufficient.
Set the JAVA_HOME environment variable to point to the path of this JDK. On Windows, it is quite common for this path to contain spaces, but Hadoop does not support them. If the JDK is installed on a path with spaces, set the environment variable JAVA_HOME to C:\Java and create a symbolic link:
mklink /d "C:\Java" "C:\Program Files\Java\jdk1.8.0_152"
If this command fails with a privileges error, open the Windows menu, search for Command Prompt, right-click it and select Run as administrator.
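The two JAVA_HOME requirements above (a path without spaces that points to a JDK rather than a JRE) can be checked before continuing. The following is a minimal sketch for Linux or a Bash shell on Windows; the function name `check_java_home` is hypothetical, and the check relies on the assumption that a JDK 8 installation contains `bin/javac` while a JRE does not.

```shell
#!/bin/bash
# Hypothetical helper: reports whether a directory looks like a usable JDK home.
# Prints "ok" on success, or a short reason and a non-zero status otherwise.
check_java_home() {
  local dir="$1"
  case "$dir" in
    "")    echo "unset"; return 1 ;;
    *" "*) echo "path contains spaces"; return 1 ;;
  esac
  # A JDK ships the javac compiler; a plain JRE does not.
  if [ -x "$dir/bin/javac" ] || [ -f "$dir/bin/javac.exe" ]; then
    echo "ok"
  else
    echo "no javac found (JRE or wrong path?)"
    return 1
  fi
}
```

For example, `check_java_home "$JAVA_HOME"` can be run in the same session that will start the Virtual DataPort server.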
Find out the specific version of Hadoop you are connecting to. Then go to the Apache Hadoop site and download the binary file (not the source file) for that version. For example, hadoop-2.9.0.tar.gz.
Hadoop provides a single file that contains both the Hadoop server and the client libraries to access it; there is no package that contains only the client libraries.
Decompress this file on the host where the Virtual DataPort server runs.
On Linux, run this to decompress it:
tar -xzf <archive_name>.tar.gz
On Windows, copy the files inside the directory <HADOOP_HOME>/bin. Otherwise, the generation of Parquet files and the connection to Hadoop will fail.
Set the following environment variable in the system:
HADOOP_HOME=<directory where you decompressed Hadoop>
Depending on the authentication method you want to use to connect to Hadoop when uploading the files for bulk data load, choose one of these options:
To use standard authentication – not Kerberos – to connect to Hadoop, set this environment variable in the host where the Virtual DataPort server runs:
HADOOP_USER_NAME=<username in Hadoop>
This user name must have read and write privileges over the HDFS folder to which the data will be uploaded.
If several applications on the same host connect to Hadoop, instead of setting environment variables, create a script that sets the variables HADOOP_HOME and HADOOP_USER_NAME and then, depending on the platform, invokes hadoop.cmd (Windows) or hadoop (Linux). Later, in the configuration of bulk data load, point to this script.
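A wrapper of this kind could look like the sketch below for Linux. It is only an illustration: the install directory, the user name and the names `BULK_HADOOP_HOME`, `BULK_HADOOP_USER` and `run_hadoop` are hypothetical, and the launcher is assumed to be at `bin/hadoop` under the directory where Hadoop was decompressed.

```shell
#!/bin/bash
# Placeholder values (assumptions); adjust both for your installation.
BULK_HADOOP_HOME="/opt/hadoop-2.9.0"
BULK_HADOOP_USER="denodo_bulk"

# Runs the Hadoop launcher with the two variables set only for that
# process, so other applications on the host keep their own settings.
run_hadoop() {
  HADOOP_HOME="$BULK_HADOOP_HOME" HADOOP_USER_NAME="$BULK_HADOOP_USER" \
    "$BULK_HADOOP_HOME/bin/hadoop" "$@"
}
```

When saved as a standalone script, a final line `run_hadoop "$@"` forwards whatever arguments the caller passes to the script.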
To use Kerberos authentication to connect to Hadoop, make the following changes:
Edit the file <HADOOP_HOME>/etc/hadoop/core-site.xml and add the following properties:
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
Create a script:
On Linux, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh with this content:
#!/bin/bash
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
"$HADOOP_HOME/bin/hadoop" "$@"
After creating the file, execute this:
chmod +x <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.sh
On Windows, create the file <DENODO_HOME>/renew_kerberos_ticket_for_bulk_data_load.bat with this content:
@echo off
kinit -k -t "<path to the keytab file>" <Kerberos principal name of the Hadoop service>
%HADOOP_HOME%\bin\hadoop.cmd %*
Replace <path to the keytab file> with the path to the keytab file that contains the keys to connect to the Hadoop server.
By invoking kinit before invoking hadoop, we make sure the system has a valid Kerberos ticket to be able to connect to Hadoop.
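The scripts above request a ticket on every invocation. As a variation, and only as a sketch, the ticket could be requested only when none is valid. This assumes the MIT Kerberos client tools, where `klist -s` prints nothing and exits with status 0 only if the credentials cache holds a valid ticket. The keytab path, the principal and the function name `ensure_ticket` are hypothetical placeholders.

```shell
#!/bin/bash
# Placeholders (assumptions); replace with your keytab and principal.
KEYTAB="/path/to/service.keytab"
PRINCIPAL="hadoop-user@EXAMPLE.COM"

# Requests a new ticket only when the cached one is missing or expired.
ensure_ticket() {
  # "klist -s" is silent; exit status 0 means a valid ticket is cached.
  if klist -s; then
    echo "ticket still valid"
  else
    kinit -k -t "$KEYTAB" "$PRINCIPAL" && echo "ticket renewed"
  fi
}
```

Calling `ensure_ticket` before the Hadoop launcher keeps the same guarantee as the unconditional kinit while avoiding a round trip to the KDC on each upload.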
In the administration tool, edit the JDBC data source and click the tab Read & Write. Select Use bulk data load APIs and:
In the Hadoop executable location box, enter the path to the file hadoop.cmd on Windows, or hadoop on Linux. If you created the script renew_kerberos_ticket_for_bulk_data_load in the previous step, enter the path to that script instead.
In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example:
At runtime, when a query involves a bulk load to Impala, Virtual DataPort does two things:
It uploads the temporary data files to the Impala distributed file system (HDFS).
To do this, it executes the command hadoop.cmd fs put ... locally, on the host where the Virtual DataPort server runs.
Depending on the configuration of Impala, sometimes you need to export the environment variable HADOOP_USER_NAME with the user name of the data source so the next step of the process can be completed.
To execute this command, Virtual DataPort does not use the credentials of the data source. It just executes that command. Therefore, the HDFS system has to be configured to allow connections from this host. For example, by setting up an SSH key on the host where Virtual DataPort runs that allows this connection or by allowing connections through Kerberos.
It creates the tables in Impala associated with the data uploaded in step #1.
To do this, it connects to Impala using the credentials of the JDBC data source and executes LOAD DATA INPATH .... To complete this command, the user account of the JDBC data source needs to have access to the files uploaded in the first step.
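To make the two runtime steps concrete, the sketch below only prints commands of the same shape; it does not contact Hadoop or Impala, and it is not the exact text Virtual DataPort runs (the second step is sent over JDBC, not from a shell). The function name `bulk_load_commands` and the file, directory and table names in the usage example are hypothetical.

```shell
#!/bin/bash
# Illustrative dry run: prints one line per runtime step.
bulk_load_commands() {
  local local_file="$1" hdfs_dir="$2" table="$3"
  # Step 1: upload the local data file to HDFS with the Hadoop launcher.
  echo "hadoop fs -put $local_file $hdfs_dir"
  # Step 2: statement executed on Impala with the data source's credentials.
  echo "LOAD DATA INPATH '$hdfs_dir' INTO TABLE $table"
}
```

For instance, `bulk_load_commands /tmp/data.parquet hdfs://namenode/user/denodo/data sales` prints the upload command followed by the load statement. The split shows why two sets of permissions matter: the first line runs as the local operating system user (or HADOOP_USER_NAME), the second as the data source's user account.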