Before using the Impala API to perform bulk data loads, install the Hadoop command line utility on the host where the Virtual DataPort server runs. It is included in the client distribution, which can be downloaded from the Hadoop site.
After this, when enabling “Use bulk data load API” on the “Read & Write” tab of the data source, do the following:
- In the Hadoop executable location box, enter the path to the file
  hadoop.cmd on Windows or hadoop on Linux.
- In HDFS URI, enter the URI to which Virtual DataPort will upload
  the data file.
At runtime, to upload the data, Virtual DataPort performs two steps:
It uploads the temporary data files to HDFS, the distributed file system used by Impala.
To do this, it executes the command hadoop.cmd fs -put ... locally, on the host where the Virtual DataPort server runs.
Depending on the configuration of Impala, you may need to export the environment variable HADOOP_USER_NAME with the user name of the data source so the next step of the process can complete.
To execute this command, Virtual DataPort does not use the credentials of the data source; it simply runs the command locally. Therefore, HDFS has to be configured to allow connections from this host, for example, by setting up an SSH key on the host where Virtual DataPort runs that allows this connection.
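For troubleshooting, the upload step can be reproduced manually. The sketch below assembles the same kind of fs -put invocation that Virtual DataPort runs; the build_put_command helper, the file paths, the HDFS URI, and the "admin" user name are illustrative assumptions, not part of the product.

```python
import os
import subprocess

def build_put_command(local_file, hdfs_uri, hadoop_executable="hadoop"):
    # Illustrative helper: assembles the "fs -put" invocation that copies
    # a local temporary data file into HDFS. On Windows the executable
    # would be hadoop.cmd instead of hadoop.
    return [hadoop_executable, "fs", "-put", local_file, hdfs_uri]

# Hypothetical paths and URI, for illustration only.
cmd = build_put_command("/tmp/denodo_bulk_0001.csv",
                        "hdfs://namenode:8020/user/admin/data")

# If Impala's configuration requires it, set HADOOP_USER_NAME to the
# user name of the data source before running the command.
env = dict(os.environ, HADOOP_USER_NAME="admin")

# Uncomment on a host where the Hadoop client is installed:
# subprocess.run(cmd, env=env, check=True)
```

Running the command with check=True makes a failed upload raise an error immediately, which is useful when verifying that HDFS accepts connections from the Virtual DataPort host.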
It creates the tables in Impala associated with the data uploaded in step 1.
To do this, it connects to Impala using the credentials of the JDBC data source and executes
LOAD DATA INPATH .... To complete this command, the user needs read access to the files uploaded in the first step.
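The statement executed in this second step has roughly the following shape. This is a minimal sketch assuming a hypothetical HDFS path and table name; the real values are derived from the data being loaded.

```python
def build_load_statement(hdfs_path, table_name):
    # Illustrative: LOAD DATA INPATH moves the uploaded file from its
    # HDFS location into the storage directory of the Impala table.
    return "LOAD DATA INPATH '{}' INTO TABLE {}".format(hdfs_path, table_name)

# Hypothetical values, for illustration only.
stmt = build_load_statement(
    "hdfs://namenode:8020/user/admin/data/denodo_bulk_0001.csv",
    "customer_sales")
print(stmt)
```

Because LOAD DATA INPATH moves (rather than copies) the files, the user that Impala impersonates must have permission on the uploaded files, which is why step 1 may need HADOOP_USER_NAME set to match.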