Impala

To use the Impala API to perform bulk data loads, first install the Hadoop client libraries on the host where the Virtual DataPort server runs. To do so, follow these steps:

  1. Find out the specific version of Hadoop you are connecting to.

  2. Go to the Apache Hadoop site and download the binary distribution that matches that version. For example, hadoop-2.9.0.tar.gz.

  3. Extract the package on the host where the Virtual DataPort server runs. On Linux, run this command to uncompress the package:

    tar -xzf <archive_name>.tar.gz
    
  4. Set the following environment variables in the system:

    HADOOP_USER_NAME=<username>
    HADOOP_HOME=<folder where you uncompressed the Hadoop client package>
    

    The Hadoop user name you specify must have read and write privileges on the HDFS folder to which the data will be uploaded (see the Linux example after this list).

  5. Edit the JDBC data source, click the Read & Write tab, select Use bulk data load APIs and:

    • In the Hadoop executable location box, enter the path to the file hadoop.cmd (on Windows) or hadoop (on Linux).
    • In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/
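For step 4, on Linux you could define the variables in the environment of the account that starts the Virtual DataPort server. The sketch below is an example only; the user name and installation path are placeholders you need to replace with your own values:

    # Example only: placeholder user name and Hadoop installation path
    export HADOOP_USER_NAME=hdfs_user
    export HADOOP_HOME=/opt/hadoop-2.9.0

On Windows, set the equivalent system environment variables instead.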

Note

If several applications on the same host connect to Hadoop, instead of setting these environment variables system-wide, create a script that sets the variables HADOOP_USER_NAME and HADOOP_HOME and then, depending on the platform, invokes hadoop.cmd or hadoop. In the Hadoop executable location box, enter the path to this new script.
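A minimal sketch of such a wrapper script for Linux is shown below. The user name and installation path are placeholders; on Windows, an equivalent .bat file invoking hadoop.cmd would be needed:

    #!/bin/sh
    # Wrapper used only by Virtual DataPort: set the Hadoop variables
    # and forward all arguments to the real hadoop executable.
    # The values below are placeholders; adjust them to your environment.
    export HADOOP_USER_NAME=hdfs_user
    export HADOOP_HOME=/opt/hadoop-2.9.0
    exec "$HADOOP_HOME/bin/hadoop" "$@"

Make the script executable and enter its path in the Hadoop executable location box.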


At runtime, when a query involves a bulk load to Impala, Virtual DataPort does two things:

  1. It uploads the temporary data files to the distributed file system (HDFS) used by Impala.

    To do this, it executes the command hadoop fs -put ... (hadoop.cmd fs -put ... on Windows) locally, on the host where the Virtual DataPort server runs (an equivalent manual command sequence is sketched at the end of this section).

    Depending on the configuration of Impala, you may need to export the environment variable HADOOP_USER_NAME with the user name of the data source so that the next step of the process can complete.

    Virtual DataPort does not use the credentials of the data source to execute this command; it simply runs it. Therefore, the HDFS system has to be configured to allow connections from this host, for example by setting up an SSH key on the host where Virtual DataPort runs that allows this connection.

  2. It creates the tables in Impala associated with the data uploaded in step 1.

    To do this, it connects to Impala with the credentials of the JDBC data source and executes LOAD DATA INPATH .... For this command to succeed, the user account of the JDBC data source needs access to the files uploaded in the first step.
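To verify the setup manually, you can reproduce both steps from the command line. The sketch below reuses the example HDFS URI from above and uses a placeholder file and table name; it only illustrates the kind of commands involved, not the exact statements Virtual DataPort generates:

    # Step 1 equivalent: copy a local data file to HDFS
    # (sample.csv is a placeholder)
    hadoop fs -put /tmp/sample.csv hdfs://acme-node1.denodo.com/user/admin/data/

    # Step 2 equivalent: load the uploaded file into an Impala table
    # (sample_table is a placeholder)
    impala-shell -i acme-node1.denodo.com -q "LOAD DATA INPATH '/user/admin/data/sample.csv' INTO TABLE sample_table"

If both commands succeed with the same user and credentials configured in the data source, the bulk load setup is working.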