Impala

Before using the bulk data load API with Impala, install the Hadoop command line utility on the host where the Virtual DataPort server runs. This utility is included in the Hadoop client distribution, which can be downloaded from the Apache Hadoop site.

After this, when enabling “Use bulk data load API” on the “Read & Write” tab of the data source, do the following:

  • In the Hadoop executable location box, enter the path to the hadoop.cmd executable (on Windows) or the hadoop executable (on Linux).
  • In HDFS URI, enter the URI to which Virtual DataPort will upload the data file. For example: hdfs://acme-node1.denodo.com/user/admin/data/
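
For example, on a Linux host the two fields might be filled in as follows. The executable path is only an assumption; use the actual location of the Hadoop client installation on your server.

    Hadoop executable location:  /opt/hadoop/bin/hadoop
    HDFS URI:                    hdfs://acme-node1.denodo.com/user/admin/data/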

At runtime, Virtual DataPort uploads the data in two steps:

  1. It uploads the temporary data files to the Hadoop Distributed File System (HDFS) used by Impala.

    To do this, it executes the command hadoop fs -put ... (hadoop.cmd on Windows) locally, on the host where the Virtual DataPort server runs (see the example command after this list).

    Depending on the configuration of Impala, you may need to export the environment variable HADOOP_USER_NAME, set to the user name of the data source, so that the next step of the process can complete.

    Virtual DataPort does not use the credentials of the data source to execute this command; it simply runs it as-is. Therefore, HDFS has to be configured to accept connections from this host, for example, by setting up an SSH key on the host where Virtual DataPort runs that allows this connection.

  2. It creates the tables in Impala associated with the data uploaded in step 1.

    To do this, it connects to Impala using the credentials of the JDBC data source and executes LOAD DATA INPATH .... For this statement to succeed, that user needs access to the files uploaded in the first step (see the sketch below).
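
As a rough sketch of step 1, the upload amounts to running commands like the following on the host of the Virtual DataPort server. The executable path, user name, and file name are illustrative placeholders, and the exact arguments Virtual DataPort builds may differ.

    # Identify the data source user to HDFS, if the Impala configuration requires it
    export HADOOP_USER_NAME=admin

    # Copy the temporary data file into the configured HDFS URI
    /opt/hadoop/bin/hadoop fs -put /tmp/denodo_bulk_data.csv hdfs://acme-node1.denodo.com/user/admin/data/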
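
Step 2 then comes down to a statement like the following, executed through the JDBC connection of the data source. The file and table names are placeholders; Virtual DataPort generates its own temporary names.

    LOAD DATA INPATH '/user/admin/data/denodo_bulk_data.csv'
    INTO TABLE my_database.my_bulk_table;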