Impala¶
Bulk data loads to Impala require configuring an object storage to which the data to insert is uploaded. See the section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS to configure bulk data loads to Impala. Virtual DataPort implements two mechanisms to perform bulk data loads to Impala:
Built-in libraries (recommended).
External Hadoop installation: used in Denodo 7.0 and earlier. It requires downloading and configuring Hadoop.
The following subsections explain the process Denodo follows in each scenario.
Data Load to Impala Using Built-in Implementation¶
At runtime, when a query involves a bulk load to Impala, Virtual DataPort uploads the data in chunks and in parallel, instead of generating a single file with all the data and transferring it to Impala only once it is completely written to disk.
Let us say that you execute a query to load the cache of a view and that the cache database is Impala. The process of loading the cache is the following:
The VDP server connects to Impala using the credentials of the JDBC data source and executes a CREATE TABLE statement. It specifies the URI configured in the Read & Write tab as the location of the data. Note that the user account of the data source needs to have access to this location.
As soon as it obtains the first rows of the result, it starts writing them to a data file. This file is written to disk in Parquet format.
Once the Server writes 5,000,000 rows into the file, it closes the file and begins sending it to the location defined in the Impala data source. Simultaneously, it writes the next rows into another file.
5,000,000 is the default value of the field Batch insert size (rows) of the Read & Write tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.
When the Server finishes writing another data file (i.e., the file reaches the batch insert size), it begins to transfer it to Impala even if the files generated previously have not been completely transferred. By default, the Server transfers up to 10 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '<new limit per query>';
Once all the files are uploaded, the cache table contains the cached data, because the location indicated in the CREATE TABLE statement contains the uploaded data files.
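The chunked, parallel upload described above can be sketched as follows. This is a minimal illustration, not the actual Denodo implementation: the `upload` helper is a stand-in for transferring one Parquet file to the table location, and the small batch sizes are only for demonstration (the real defaults are 5,000,000 rows per file and 10 concurrent uploads).

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_INSERT_SIZE = 5_000_000   # default "Batch insert size (rows)"
MAX_PARALLEL_UPLOADS = 10       # default concurrent uploads per query

def upload(chunk_file):
    # Placeholder: in the real process this transfers one Parquet file
    # to the location declared in the CREATE TABLE statement.
    return chunk_file

def bulk_load(rows, batch_size=BATCH_INSERT_SIZE,
              max_parallel=MAX_PARALLEL_UPLOADS):
    uploads = []
    # The pool caps how many files are transferred at once; files that
    # exceed the limit wait in the queue while rows keep being written.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        buffer = []
        for row in rows:
            buffer.append(row)
            if len(buffer) == batch_size:
                # Close the current file and start uploading it while
                # the next rows are written into another file.
                uploads.append(pool.submit(upload, list(buffer)))
                buffer.clear()
        if buffer:  # last, possibly partial, file
            uploads.append(pool.submit(upload, buffer))
    return [f.result() for f in uploads]
```

For example, `bulk_load(range(12), batch_size=5, max_parallel=2)` produces three "files" of 5, 5 and 2 rows, with at most two transfers in flight at a time.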
Data Load to Impala Using External Hadoop Installation¶
At runtime, when a query involves a bulk load to Impala, Virtual DataPort uploads the data in chunks and in parallel, instead of generating a single file with all the data and transferring it to Impala only once it is completely written to disk.
Let us say that you execute a query to load the cache of a view and that the cache database is Impala. The process of loading the cache is the following:
The VDP server connects to Impala using the credentials of the JDBC data source and executes a CREATE TABLE statement. It specifies the URI configured in the Read & Write tab as the location of the data. Note that the user account of the data source needs to have access to this location.
As soon as it obtains the first rows of the result, it starts writing them to a data file. This file is written to disk in Parquet format.
Once the Server writes 5,000,000 rows into the file, it closes the file and begins sending it to the location defined in the Impala data source. Simultaneously, it writes the next rows into another file.
To do this, it executes the command

hadoop fs -put ...

locally, on the host where the Virtual DataPort server runs. Depending on the configuration of Impala, sometimes you need to export the environment variable

HADOOP_USER_NAME

with the user name of the data source so that the next step of the process can be completed. To execute this command, Virtual DataPort does not use the credentials of the data source; it simply runs the command. Therefore, the HDFS system has to be configured to allow connections from this host, for example by setting up an SSH key on the host where Virtual DataPort runs, or by allowing connections through Kerberos.
5,000,000 is the default value of the field Batch insert size (rows) of the Read & Write tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.
When the Server finishes writing another data file (i.e., the file reaches the batch insert size), it begins to transfer it to Impala even if the files generated previously have not been completely transferred. By default, the Server transfers up to 10 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '<new limit per query>';
Once all the files are uploaded, the cache table contains the cached data, because the location indicated in the CREATE TABLE statement contains the uploaded data files.
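As an illustration of the external-installation path, the following Python sketch assembles the command line and environment for a single file transfer via the Hadoop client. The file path, destination URI, and user name are hypothetical examples; in a real setting the resulting command and environment would be handed to something like `subprocess.run`, which is deliberately not done here.

```python
import os

def build_hadoop_put(local_file, dest_uri, hadoop_user=None):
    """Build the command line and environment that would be used to
    push one data file to the table location with 'hadoop fs -put'."""
    cmd = ["hadoop", "fs", "-put", local_file, dest_uri]
    env = dict(os.environ)
    if hadoop_user:
        # HADOOP_USER_NAME makes the Hadoop client act as the
        # data source's user, matching the table's permissions.
        env["HADOOP_USER_NAME"] = hadoop_user
    return cmd, env

# Hypothetical values, for illustration only.
cmd, env = build_hadoop_put("/tmp/chunk0.parquet",
                            "hdfs://namenode/user/impala/cache",
                            hadoop_user="impala_ds_user")
```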