Follow these steps to configure an Amazon Athena data source to perform bulk data loads:
If Virtual DataPort runs on Linux, go to the next step. If it runs on Windows, check if the environment variable
HADOOP_HOMEis defined on this computer. To see the list of environment variables, open a command line and execute
If this environment variable is already defined, copy the content of the directory
If the environment variable is undefined, do this:
Create a directory. For example,
Create a directory called
binwithin the new directory. For example,
Define the environment variable
HADOOP_HOMEto point to
Copy the content of the directory
This is necessary because, for the bulk load process, Denodo generates Apache Parquet files with the data and then, uploads them to Amazon S3. During this process, the Apache libraries that generate these files, when running on Windows, execute
On the administration tool, edit the JDBC data source and click the tab Read & Write. Select Use bulk data load APIs and provide this information:
Authentication type: There are two ways to configure the AWS credentials, specifying the access key id or using the Denodo AWS instance credentials.
AWS access key id and AWS secret access key: Keys to access to the AWS services.
AWS IAM role ARN: The Amazon Resource Name of the IAM role to be assumed by the user. It is an optional parameter. The role will grant the necessaries privileges to manage the AWS S3 bucket. The role ARN format is
S3 Location: location to upload the files to S3. The format for this location is:
S3://<S3 bucket name>/[<folder>]
Server time zone: time zone of the Athena server.
When the Denodo server uploads the data to S3, the data is uploaded in chunks and in parallel instead of generating a single file with all the data and once is completely written to disk, transfer it to Athena.
Let us say that you execute a query to load the cache of a view and that the cache database is Athena. The process of loading the cache is the following:
The VDP server creates in Athena the table.
As soon as it obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.
Once the Server writes 500,000 rows into the file, it closes the file and begins sending it to Athena (to the S3 Location defined in the Athena data source). Simultaneously, it writes the next rows into another file.
500,000 is the default value of the field “Batch insert size (rows)” of the “Read & Write” tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.
When the Server finishes writing another data file (i.e. the files reach the batch insert size), it begins to transfer it to Athena even if the files generated previously have not been completely transferred. By default, the Server transfers up to 4 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker. athena.maxS3ParallelUploads' = '<new limit per query>';
Once all the files are uploaded, the cache table will contain the cached data because the S3 Location indicated in the creation sentence of the table contains the upload data files.