Amazon Athena

To configure an Amazon Athena data source to perform bulk data loads, first install the Hadoop client libraries on the host where the Virtual DataPort server runs. To do so, follow these steps:

  1. Install the Java Development Kit version 8 (JDK) on the host of the Virtual DataPort server. The Hadoop client requires a JDK; a Java Runtime Environment (JRE) is not sufficient.

  2. Set the JAVA_HOME environment variable to point to the path of this JDK. On Windows, it is quite common for this path to contain spaces, but Hadoop does not support them. If the JDK is installed on a path with spaces, set JAVA_HOME to a path without spaces (e.g. C:\Java) and create a symbolic link from that path to the actual JDK folder (see the example commands after this list):

    mklink /d "C:\Java" "C:\Program Files\Java\jdk1.8.0_152"
    

    If you get a privileges error when executing this command, open the Windows menu, search for Command Prompt, right-click it and select Run as administrator.

  3. Denodo uploads the data to S3 as Parquet files, which are generated using the Hadoop libraries. Because of this, on Windows systems it is necessary to define the HADOOP_HOME environment variable, which points to the folder where the files hadoop.dll and winutils.exe are located (see the example commands after this list).

    HADOOP_HOME=<folder where the hadoop.dll and winutils.exe files are located>
    
  4. In the administration tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and provide this information:

    • AWS access key id and AWS secret access key: keys used to access the AWS services.

    • S3 Location: the location in S3 where the data files will be uploaded. The format of this location is:

      S3://<S3 bucket name>/[<folder>]
      
    • Server time zone: time zone of the Athena server.
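
The following commands are a minimal sketch of the Windows environment setup described in steps 2 and 3. The paths C:\Program Files\Java\jdk1.8.0_152, C:\Java and C:\Hadoop are examples only; replace them with the actual locations on your host and run the commands from a Command Prompt opened as administrator. Because setx only affects processes started afterwards, restart the Virtual DataPort server for the new variables to take effect.

    REM Step 2: create a spaces-free link to the JDK and point JAVA_HOME at it
    mklink /d "C:\Java" "C:\Program Files\Java\jdk1.8.0_152"
    setx JAVA_HOME "C:\Java" /M

    REM Step 3: point HADOOP_HOME to the folder that contains hadoop.dll and winutils.exe
    setx HADOOP_HOME "C:\Hadoop" /M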

When the Denodo server uploads the data to S3, it uploads the data in chunks and in parallel, instead of generating a single file with all the data and, once it is completely written to disk, transferring it to Athena.

Let us say that you execute a query to load the cache of a view and that the cache database is Athena. The process of loading the cache is the following:

  1. The VDP server creates the table in Athena.

  2. As soon as it obtains the first rows of the result, it starts writing them to a data file. This file is written to disk in Parquet format.

  3. Once the Server writes 500,000 rows into the file, it closes the file and begins sending it to Athena (to the S3 Location defined in the Athena data source). Simultaneously, it writes the next rows into another file.

    500,000 is the default value of the field “Batch insert size (rows)” on the “Read & Write” tab of the data source, and it can be changed. Note that increasing it will reduce the parallelism of the upload process.

  4. When the Server finishes writing another data file (i.e. the file reaches the batch insert size), it begins to transfer it to Athena even if the previously generated files have not been completely transferred. By default, the Server transfers up to 4 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files until one of the ongoing transfers finishes.

    This limit can be changed by executing the following command from the VQL Shell (a concrete example is shown after this list):

    SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.athena.maxS3ParallelUploads' = '<new limit per query>';
    
  5. Once all the files are uploaded, the cache table will contain the cached data, because the S3 Location indicated in the table's creation statement contains the uploaded data files.
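
For example, to allow 8 concurrent uploads per query (the value 8 is only an illustration; choose a limit that suits your environment), execute the following from the VQL Shell:

    SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.athena.maxS3ParallelUploads' = '8';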