Amazon Athena

Follow these steps to configure an Amazon Athena data source to perform bulk data loads:

  1. When Virtual DataPort runs on Windows, this feature requires the Microsoft Visual C++ 2010 Redistributable Package (or a newer version). If you are unsure whether it is installed, download and run the installer anyway; it will detect whether the package is already present.

  2. Log in to Virtual DataPort, edit the JDBC data source and click the tab Read & Write. Select Use bulk data load APIs and provide this information:

    • Authentication type: There are two ways to configure the AWS credentials: specifying the access key ID, or using the credentials of the AWS instance where the Denodo server runs.

    • Obtain credentials from password vault. Select this if you want the data source to obtain the AWS keys from the Credentials Vault.

    • Account name. If you select Obtain credentials from password vault, enter the account in the vault that contains the credentials.

    • AWS access key id and AWS secret access key: Keys to access the AWS services.

    • AWS IAM role ARN: The Amazon Resource Name of the IAM role to be assumed by the user. This parameter is optional. The role must grant the privileges necessary to manage the AWS S3 bucket. The role ARN format is arn:aws:iam::<awsAccountId>:role/<roleName>.

    • S3 Location: The location in S3 to which the files are uploaded. The format for this location is:

      s3://<S3 bucket name>/[<folder>]
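
For illustration, a hypothetical filled-in configuration could look like the following (the account id, role name, bucket, and folder are invented):

      AWS IAM role ARN: arn:aws:iam::123456789012:role/athena-bulk-load
      S3 Location:      s3://my-athena-bucket/denodo-staging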

When the Denodo server uploads the data to S3, it uploads the data in chunks and in parallel, instead of generating a single file with all the data and, once it is completely written to disk, transferring it to Athena.

Let us say that you execute a query to load the cache of a view and that the cache database is Athena. The process of loading the cache is the following:

  1. The VDP server creates the table in Athena.

  2. As soon as it obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.

  3. Once the Server writes 500,000 rows into the file, it closes the file and begins sending it to Athena (to the S3 Location defined in the Athena data source). Simultaneously, it writes the next rows into another file.

    500,000 is the default value of the field “Batch insert size (rows)” of the “Read & Write” tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.

  4. When the Server finishes writing another data file (i.e., the file reaches the batch insert size), it begins to transfer it to Athena, even if the previously generated files have not been completely transferred. By default, the Server transfers up to 4 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.

    This limit can be changed by executing the following command from the VQL Shell:

    SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.athena.maxS3ParallelUploads' = '<new limit per query>';

  5. Once all the files are uploaded, the cache table contains the cached data, because the S3 Location indicated in the table creation statement contains the uploaded data files.
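
The batching and parallel-upload behavior described in these steps can be sketched in Python. This is a simplified model with invented names and a tiny batch size; the real server writes Parquet files and transfers them to the configured S3 Location:

```python
# Sketch of the cache-load pipeline (hypothetical; not Denodo's actual code).
from concurrent.futures import ThreadPoolExecutor

BATCH_INSERT_SIZE = 3      # stands in for the default of 500,000 rows
MAX_PARALLEL_UPLOADS = 4   # default value of athena.maxS3ParallelUploads

uploaded = []

def upload_to_s3(filename, rows):
    # In the real server, this transfers a closed Parquet file to S3.
    uploaded.append((filename, len(rows)))

def load_cache(result_rows):
    pending = []
    # The pool caps how many files are transferred concurrently per query.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_UPLOADS) as pool:
        batch, file_no = [], 0
        for row in result_rows:
            batch.append(row)
            if len(batch) == BATCH_INSERT_SIZE:
                # The file is full: close it and start transferring it while
                # the next rows are written into a new file.
                pending.append(pool.submit(upload_to_s3, f"part-{file_no}.parquet", batch))
                batch, file_no = [], file_no + 1
        if batch:
            # Final, possibly smaller, file with the remaining rows.
            pending.append(pool.submit(upload_to_s3, f"part-{file_no}.parquet", batch))
    # Leaving the 'with' block waits for all uploads to finish; the Athena
    # table then sees the data because its location points at these files.
    return len(pending)

load_cache(range(10))  # 10 rows with batch size 3 -> 4 files
```

The bounded thread pool mirrors the per-query transfer limit: submissions beyond the limit queue up, so row writing continues while no more than the configured number of files are in flight.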