Amazon Athena

Follow these steps to configure an Amazon Athena data source to perform bulk data loads:

  1. When Virtual DataPort runs on Windows, this feature requires the Microsoft Visual C++ 2010 Redistributable Package (use this version, not a more modern one). If you are unsure whether it is installed, download and run the installer anyway; it will detect an existing installation.

  2. Log in to Virtual DataPort, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and provide this information:

    • Authentication type: There are two ways to configure the AWS credentials: specifying the access key id or using the Denodo AWS instance credentials.

    • Obtain credentials from password vault. Select this if you want the data source to obtain the AWS keys from the Credentials Vault.

    • Account name. If you select Obtain credentials from password vault, enter the account in the vault that contains the credentials.

    • AWS access key id and AWS secret access key: Keys to access the AWS services.

    • AWS IAM role ARN: The Amazon Resource Name of the IAM role to be assumed by the user. This parameter is optional. The role grants the necessary privileges to manage the AWS S3 bucket. The role ARN format is arn:aws:iam::<awsAccountId>:role/<roleName>.

    • S3 Location: The location in S3 to which the files are uploaded. The format of this location is:

      s3://<S3 bucket name>/[<folder>]
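As an illustration, the expected shape of this value could be checked like this (a hypothetical helper, not part of Denodo; the bucket-name rules are simplified, see the AWS documentation for the full constraints):

```python
import re

# Simplified pattern for s3://<bucket>/[<folder>]: lowercase letters,
# digits, dots and hyphens in the bucket name; optional folder path.
S3_LOCATION = re.compile(
    r"^s3://(?P<bucket>[a-z0-9.-]{3,63})(?:/(?P<folder>.+?))?/?$"
)

def parse_s3_location(location):
    """Return (bucket, folder) for a valid S3 location, else raise."""
    m = S3_LOCATION.match(location)
    if not m:
        raise ValueError(f"not a valid S3 location: {location!r}")
    return m.group("bucket"), m.group("folder")
```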
      

When the Denodo server uploads data to S3, it uploads the data in chunks and in parallel, instead of generating a single file with all the data and transferring it to Athena only once it is completely written to disk.

Let us say that you execute a query to load the cache of a view and that the cache database is Athena. The process of loading the cache is the following:

  1. The Virtual DataPort server creates the table in Athena.

  2. As soon as it obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.

  3. Once the Server writes 500,000 rows into the file, it closes the file and begins sending it to Athena (to the S3 Location defined in the Athena data source). Simultaneously, it writes the next rows into another file.

    500,000 is the default value of the field “Batch insert size (rows)” of the “Read & Write” tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.

  4. When the Server finishes writing another data file (i.e. the file reaches the batch insert size), it begins to transfer it to Athena even if the previously generated files have not been completely transferred. By default, the Server transfers up to 4 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.

    This limit can be changed by executing the following command from the VQL Shell:

    SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.athena.maxS3ParallelUploads' = '<new limit per query>';
    
  5. Once all the files are uploaded, the cache table contains the cached data, because the S3 Location indicated in the creation statement of the table contains the uploaded data files.
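The batch-and-upload scheme in the steps above can be sketched as a small producer/consumer loop. This is a minimal illustration, not Denodo's actual implementation: `upload_to_s3` is a hypothetical stand-in for the real Parquet transfer, and the constants mirror the defaults mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_INSERT_SIZE = 500_000    # default "Batch insert size (rows)"
MAX_PARALLEL_UPLOADS = 4       # default concurrent transfers per query

def upload_to_s3(path):
    # Hypothetical stand-in for transferring one Parquet file to the
    # S3 Location configured in the Athena data source.
    return path

def load_cache(rows, batch_size=BATCH_INSERT_SIZE,
               max_parallel=MAX_PARALLEL_UPLOADS):
    """Write rows into files of `batch_size` rows and start transferring
    each finished file while the next one is still being written."""
    futures, batch, file_no = [], [], 0
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        for row in rows:
            batch.append(row)            # "write" the row to the current file
            if len(batch) == batch_size:
                # close the file and transfer it in the background
                futures.append(pool.submit(upload_to_s3,
                                           f"part-{file_no}.parquet"))
                batch, file_no = [], file_no + 1
        if batch:                        # flush the last, smaller file
            futures.append(pool.submit(upload_to_s3,
                                       f"part-{file_no}.parquet"))
        return [f.result() for f in futures]
```

Note that a `ThreadPoolExecutor` queues submissions beyond `max_workers` rather than pausing transfers, so the limit behaves slightly differently from the Server's, but the overlap of writing and uploading is the same idea.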

Configuring the S3 Bucket AWS Region

Virtual DataPort uses the AWS region “us-west-2” by default.

If the S3 bucket needs to be accessed from any other AWS region (due to firewall rules, DNS configuration, etc.), you can configure it with one of the following options:

  1. With the configuration parameter com.denodo.vdb.util.tablemanagement.aws.s3.region.<database_name>.<datasource_name>, set the required AWS region. <database_name> and <datasource_name> are optional:

    • To change the AWS region of a specific data source, execute this:

      -- Changing the AWS region of the S3 bucket for the data source "ds_jdbc_aws" of the database "customer360"
      SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region.customer360.ds_jdbc_aws' = 'us-east-1';
      
    • To change the AWS region of the S3 buckets of Amazon Athena and Amazon Redshift used by all the data sources of a database, execute this:

      -- Changing the AWS region of the S3 bucket for the data sources of the database "customer360"
      SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region.customer360' = 'us-east-1';
      
    • To change the AWS region of the S3 buckets of Amazon Athena and Amazon Redshift used by the data sources of all the databases, execute this:

      SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region' = 'us-east-2';
      

    To remove any of these properties (i.e. go back to the default value), set the property to NULL. For example:

    SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region' = NULL;
    
  2. If none of these configuration parameters are defined, Denodo will try to retrieve the AWS region from:

    • The environment variable AWS_REGION.

    • The JVM option aws.region.

    • The shared AWS config file.

    • The EC2 Metadata service if the Virtual DataPort server is running on EC2.
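This fallback chain can be sketched as follows. A minimal illustration only, assuming the sources are consulted in the order listed above; the function name and parameters are hypothetical, not a Denodo API, and "us-west-2" is the default region mentioned earlier.

```python
import os

def resolve_aws_region(vdp_property=None, jvm_option=None,
                       config_file_region=None, ec2_metadata_region=None):
    """Illustrative resolution order for the S3 bucket AWS region."""
    # 1. Explicit Virtual DataPort configuration parameter
    if vdp_property:
        return vdp_property
    # 2. Environment variable AWS_REGION
    if os.environ.get("AWS_REGION"):
        return os.environ["AWS_REGION"]
    # 3. JVM option aws.region (passed in here for illustration)
    if jvm_option:
        return jvm_option
    # 4. Shared AWS config file
    if config_file_region:
        return config_file_region
    # 5. EC2 Metadata service, when running on EC2
    if ec2_metadata_region:
        return ec2_metadata_region
    # Default region used by Virtual DataPort
    return "us-west-2"
```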

The changes to the parameter com.denodo.vdb.util.tablemanagement.aws.s3.region (and the same parameter specific to a data source or database) are applied immediately; you do not need to restart Virtual DataPort.