Amazon Athena

Follow these steps to configure an Amazon Athena data source to perform bulk data loads:

  1. If Virtual DataPort runs on Linux, go to the next step. If it runs on Windows, check if the environment variable HADOOP_HOME is defined on this computer. To see the list of environment variables, open a command line and execute SET.

    If this environment variable is already defined, copy the content of the directory <DENODO_HOME>\dll\vdp\winutils to %HADOOP_HOME%\bin.

    If the environment variable is undefined, do this:

    1. Create a directory. For example, <DENODO_HOME>\hadoop_win_utils.

    2. Create a directory called bin within the new directory. For example, <DENODO_HOME>\hadoop_win_utils\bin.

    3. Define the environment variable HADOOP_HOME to point to <DENODO_HOME>\hadoop_win_utils.

    4. Copy the content of the directory <DENODO_HOME>\dll\vdp\winutils to %HADOOP_HOME%\bin.

    This is necessary because, for the bulk load process, Denodo generates Apache Parquet files with the data and then uploads them to Amazon S3. When running on Windows, the Apache libraries that generate these files execute %HADOOP_HOME%\bin\winutils.exe during this process.
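    The manual steps above can be sketched as a small script. The function name `ensure_hadoop_home` and the paths are taken from the steps, but the script is only an illustration; note that setting `os.environ` affects only the current process, so on a real Windows machine HADOOP_HOME must be defined as a persistent system environment variable:

```python
import os
import pathlib
import shutil

def ensure_hadoop_home(denodo_home: str) -> str:
    """Mirror the manual steps above: make sure HADOOP_HOME is defined
    and that its bin/ directory contains the winutils binaries shipped
    in <DENODO_HOME>/dll/vdp/winutils."""
    winutils_src = pathlib.Path(denodo_home) / "dll" / "vdp" / "winutils"
    hadoop_home = os.environ.get("HADOOP_HOME")
    if not hadoop_home:
        # Variable undefined: create <DENODO_HOME>/hadoop_win_utils,
        # as in steps 1-3 above. NOTE: this only lasts for the current
        # process; the real step defines a persistent environment variable.
        hadoop_home = str(pathlib.Path(denodo_home) / "hadoop_win_utils")
        os.environ["HADOOP_HOME"] = hadoop_home
    bin_dir = pathlib.Path(hadoop_home) / "bin"
    bin_dir.mkdir(parents=True, exist_ok=True)
    # Copy the content of <DENODO_HOME>/dll/vdp/winutils to %HADOOP_HOME%/bin.
    for entry in winutils_src.iterdir():
        shutil.copy2(entry, bin_dir / entry.name)
    return hadoop_home
```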

  2. In the administration tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and provide this information:

    • Authentication type: there are two ways to configure the AWS credentials: specifying the access key ID or using the AWS instance credentials.

    • AWS access key id and AWS secret access key: keys to access the AWS services.

    • AWS IAM role ARN: the Amazon Resource Name of the IAM role to be assumed by the user. This parameter is optional. The role grants the necessary privileges to manage the AWS S3 bucket. The role ARN format is arn:aws:iam::&lt;awsAccountId&gt;:role/&lt;roleName&gt;.

    • S3 Location: the S3 location to which the files are uploaded. The format of this location is:

      s3://<S3 bucket name>/[<folder>]

      For example, s3://my-athena-bucket/denodo-uploads (an illustrative bucket name).

    • Server time zone: time zone of the Athena server.

When the Denodo server uploads the data to S3, it uploads the data in chunks and in parallel, instead of generating a single file with all the data and transferring it to Athena only after it has been completely written to disk.

Let us say that you execute a query to load the cache of a view and that the cache database is Athena. The process of loading the cache is the following:

  1. The VDP server creates the table in Athena.

  2. As soon as it obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.

  3. Once the Server writes 500,000 rows into the file, it closes the file and begins sending it to Athena (to the S3 Location defined in the Athena data source). Simultaneously, it writes the next rows into another file.

    500,000 is the default value of the field “Batch insert size (rows)” of the “Read & Write” tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.

  4. When the Server finishes writing another data file (i.e., the file reaches the batch insert size), it begins to transfer it to Athena even if the files generated previously have not been completely transferred. By default, the Server transfers up to 4 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.

    This limit can be changed by executing the following command from the VQL Shell:

    SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.athena.maxS3ParallelUploads' = '<new limit per query>';

  5. Once all the files are uploaded, the cache table contains the cached data, because the S3 Location indicated in the table creation statement contains the uploaded data files.
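The chunked, parallel upload described in steps 2–4 can be sketched as a producer that fills fixed-size data files and a bounded pool of concurrent uploaders. The names `bulk_load` and `upload` are illustrative, not Denodo APIs; `upload` stands in for the Parquet-to-S3 transfer:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_INSERT_SIZE = 500_000   # default of "Batch insert size (rows)" on the Read & Write tab
MAX_PARALLEL_UPLOADS = 4      # default per-query limit of concurrent file transfers

def bulk_load(rows, upload, batch_size=BATCH_INSERT_SIZE,
              max_uploads=MAX_PARALLEL_UPLOADS):
    """Write rows into fixed-size data files and hand each finished file
    to `upload` while the next file is still being written. At most
    `max_uploads` transfers run concurrently; extra finished files wait
    while row writing continues."""
    futures = []
    with ThreadPoolExecutor(max_workers=max_uploads) as pool:
        batch = []
        for row in rows:
            batch.append(row)
            if len(batch) == batch_size:
                # File is full: close it and start transferring it,
                # then keep writing the next rows into a new file.
                futures.append(pool.submit(upload, batch))
                batch = []
        if batch:
            # Last, partially filled file.
            futures.append(pool.submit(upload, batch))
        # All files transferred: the table's S3 Location now holds the data.
        return [f.result() for f in futures]
```

With a batch size of 4 and a limit of 2 concurrent uploads, 10 rows produce three files of 4, 4, and 2 rows, and the third file is written while earlier files may still be in flight.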

Configuring the S3 bucket AWS Region

Virtual DataPort uses the AWS region “us-west-2” by default.

If the S3 bucket needs to be accessed from any other AWS region (due to firewall rules, DNS configuration, etc.), you can configure it with one of the following options:

  1. With the configuration parameter com.denodo.vdb.util.tablemanagement.aws.region.<database_name>.<datasource_name>, set the required AWS region. <database_name> and <datasource_name> are optional:

    • To change the AWS region of a specific data source, execute this:

      -- Changing the AWS region of the S3 bucket for the data source "ds_jdbc_aws" of the database "customer360"
      SET 'com.denodo.vdb.util.tablemanagement.aws.region.customer360.ds_jdbc_aws' = 'us-east-1';

    • To change the AWS region of the S3 buckets of Amazon Athena and Amazon Redshift used by all the data sources of a database, execute this:

      -- Changing the AWS region of the S3 bucket for the data sources of the database "customer360"
      SET 'com.denodo.vdb.util.tablemanagement.aws.region.customer360' = 'us-east-1';

    • To change the AWS region of the S3 buckets of Amazon Athena and Amazon Redshift used by the data sources of all the databases, execute this:

      SET 'com.denodo.vdb.util.tablemanagement.aws.region' = 'us-east-2';

    To remove any of these properties (i.e. go back to the default value), set the property to NULL. For example:

    SET 'com.denodo.vdb.util.tablemanagement.aws.region' = NULL;

  2. If none of these configuration parameters are defined, Denodo will try to retrieve the AWS region from:

    • The environment variable AWS_REGION.

    • The JVM option aws.region.

    • The shared AWS config file.

    • The EC2 Metadata service if the Virtual DataPort server is running on EC2.
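The lookup can be sketched as a first-match scan over those sources, in the order listed above (the function and argument names are illustrative, and each argument stands in for one lookup source, with None meaning "not defined there"):

```python
def resolve_aws_region(vdp_property=None, env=None, jvm_option=None,
                       shared_config=None, ec2_metadata=None):
    """Return the first AWS region found, in the precedence order the
    documentation describes; fall back to Virtual DataPort's default."""
    for source in (vdp_property,   # SET '...aws.region...' configuration parameter
                   env,            # AWS_REGION environment variable
                   jvm_option,     # aws.region JVM option
                   shared_config,  # region from the shared AWS config file
                   ec2_metadata):  # EC2 Metadata service (when running on EC2)
        if source:
            return source
    return "us-west-2"             # Virtual DataPort's default region
```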

The changes to these parameters (whether global or specific to a data source or a database) are applied immediately; you do not need to restart Virtual DataPort.
