Amazon Redshift
Redshift provides two methods of uploading the data file that Virtual DataPort has generated:
Upload the data file to an S3 bucket and, from there, to the database (recommended method). To use this method, in the Read & Write tab of the data source, do this:
Select Use bulk data load APIs.
Select the Authentication type.
Enter the values of AWS access key id and AWS secret access key, or select Obtain credentials from password vault to obtain the credentials from the credentials vault.
Enter the value of AWS IAM role ARN (optional).
Enter the S3 bucket name.
Leave empty the boxes whose label starts with SSH.
The AWS IAM role ARN (optional field) is the Amazon Resource Name of the IAM role to be assumed by the user. The role will grant the privileges required to manage the AWS S3 bucket. The role ARN format is
arn:aws:iam::<awsAccountId>:role/<roleName>
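For example, a hypothetical role ARN could look like this (the account id and role name below are placeholders, not real values):
arn:aws:iam::123456789012:role/redshift_s3_bulk_load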
Alternatively, grant Redshift access to connect via SSH to a host where it can find the data file. This method can only be used if the host where the data file is stored runs Linux.
To use this method, in addition to providing the same values as in the previous method, enter a value for the boxes below (the ones starting with “SSH”); an illustrative set of values is shown after this list:
SSH endpoint: host to which the Redshift instance will connect to retrieve the data file.
SSH work path: directory where the Redshift instance will locate the data file.
SSH user name: user name that the Redshift instance will use to connect to this host.
SSH public key: public key that the Redshift instance will use to connect to this host. Note that you will have to configure the host to allow connections with this public key, and not with a user name and password.
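For illustration only, a hypothetical set of values for these boxes could be the following (all of them are placeholders, not defaults):
SSH endpoint: etl-host.example.com
SSH work path: /opt/denodo/bulk_load
SSH user name: denodo_bulk
SSH public key: the contents of the public key file generated for this connection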
When the Denodo server is the one that uploads the data to Amazon Redshift, the process is more efficient than with other adapters: the data is uploaded in chunks and in parallel, instead of generating a single file with all the data and, once it is completely written to disk, transferring it to Redshift.
Let us say that you execute a query to load the cache of a view and that the cache database is Redshift (an example of such a query is shown after these steps). The process of loading the cache is the following:
The Server executes the query and begins to obtain the rows it has to store in the cache database; as soon as it obtains the first rows of the result, it starts writing them to the data file. This file is compressed as it is written to disk.
Once the Server writes 500,000 rows into the file, it closes the file and begins sending it to Redshift. Simultaneously, it writes the next rows into another file.
500,000 is the default value of the field “Batch insert size (rows)” of the “Read & Write” tab of the data source and it can be changed. Note that increasing it will reduce the parallelism of the upload process.
When the Server finishes writing another data file (i.e. the file reaches the batch insert size), it begins to transfer it to Redshift even if the files generated previously have not been completely transferred. By default, the Server transfers up to 4 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.redshift.maxS3ParallelUploads' = '<new limit per query>';
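For example, to allow up to 8 concurrent uploads per query (an illustrative value; choose it according to the available bandwidth and the size of the data files):
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.redshift.maxS3ParallelUploads' = '8';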
Once a data file is sent, it is deleted.
Once all the files are uploaded, the Server instructs Redshift to load these files.
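As an illustration, the process above could be triggered with a cache preload query like the following. The view name customer is hypothetical, and the sketch assumes that the cache of that view is enabled and that the cache database is this Redshift data source:
-- Hypothetical cache load query: stores the result in the cache and waits until the load finishes
SELECT * FROM customer CONTEXT ('cache_preload' = 'true', 'cache_wait_for_load' = 'true');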
Note
If you configure the data source to use the second alternative (i.e. Redshift connects to one of your hosts and reads the data files), the explanation above does not apply. Instead, Virtual DataPort begins writing data to disk and, once the query finishes and all the data is written to disk, it instructs Redshift to load these files. This approach is slower than the first one (the recommended one), especially when dealing with hundreds of millions of rows, because Redshift does not begin to retrieve data until the query finishes and all the data is written to the files. With the first approach, Virtual DataPort begins uploading data as soon as the first file is completed (i.e. the file contains the number of rows indicated in the field “Batch insert size”).
Configuring the S3 Bucket AWS Region
Virtual DataPort uses the AWS region “us-west-2” by default.
If the S3 bucket needs to be accessed from any other AWS region (due to firewall rules, DNS configuration,…), you can configure it with one of the following options:
With the configuration parameter com.denodo.vdb.util.tablemanagement.aws.s3.region.<database_name>.<datasource_name>, set the required AWS region. <database_name> and <datasource_name> are optional:
To change the AWS region of a specific data source, execute this:
-- Changing the AWS region of the S3 bucket for the data source "ds_jdbc_aws" of the database "customer360"
SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region.customer360.ds_jdbc_aws' = 'us-east-1';
To change the AWS region of the S3 buckets of Amazon Athena and Amazon Redshift used by all the data sources of a database, execute this:
-- Changing the AWS region of the S3 bucket for the data sources of the database "customer360"
SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region.customer360' = 'us-east-1';
To change the AWS region of the S3 buckets of Amazon Athena and Amazon Redshift used by the data sources of all the databases, execute this:
SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region' = 'us-east-2';
To remove any of these properties (i.e. go back to the default value), set the property to NULL. For example:
SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region' = NULL;
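Likewise, to remove the property defined for a specific data source (using the same hypothetical database and data source names as in the first example), execute this:
SET 'com.denodo.vdb.util.tablemanagement.aws.s3.region.customer360.ds_jdbc_aws' = NULL;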
If none of these configuration parameters are defined, Denodo will try to retrieve the AWS region from:
The environment variable AWS_REGION.
The JVM option aws.region.
The shared AWS config file.
The EC2 Metadata service if the Virtual DataPort server is running on EC2.
The changes to the parameter com.denodo.vdb.util.tablemanagement.aws.s3.region (and to the same parameter specific to a data source or database) are applied immediately; you do not need to restart Virtual DataPort.