Specific Information about Cache Databases

This section contains information you have to take into account depending on the database you use to store cached data.

Amazon Athena

When using Athena to store cached data, take the following into account:

  • To use Amazon Athena as cache database, you have to enable Bulk Data Load to upload the data to S3.

  • In the tab, Read & Write, select Use bulk data load APIs and fill the fields Authentication type, AWS access key id, AWS secret access key, AWS IAM role ARN (optional), and the S3 Location. Otherwise, Virtual DataPort will not be able to cache data.

  • The AWS IAM role ARN is the Amazon Resource Name of the IAM role to be assumed by the user. It is an optional field. The role will grant the necessaries privileges to manage the AWS S3 bucket. The format of this value is arn:aws:iam::<awsAccountId>:role/<roleName>.

  • When using Athena as cache database, you can only enable the cache mode full on the views. If you enable partial cache, the data will not be cached.

Azure SQL Data Warehouse

The Azure SQL Data Warehouse data source as cache have some limitations:

  • Azure SQL Data Warehouse does not support unique indexes in tables.

  • Data movements and creating materialized tables cannot be performed within a transaction.

  • Binaries and complex type columns (arrays and registers) are limited to 8,000 bytes. If this size is exceeded during a cache load process, it will end with the error “String or binary data would be truncated”.

  • The only available isolation level is READ UNCOMMITTED.

Make sure that in the configuration of the cache, in the box Driver class path, the option mssql-jdbc-7.x is selected. Otherwise, some features may not work properly.

Presto

When using Presto to store cached data, take the following into account:

  • You can only use it when the data is stored in the Hadoop Distributed File System (HDFS).

  • In the tab, Read & Write, select Use bulk data load APIs and fill the fields HDFS URI and Hadoop executable location. Otherwise, Virtual DataPort will not be able to cache data.

  • When using Presto to store cached data, you must not select the cache mode Partial on a view. The data will not be cached.

  • Virtual DataPort transfers data into Presto using non-managed (external) Hive tables. By using external tables, the URI configured to upload the data files can be different than the location of the schema used for caching or perform data movements. Make sure the configuration property of Presto hive.non-managed-table-writes-enabled is true; otherwise, the process of storing data in Presto will fail. To restore the behavior of previous versions of Denodo (i.e. not using external tables), execute this:

    SET 'com.denodo.vdb.util.tablemanagement.sql.PrestoTableManager.useExternalTables'='false';
    

    You do not need to restart to apply this change.

Spark

During the process of invalidating the cache, some queries may fail if they are executed while a Parquet file is changing.

Yellowbrick

Yellowbrick supports two types of encoding for its databases: LATIN9 (default) and UTF-8. Keep in mind that Virtual DataPort can provide full support only to databases using encoding UTF-8. You can select the database encoding when creating the database using encoding = UTF-8.

If a database with encoding LATIN9 is used as a cache data source, data from invalid characters may be lost when caching from other data sources.

Databases with HDFS Storage

Databases that use HDFS storage (Hive, Impala, Presto, Spark and Databricks) only support full cache mode. Any view with partial cache will be ignored and behave as if it has no cache.

The temporary files are generated using snappy compression. If you want to change the Parquet files compression, then execute this command from the VQL Shell:

SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.HdfsInsertWorker.parquet.compression'
    = '<compression mechanism>';

The value of “<compression mechanism>” can be off (no compression), snappy, gzip or lzo.

You do not need to restart after changing this property.

The database needs to support the selected compression algorithm.

Generating these files compressed may speed up the process loading the cache of a view or a data movement. However, it may not be the case if the database is in the same local area network as the Denodo server because the transfer speed is high. It is possible that there is not any reduction in time and an increase of CPU usage by the Denodo server because it has to generate the files compressed.

Besides, these databases do only support Full Cache invalidation.