
Specific Information about Cache Databases

This section contains information you have to take into account depending on the database you use to store cached data.

Amazon Athena

When using Athena to store cached data, take the following into account:

  • To use Amazon Athena as the cache database, you have to enable Bulk Data Load to upload the data to S3.

  • In the Read & Write tab, select Use bulk data load APIs and fill in the fields Authentication type, AWS access key id, AWS secret access key, AWS IAM role ARN (optional) and S3 Location. Otherwise, Virtual DataPort will not be able to cache data.

  • The AWS IAM role ARN is the Amazon Resource Name of the IAM role to be assumed by the user. It is an optional field. The role will grant the necessary privileges to manage the AWS S3 bucket. The format of this value is arn:aws:iam::<awsAccountId>:role/<roleName> (see the example after this list).

  • When using Athena as the cache database, you can only enable the full cache mode on views. If you enable partial cache, the data will not be cached.
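
For example, a role ARN following this format looks like the value below (the account id and role name are hypothetical):

    arn:aws:iam::123456789012:role/denodo-cache-role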

Azure SQL Data Warehouse

Using Azure SQL Data Warehouse as the cache database has some limitations:

  • Azure SQL Data Warehouse does not support unique indexes in tables.

  • Data movements and the creation of materialized tables cannot be performed within a transaction.

  • Binaries and complex type columns (arrays and registers) are limited to 8,000 bytes. If this size is exceeded during a cache load process, the process will fail with the error “String or binary data would be truncated”.

  • The only available isolation level is READ UNCOMMITTED.

Make sure that, in the cache configuration, the option mssql-jdbc-7.x is selected in the Driver class path box. Otherwise, some features may not work properly.

Google BigQuery

When using Google BigQuery to store cached data, take the following into account:

  • You can only select HDFS as File System on the Read & Write tab.

  • In the Read & Write tab, select Use bulk data load APIs and fill in the HDFS URI field.

    You also need to set the following Hadoop properties:

    google.cloud.auth.service.account.enable = true
    google.cloud.auth.service.account.json.keyfile = <path of the JSON file that contains your service account key>
    
  • Virtual DataPort transfers data into BigQuery by first uploading it to Google Cloud Storage (GCS) and then loading it from GCS into the BigQuery table. To do this, the Virtual DataPort server connects to Google Cloud Storage using the credentials from the JSON file referenced in the Hadoop properties.

  • This database only supports full cache mode. Any view with partial cache will be ignored and behave as if it had no cache.

  • Only one catalog appears in the drop-down list of the Read & Write tab, even if you need to use a catalog/schema different from the default one for caching. This is because the project name (in BigQuery, the project is the catalog) has to be included in the connection URI when creating the data source, so only that catalog is available for caching (see the URI sketch after this list).
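
As an illustration only, a connection URI for the Simba BigQuery JDBC driver embeds the project (catalog) roughly as shown below; the project name and authentication parameters are hypothetical, so check the documentation of the driver you use for the exact syntax:

    jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-analytics-project;OAuthType=0;OAuthServiceAcctEmail=<service account email>;OAuthPvtKeyPath=<path of the JSON key file>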

Kudu

When using Impala 3.x with Kudu to store cached data, take the following into account:

  • The views that use Kudu as cache must have a primary key.

  • The views that use Kudu as cache have to set their Time to Live to ‘Never expire’. A view with a Time to Live other than ‘Never expire’ cannot use Kudu as its cache data source.

PrestoDB/PrestoSQL

When using PrestoDB/PrestoSQL to store cached data, take the following into account:

  • You can only use it when the data is stored in the Hadoop Distributed File System (HDFS).

  • In the Read & Write tab, select Use bulk data load APIs and fill in the HDFS URI and Hadoop executable location fields. Otherwise, Virtual DataPort will not be able to cache data.

  • Do not select the Partial cache mode on a view when using PrestoDB/PrestoSQL to store cached data; the data will not be cached.

  • Virtual DataPort transfers data into PrestoDB/PrestoSQL using non-managed (external) Hive tables. By using external tables, the URI configured to upload the data files can be different from the location of the schema used for caching or for data movements. Make sure the PrestoDB/PrestoSQL configuration property hive.non-managed-table-writes-enabled is set to true (see the sketch after this list); otherwise, the process of storing data in PrestoDB/PrestoSQL will fail. To restore the behavior of previous versions of Denodo (i.e. not using external tables), execute this:

    SET 'com.denodo.vdb.util.tablemanagement.sql.PrestoTableManager.useExternalTables'='false';
    

    You do not need to restart to apply this change.
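
As a sketch only, in a typical PrestoDB/PrestoSQL deployment the property mentioned above is set in the Hive catalog properties file, usually etc/catalog/hive.properties (the catalog file name may differ in your installation):

    hive.non-managed-table-writes-enabled=true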

Spark

During the process of invalidating the cache, some queries may fail if they are executed while a Parquet file is being modified.

Yellowbrick

Yellowbrick supports two types of encoding for its databases: LATIN9 (default) and UTF-8. Keep in mind that Virtual DataPort can only provide full support for databases that use the UTF-8 encoding. You can select the encoding when creating the database, using encoding = UTF-8 (see the sketch below).
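
As a minimal sketch, assuming Yellowbrick's CREATE DATABASE syntax (the database name is hypothetical and the exact option name may differ, so check the Yellowbrick documentation):

    CREATE DATABASE denodo_cache ENCODING UTF8;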

If a database with LATIN9 encoding is used as a cache data source, characters that cannot be represented in LATIN9 may be lost when caching data from other data sources.

Databases with HDFS Storage

Databases that use HDFS storage (Hive, Impala, Presto, Spark and Databricks) only support full cache mode. Any view with partial cache will be ignored and behave as if it had no cache.

The temporary files are generated using snappy compression. To change the compression of the Parquet files, execute this command from the VQL Shell:

SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.HdfsInsertWorker.parquet.compression'
    = '<compression mechanism>';

The value of “<compression mechanism>” can be off (no compression), snappy, gzip or lzo.
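
For example, to generate the Parquet files with gzip compression:

SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.HdfsInsertWorker.parquet.compression'
    = 'gzip';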

You do not need to restart after changing this property.

The database needs to support the selected compression algorithm.

Generating these files compressed may speed up the process of loading the cache of a view or performing a data movement. However, this may not be the case if the database is in the same local area network as the Denodo server, because the transfer speed is already high: there may be no reduction in time, and the CPU usage of the Denodo server will increase because it has to compress the files.

In addition, these databases only support full cache invalidation.
