Specific Information about Cache Databases¶
This section contains information you have to take into account depending on the database you use to store cached data.
Amazon Athena¶
When using Athena to store cached data, take the following into account:
To use Amazon Athena as cache database, you have to enable Bulk Data Load to upload the data to S3.
In the tab Read & Write, select Use bulk data load APIs and fill in the fields Authentication type, AWS access key id, AWS secret access key, AWS IAM role ARN (optional) and S3 Location. Otherwise, Virtual DataPort will not be able to cache data.
The AWS IAM role ARN is the Amazon Resource Name of the IAM role to be assumed by the user. It is an optional field. The role will grant the necessary privileges to manage the AWS S3 bucket. The format of this value is:
arn:aws:iam::<awsAccountId>:role/<roleName>
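For illustration only (the account ID and role name below are made-up placeholders), a value following this format would look like:
arn:aws:iam::123456789012:role/denodo-cache-s3-role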
When using Athena as cache database, you can only enable cache mode Full on the views. If you enable partial cache, the data will not be cached.
Azure SQL Data Warehouse¶
Using Azure SQL Data Warehouse as the cache database has some limitations:
Azure SQL Data Warehouse does not support unique indexes in tables.
Data movements and creating materialized tables cannot be performed within a transaction.
Binaries and complex type columns (arrays and registers) are limited to 8,000 bytes. If this size is exceeded during a cache load process, it will end with the error “String or binary data would be truncated”.
The only available isolation level is READ UNCOMMITTED.
Make sure that in the configuration of the cache, in the box Driver class path, the option mssql-jdbc-7.x is selected. Otherwise, some features may not work properly.
Google BigQuery¶
When using Google BigQuery to store cached data, take the following into account:
You can only select HDFS as File System on the Read & Write tab.
In the tab Read & Write, select Use bulk data load APIs and fill in the field HDFS URI.
You also need to set the following Hadoop properties:
google.cloud.auth.service.account.enable = true
google.cloud.auth.service.account.json.keyfile = <path of the JSON file that contains your service account key>
Virtual DataPort transfers data into BigQuery by first uploading it to Google Cloud Storage (GCS) and then loading it from GCS into the BigQuery table. To do this, the Virtual DataPort server connects to Google Cloud Storage using the credentials from the JSON file referenced in the Hadoop properties.
This database supports only full cache mode. Any view with partial cache will be ignored and will behave as if it had no cache.
If you need to use a catalog/schema other than the default one for caching, note that only one catalog appears in the drop-down list of the Read & Write tab. This is because the project name has to be included in the connection URI when creating the data source (in BigQuery, the project is the catalog), so only that catalog is allowed.
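As a rough sketch only (assuming the Simba-based BigQuery JDBC driver; the project ID, service account and key path below are placeholders), a connection URI that fixes the project/catalog might look like:
jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=<projectId>;OAuthType=0;OAuthServiceAcctEmail=<service account email>;OAuthPvtKeyPath=<path of the service account key file>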
Kudu¶
When using Impala 3.x with Kudu to store cached data, take the following into account:
The views that use Kudu as cache must have a primary key.
The views that use Kudu as cache must have the Time to Live set to ‘Never expire’. A view with a Time to Live other than ‘Never expire’ cannot use Kudu as its cache data source.
PrestoDB/Trino¶
When using PrestoDB/Trino to store cached data, take the following into account:
It only supports bulk loads, so it is necessary to select Use bulk data load APIs on the Read & Write tab and specify the storage system, credentials and path for the temporary files. See section Databases Using a Distributed Object Storage.
When using PrestoDB/Trino to store cached data, do not select cache mode Partial on a view: the data will not be cached.
Virtual DataPort transfers data into PrestoDB/Trino using non-managed (external) Hive tables. By using external tables, the URI configured to upload the data files can be different from the location of the schema used for caching or for performing data movements. Make sure the PrestoDB/Trino configuration property hive.non-managed-table-writes-enabled is set to true; otherwise, the process of storing data in PrestoDB/Trino will fail. To restore the behavior of previous versions of Denodo (i.e. not using external tables), execute this:
SET 'com.denodo.vdb.util.tablemanagement.sql.PrestoTableManager.useExternalTables'='false';
You do not need to restart to apply this change.
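For reference, assuming the Hive connector catalog of the PrestoDB/Trino cluster is configured through a catalog properties file (for example etc/catalog/hive.properties; the exact file depends on your deployment), the property would be set like this:
hive.non-managed-table-writes-enabled=true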
Embedded Denodo MPP¶
It only supports full cache mode and requires configuring bulk data loads.
See section Databases Using a Distributed Object Storage.
Spark¶
During the process of invalidating the cache, some queries may fail if they are executed while a Parquet file is changing.
Yellowbrick¶
Yellowbrick supports two types of encoding for its databases: LATIN9 (default) and UTF-8.
Keep in mind that Virtual DataPort can provide full support only for databases using the UTF-8 encoding. You can select the database encoding when creating the database, using encoding = UTF-8.
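As an illustrative sketch only, following the encoding = UTF-8 clause mentioned above (the database name is a placeholder and the exact spelling of the encoding value may vary across Yellowbrick versions), the statement could look like:
CREATE DATABASE <database name> ENCODING = UTF-8;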
If a database with the LATIN9 encoding is used as cache data source, data containing characters that are not valid in LATIN9 may be lost when caching data from other data sources.
Databases Using a Distributed Object Storage¶
For databases using a distributed object storage, the supported systems are S3, ADLS, HDFS and any other storage compatible with the Hadoop API, such as the Azure storage services.
Databases that use Hadoop-compatible object storage (Hive, Impala, Presto, Spark and Databricks) only support full cache mode. Any view with partial cache will be ignored and will behave as if it had no cache.
The temporary files are generated using the Parquet format with snappy compression by default.
See section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS for more details on this configuration.
In addition, these databases only support Full cache invalidation.
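As an example of the kind of settings involved, when the object storage is S3 the bulk load configuration typically relies on Hadoop s3a properties along these lines (the values are placeholders and the exact set of properties depends on your storage and authentication method):
fs.s3a.access.key = <AWS access key id>
fs.s3a.secret.key = <AWS secret access key>
fs.s3a.endpoint = s3.<region>.amazonaws.com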