Data Lake Storage Management

Virtual DataPort supports accessing HDFS, Amazon S3, Azure ADLS Gen2 and other storage systems compatible with them, covering several use cases:

  • Access files in delimited (such as CSV), JSON or XML formats.

  • Access analytical data in Parquet, Delta Lake or Iceberg formats.

  • Load data into data sources that use this kind of storage, such as Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP.

For information on how to configure bulk data load into an object storage, see section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS. For information on how to access other compatible storage like Google Cloud Storage, see section Support for Hadoop-compatible storage. The following sections provide more details on the supported file and table formats stored in an object storage. The final section, Troubleshooting problems accessing an object storage, provides guidance for diagnosing errors or timeouts when accessing an object storage.

CSV, JSON and XML File Formats

To access CSV (or other delimited), JSON or XML files in an object storage, follow the instructions in sections Delimited File Sources, JSON Sources and XML Sources, respectively. Make sure to select the right data route (HDFS, S3 or Azure ADLS) when configuring the authentication. See section Support for Hadoop-compatible storage to access a different object storage like Google Cloud Storage.

Apache Parquet File Format

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.

Virtual DataPort supports accessing data in Parquet format using the Embedded MPP engine.

In addition, Virtual DataPort generates data in Parquet format for bulk insertions into databases that use Hadoop-compatible storage, such as Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP. See section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS for more information on this topic.

Delta Lake Table Format

Delta Lake is an open-source table format and the default format in Databricks. It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Denodo supports creating base views to access data in Delta Lake format using the Embedded MPP engine. It also supports creating remote tables and loading data into Databricks using the Delta Lake format.

Iceberg Table Format

Apache Iceberg is a high-performance table format for large analytic datasets. Iceberg tables support ACID transactions, full schema evolution, partition evolution and table version rollback without the need to rewrite or migrate tables.

It is possible to read existing Iceberg tables and also create new tables in your data lake using Iceberg format. In both cases the data in those tables is read in parallel using the Embedded MPP engine. The following table describes the features currently supported in Denodo using Iceberg tables. See section Iceberg tables for more details.

Feature                                 Supported
--------------------------------------  ---------
Select                                  Yes
Insert                                  Yes
Bulk Data Load                          Yes
Update                                  No
Delete                                  No
Merge                                   No
Create base view from MPP Catalogs      Yes
Create base view from Object Storage    Yes
Create remote table                     Yes
Drop remote table                       Yes
Create Summary View                     Yes
Cache                                   Yes
Rollback                                Yes

To create summary views in Iceberg format, follow the same instructions as for creating remote tables.
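
As an illustration, a minimal sketch of creating a remote table follows. The data source name embedded_mpp and the source view sales are assumptions for this example; the full set of supported clauses is described in the CREATE REMOTE TABLE reference:

    -- Hypothetical names: embedded_mpp is a data source pointing to the
    -- Embedded MPP and sales is an existing view. Check the CREATE REMOTE
    -- TABLE reference for the full syntax and Iceberg-specific options.
    CREATE REMOTE TABLE sales_iceberg
        DATASOURCE embedded_mpp
        AS SELECT * FROM sales;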

It is recommended to execute some maintenance tasks on a regular basis to clean up unnecessary metadata, which can otherwise degrade performance, especially in tables with frequent updates (a sketch follows the list):

  • Use the stored procedure REMOVE_ICEBERG_VIEW_ORPHAN_FILES to remove the files that are not referenced by any metadata file of an Iceberg view.

  • Use the stored procedure REMOVE_ICEBERG_VIEW_SNAPSHOTS to remove selected snapshots from the metadata of the Iceberg table underlying a view.
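
A minimal sketch of this periodic cleanup, assuming an Iceberg base view named iceberg_sales; the argument lists shown are illustrative, so check the reference of each stored procedure for its actual signature:

    -- Hypothetical arguments: a view name and a cutoff timestamp.
    -- Remove files not referenced by any metadata file of the view.
    CALL REMOVE_ICEBERG_VIEW_ORPHAN_FILES('iceberg_sales');
    -- Remove snapshot metadata older than the given timestamp.
    CALL REMOVE_ICEBERG_VIEW_SNAPSHOTS('iceberg_sales', '2024-01-01 00:00:00');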

Time travel support

Denodo supports time travel in Iceberg tables using the procedure ROLLBACK_ICEBERG_VIEW_TO_SNAPSHOT.

To restore the data from a previous version, you can roll back to a specific point in time or to a specific snapshot. Use the procedure GET_ICEBERG_VIEW_SNAPSHOTS to list the snapshots of the table.
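
For example, a hedged sketch assuming an Iceberg base view iceberg_sales; the snapshot id is a made-up value taken from the output of the first call, and the exact parameters of both procedures are described in their reference documentation:

    -- List the snapshots of the Iceberg table behind the view.
    CALL GET_ICEBERG_VIEW_SNAPSHOTS('iceberg_sales');
    -- Roll the view back to one of the returned snapshot ids
    -- (the id below is hypothetical).
    CALL ROLLBACK_ICEBERG_VIEW_TO_SNAPSHOT('iceberg_sales', 4358109269752407843);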

To see a previous version of a specific view without modifying the current view:

  1. Create a different base view view_at_timestampXXX on top of the same Iceberg data.

  2. Run the rollback procedure on view_at_timestampXXX. This view will then contain the data corresponding to that specific timestamp, so you can query it and compare its data with the current view, as in the sketch below.
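A hedged sketch of this comparison; the view name iceberg_sales and the snapshot id are hypothetical, and the exact signature of ROLLBACK_ICEBERG_VIEW_TO_SNAPSHOT is described in its reference:

    -- Roll back only the auxiliary view; the current view is untouched.
    -- The snapshot id is a hypothetical value returned by
    -- GET_ICEBERG_VIEW_SNAPSHOTS.
    CALL ROLLBACK_ICEBERG_VIEW_TO_SNAPSHOT('view_at_timestampXXX', 4358109269752407843);

    -- Compare both versions, for example by row count.
    SELECT COUNT(*) FROM iceberg_sales;
    SELECT COUNT(*) FROM view_at_timestampXXX;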

Note

This feature requires the Denodo Embedded MPP, which is only available with the Enterprise Plus subscription bundle. To find out which bundle you have, open the About dialog of Design Studio. See more about this in the section Denodo Platform - Subscription Bundles.

Troubleshooting problems accessing an object storage

If you are experiencing problems accessing an object storage like S3 or Azure from Denodo, follow these steps to troubleshoot the issue:

  • Review the network security rules for the storage to verify that Denodo Virtual DataPort can access it.

  • If you are using SSL/TLS to access the object storage and the certificate is signed by a private authority, or it is self-signed, make sure that it is included in the truststore of the Virtual DataPort servers.

  • Review the Virtual DataPort log (<DENODO_HOME>/logs/vdp/vdp.log).

  • If the log does not provide enough information, execute the following from a VQL Shell of Design Studio to log more information:

    CALL LOGCONTROLLER('com.denodo.vdb.util.hdfs', 'TRACE');
    CALL LOGCONTROLLER('org.apache.hadoop.fs.FileSystem', 'DEBUG');
    
  • Test the connection to the storage route again.

  • Restore the log levels to error, for example:
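
    -- Restore the loggers raised above to the error level.
    CALL LOGCONTROLLER('com.denodo.vdb.util.hdfs', 'ERROR');
    CALL LOGCONTROLLER('org.apache.hadoop.fs.FileSystem', 'ERROR');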

  • Review log <DENODO_HOME>/logs/vdp/vdp.log.

  • Finally, if none of the previous steps have clarified the issue, debug the SSL connection itself:

    • Include the following JVM parameter in the Virtual DataPort servers:

    -Djavax.net.debug=all
    
  • Test the connection.

  • Remove the JVM parameter afterwards, as its output is very verbose.

  • Review the log <DENODO_HOME>/logs/vdp/vdp.log again.
