Data Lake Storage Management

Virtual DataPort supports accessing HDFS, Amazon S3, Azure ADLS Gen2 and other compatible storage systems for several use cases:

  • Access files in formats such as delimited (CSV), JSON or XML.

  • Access analytical data in Parquet, Delta or Iceberg formats.

  • Load data into data sources that use this kind of storage, such as Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP.

For information on how to configure bulk data load into an object storage, see section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS. For information on how to access other compatible storage like Google Cloud Storage, see section Support for Hadoop-compatible storage. The following sections provide more details on the support for different file and table formats stored in an object storage.

CSV, JSON and XML File Formats

To access CSV (or other delimited) files, JSON files or XML files in an object storage, follow the instructions in sections Delimited File Sources, JSON Sources and XML Sources, respectively. Select the appropriate data route (HDFS, S3 or Azure ADLS) to configure authentication. See section Support for Hadoop-compatible storage to access a different object storage like Google Cloud Storage.

Apache Parquet File Format

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.

Virtual DataPort supports accessing data in Parquet format using the Embedded MPP.

In addition, Virtual DataPort generates data in Parquet format for bulk insertions into databases that use Hadoop-compatible storage, such as Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP. See section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS for more information on this topic.
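The column-oriented layout mentioned above is what makes Parquet efficient for analytical scans. The following toy sketch contrasts row-oriented and column-oriented storage in plain Python; it is not the actual Parquet encoding (which adds row groups, page compression and rich metadata), only an illustration of the idea.

```python
# Toy illustration of row-oriented vs. column-oriented layouts.
# NOT the Parquet format itself; just the concept behind it.

rows = [
    {"id": 1, "city": "Madrid", "amount": 120.0},
    {"id": 2, "city": "Boston", "amount": 75.5},
    {"id": 3, "city": "Madrid", "amount": 42.0},
]

# Row-oriented: one record after another (typical of CSV or JSON files).
row_layout = [tuple(r.values()) for r in rows]

# Column-oriented: all values of each column stored together, so a
# query that touches one column reads only that column's data.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}

# Aggregating a single column scans one contiguous sequence:
total = sum(column_layout["amount"])
```

A query such as `SUM(amount)` only needs the `amount` column, so a columnar file lets the engine skip the other columns entirely.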

Delta Lake Table Format

Delta Lake is an open-source table format and the default format in Databricks. It extends Parquet data files with a file-based transaction log that provides ACID transactions and scalable metadata handling. Denodo supports creating base views to access data in Delta format using the Embedded MPP. It also supports creating remote tables and loading data in Databricks using the Delta format.
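The file-based transaction log described above can be sketched with a toy model: each committed version appends one atomic log entry with "add"/"remove" actions over Parquet data files, and replaying the log up to a given version yields the table state at that version. This mimics the idea behind Delta's transaction log, not its actual protocol or file layout.

```python
import json

# Toy sketch of a file-based transaction log (hypothetical model,
# not Delta Lake's real protocol). Each entry is one committed version.
log = []  # in a real table these would be numbered JSON files on storage

def commit(actions):
    log.append(json.dumps(actions))  # one atomic append per version

def table_files(version=None):
    """Replay the log up to `version` to get the live data files."""
    live = set()
    for entry in log[: None if version is None else version + 1]:
        for action in json.loads(entry):
            if action["op"] == "add":
                live.add(action["file"])
            elif action["op"] == "remove":
                live.discard(action["file"])
    return live

commit([{"op": "add", "file": "part-000.parquet"}])
commit([{"op": "add", "file": "part-001.parquet"}])
# One commit can swap files atomically (e.g. a compaction):
commit([{"op": "remove", "file": "part-000.parquet"},
        {"op": "add", "file": "part-002.parquet"}])

current = table_files()        # state after the last commit
as_of_v1 = table_files(1)      # state at an earlier version
```

Because older log entries are never rewritten, reading the table at an earlier version is just replaying fewer entries, which is the basis of ACID reads and time travel over plain Parquet files.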

Iceberg Table Format

Apache Iceberg is a high-performance table format for large analytic datasets. Iceberg tables support ACID transactions, full schema evolution, partition evolution and table version rollback without the need to rewrite or migrate tables.
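The version-rollback property mentioned above can be sketched with a toy model: every write produces a new immutable snapshot (here, a set of data files), and the table is just a pointer to the current snapshot, so rollback moves the pointer without rewriting any data. This mirrors the idea behind Iceberg's snapshot metadata, not its actual format; the class and method names are hypothetical.

```python
# Toy sketch of snapshot-based rollback (hypothetical model, not
# Iceberg's real metadata layout).

class ToyIcebergTable:
    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0: empty table
        self.current = 0                # pointer to the live snapshot

    def append(self, *files):
        # Each write creates a new immutable snapshot; old ones remain.
        new = self.snapshots[self.current] | set(files)
        self.snapshots.append(frozenset(new))
        self.current = len(self.snapshots) - 1

    def rollback(self, snapshot_id):
        # Metadata-only operation: move the pointer, rewrite nothing.
        self.current = snapshot_id

    def data_files(self):
        return self.snapshots[self.current]

t = ToyIcebergTable()
t.append("a.parquet")   # snapshot 1
t.append("b.parquet")   # snapshot 2
t.rollback(1)           # back to snapshot 1; no files are migrated
```

Because all snapshots are retained, rolling forward again is equally cheap, which is why the table below lists Rollback as a supported operation.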

The following table describes the features currently supported in Denodo using Iceberg tables. See section Iceberg tables for more details.

Feature                                 Supported
------------------------------------    ---------
Select                                  Yes
Insert                                  Yes
Bulk Data Load                          Yes
Update                                  No
Delete                                  No
Merge                                   No
Create base view from MPP Catalogs      Yes
Create base view from Object Storage    No
Create remote table                     Yes
Drop remote table                       Yes
Create Summary View                     Yes
Cache                                   Yes
Rollback                                Yes

To create summary views in Iceberg format, follow the same instructions as for creating remote tables.
