Data Lake Storage Management¶
Virtual DataPort supports accessing HDFS, Amazon S3, Azure ADLS Gen2 and other compatible storage for several use cases:

- Access files in formats like delimited files, JSON or XML.
- Access analytical data in Parquet, Delta or Iceberg formats.
- Load data into data sources that use this kind of storage, such as Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP.
For information on how to configure bulk data load into an object storage, see section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS. For information on how to access other compatible storage like Google Cloud Storage, see section Support for Hadoop-compatible storage. The following sections provide more details on the support for different file and table formats stored in an object storage.
CSV, JSON and XML File Formats¶
To access CSV (or other delimited), JSON or XML files in an object storage, follow the instructions in sections Delimited File Sources, JSON Sources and XML Sources, respectively. Select the appropriate data route (HDFS, S3 or Azure ADLS) in order to configure the authentication. See section Support for Hadoop-compatible storage to access a different object storage like Google Cloud Storage.
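As an informal illustration of the kind of files involved (independent of how Virtual DataPort itself reads them), the following sketch uses pandas with the optional s3fs dependency to inspect a delimited file and a JSON Lines file stored in S3; the bucket and object paths are hypothetical.

```python
# Illustrative only: this is not how Virtual DataPort accesses the files,
# just a quick way to inspect delimited and JSON data stored in S3.
# Requires the optional s3fs package so pandas can resolve s3:// URLs.
import pandas as pd

# Delimited (CSV) file; the bucket and key are hypothetical.
orders = pd.read_csv("s3://example-bucket/landing/orders.csv", sep=",")

# JSON file with one document per line (JSON Lines).
events = pd.read_json("s3://example-bucket/landing/events.json", lines=True)

print(orders.dtypes)
print(events.head())
```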
Apache Parquet File Format¶
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Virtual DataPort supports accessing data in Parquet format using the Embedded MPP.
In addition, Virtual DataPort generates data in Parquet format for bulk insertions into databases that use Hadoop-compatible storage, like Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP. See section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS for more information on this topic.
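As background on why the format suits analytical workloads, here is a minimal sketch using the open-source pyarrow library, independent of Denodo; the file name and columns are made up for the example.

```python
# Minimal, illustrative Parquet example using Apache Arrow (pyarrow);
# it is not Denodo code, and the file name and columns are hypothetical.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to a Parquet file with snappy compression.
table = pa.table({
    "customer_id": [1, 2, 3],
    "country": ["ES", "US", "DE"],
    "amount": [10.5, 20.0, 7.25],
})
pq.write_table(table, "sales.parquet", compression="snappy")

# Column pruning: read back only the columns a query needs, which is what
# makes the column-oriented layout efficient for analytical access.
subset = pq.read_table("sales.parquet", columns=["country", "amount"])
print(subset.to_pandas())

# Footer metadata (row groups and per-column statistics) enables data skipping.
print(pq.ParquetFile("sales.parquet").metadata)
```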
Delta Lake Table Format¶
Delta Lake is an open-source table format and the default format in Databricks. It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Denodo supports creating base views to access data in Delta format using the Embedded MPP. It also supports creating remote tables or loading data in Databricks using the Delta format. A small sketch illustrating the format itself follows below.
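To make the "Parquet data files plus a file-based transaction log" idea concrete, here is a small sketch using the open-source deltalake (delta-rs) Python package, independent of Denodo; the table path and columns are hypothetical.

```python
# Illustrative Delta Lake example using the open-source `deltalake` package;
# not Denodo code. The table path and columns are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Each write adds Parquet data files and commits a JSON entry to _delta_log/.
write_deltalake("/tmp/sales_delta", pd.DataFrame({"id": [1, 2], "amount": [10.5, 20.0]}))
write_deltalake("/tmp/sales_delta", pd.DataFrame({"id": [3], "amount": [7.25]}), mode="append")

dt = DeltaTable("/tmp/sales_delta")
print(dt.version())   # latest committed version
print(dt.files())     # active Parquet data files
print(dt.history())   # commit history from the transaction log

# Time travel: read the table as of an earlier version.
print(DeltaTable("/tmp/sales_delta", version=0).to_pandas())
```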
Iceberg Table Format¶
Apache Iceberg is a high-performance table format for large analytic datasets. Iceberg tables support ACID transactions, full schema evolution, partition evolution and table version rollback without the need to rewrite or migrate tables.
The following table describes the features currently supported in Denodo using Iceberg tables. See section Iceberg tables for more details.
| Feature | Supported |
|---|---|
| Select | Yes |
| Insert | Yes |
| Update | No |
| Delete | No |
| Merge | No |
| Create remote table | Yes |
| Create base view from Object Storage | No |
| Drop remote table | Yes |
| Create Summary View | Yes |
To create summary views in Iceberg format, follow the same instructions as for creating remote tables.
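As background on how Iceberg tables are resolved through a catalog rather than by listing files (and independent of how Denodo accesses them), the following sketch uses the open-source pyiceberg library; the catalog configuration and the table identifier are hypothetical.

```python
# Illustrative Iceberg example using the open-source pyiceberg library;
# not Denodo code. Catalog settings and the table name are hypothetical.
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThan

# Iceberg tables are located through a catalog (e.g. a Hive Metastore or
# AWS Glue), which points to the table's current metadata file.
catalog = load_catalog(
    "analytics",
    **{"type": "hive", "uri": "thrift://metastore.example.com:9083"},
)
table = catalog.load_table("sales_db.orders")

# A scan prunes columns and filters rows using table metadata before
# reading the underlying Parquet data files.
arrow_table = table.scan(
    selected_fields=("order_id", "amount"),
    row_filter=GreaterThan("amount", 100.0),
).to_arrow()
print(arrow_table.num_rows)

# The snapshot log is what enables time travel and version rollback.
for entry in table.history():
    print(entry)
```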