Data Lake Storage Management¶
Virtual DataPort supports accessing HDFS, Amazon S3, Azure ADLS Gen2 and other compatible storage systems for several use cases:
Access files in formats like delimited files, JSON or XML.
Access analytical data in Parquet, Delta or Iceberg formats.
Load data into data sources that use this kind of storage, like Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP.
For information on how to configure bulk data load into an object storage, see section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS. For information on how to access other compatible storage like Google Cloud Storage, see section Support for Hadoop-compatible storage. The following sections provide more details on the support for different file and table formats stored in an object storage. The final section, Troubleshooting problems accessing an object storage, provides useful information to troubleshoot errors or timeouts when accessing an object storage.
CSV, JSON and XML File Formats¶
To access CSV (or other delimited), JSON or XML files in an object storage, follow the instructions in sections Delimited File Sources, JSON Sources and XML Sources, respectively. Select the appropriate data route (HDFS, S3 or Azure ADLS) to configure the authentication. See section Support for Hadoop-compatible storage to access a different object storage like Google Cloud Storage.
Apache Parquet File Format¶
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Virtual DataPort supports accessing data in Parquet format using the Embedded MPP.
In addition, Virtual DataPort generates data in Parquet format for bulk insertions in databases using Hadoop-compatible storage like Hive, Impala, Presto, Spark, Databricks or the Denodo Embedded MPP. See section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS for more information on this topic.
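As an illustration, when the bulk load route points to S3, the underlying Hadoop client is typically configured through standard s3a properties like the ones below. These names are standard Hadoop s3a options, not Denodo-specific settings, and the values are placeholders; the actual configuration steps are described in the referenced section:

fs.s3a.access.key=<access-key-id>
fs.s3a.secret.key=<secret-access-key>
fs.s3a.endpoint=s3.us-east-1.amazonaws.com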
Delta Lake Table Format¶
Delta Lake is an open source table format and the default format in Databricks. It extends Parquet data files with a file-based transaction log that provides ACID transactions and scalable metadata handling. Denodo supports creating base views to access data in Delta format using the Embedded MPP. It also supports creating remote tables and loading data into Databricks using the Delta format.
Iceberg Table Format¶
Apache Iceberg is a high-performance table format for large analytic datasets. Iceberg tables support ACID transactions, full schema evolution, partition evolution and table version rollback without the need to rewrite or migrate tables.
The following table describes the features currently supported in Denodo using Iceberg tables. See section Iceberg tables for more details.
| Feature | Supported |
|---|---|
| Select | Yes |
| Insert | Yes |
| Update | No |
| Delete | No |
| Merge | No |
| Create remote table | Yes |
| Drop remote table | Yes |
| Create Summary View | Yes |
To create summary views in Iceberg format, follow the same instructions as for creating remote tables.
Troubleshooting problems accessing an object storage¶
If you are experiencing problems accessing an object storage like S3 or Azure from Denodo, follow these steps to troubleshoot the issue:
Review the network security rules of the storage to verify that Denodo Virtual DataPort can access it.
If you are using SSL/TLS to access the object storage and the certificate is signed by a private authority, or it is self-signed, make sure that it is included in the truststore of the Virtual DataPort servers.
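As a quick check, the following commands are a minimal sketch assuming a Linux host and the JRE bundled with Denodo under <DENODO_HOME>/jre; the endpoint, alias and certificate file are placeholders. The first command inspects the certificate chain presented by the storage endpoint, and the second imports a private CA certificate into the truststore:

openssl s_client -connect mybucket.s3.amazonaws.com:443 -showcerts
keytool -importcert -trustcacerts -alias storage-ca -file ca.pem -keystore <DENODO_HOME>/jre/lib/security/cacerts -storepass changeit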
Review the Virtual DataPort log (<DENODO_HOME>/logs/vdp/vdp.log).
If you are accessing an Azure storage account:
TLS 1.0 and 1.1 support will be removed for new and existing Azure storage accounts starting in November 2024.
The recent addition of TLS 1.3 support in Azure storage accounts may cause SSL/TLS connections to fail, and Denodo could return a timeout because the connection never goes through. In that case, include the following JVM parameters to specify the TLS versions Virtual DataPort should allow, excluding version 1.3. For instance:
-Dhttps.protocols="TLSv1,TLSv1.1,TLSv1.2" -Djdk.tls.client.protocols="TLSv1,TLSv1.1,TLSv1.2"
In any other case, if the log does not provide enough information, execute the following from a VQL Shell of the Design Studio to log more information:
CALL LOGCONTROLLER('com.denodo.vdb.util.hdfs', 'TRACE');
CALL LOGCONTROLLER('org.apache.hadoop.fs.FileSystem', 'DEBUG');
Test the connection to the storage route again.
Restore the log levels to error.
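For instance, the following calls (assuming these loggers were previously at the default error level) revert the change:

CALL LOGCONTROLLER('com.denodo.vdb.util.hdfs', 'ERROR');
CALL LOGCONTROLLER('org.apache.hadoop.fs.FileSystem', 'ERROR');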
Review log <DENODO_HOME>/logs/vdp/vdp.log.
Finally, if none of the previous steps have clarified the issue, debug the SSL connection as follows:
Include the following JVM parameter in the Virtual DataPort servers:
-Djavax.net.debug=all
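If the full output proves too verbose, the standard JSSE debug facility also accepts narrower filters; for example, this setting (a standard Java option, not Denodo-specific) limits the output to handshake messages:

-Djavax.net.debug=ssl:handshake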
Test the connection.
Remove the JVM parameter as it is very verbose.
Review log <DENODO_HOME>/logs/vdp/vdp.log.