Object Storage Data in Open Table Formats (Parquet, Iceberg and Delta Lake)
Virtual DataPort allows you to connect to object storage graphically, explore data, and create base views for datasets in open table formats, such as Apache Parquet, Apache Iceberg, and Delta Lake.
You can also automate this process by using the DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE stored procedure to create base views for all datasets within a specific object storage location. Additionally, Virtual DataPort supports access to data registered in external catalogs, including AWS Glue Data Catalog, Unity Catalog, Snowflake Open Catalog, and Nessie. Furthermore, Virtual DataPort ensures high-performance access and data combination by leveraging the massive parallel processing (MPP) capabilities of the Denodo Lakehouse Accelerator.
The Denodo Platform includes the Denodo Lakehouse Accelerator, which embeds a Massive Parallel Processing (MPP) engine to improve performance in environments that keep data in object storage. For this purpose, Denodo embeds a customized version of Presto, an open source parallel SQL query engine that excels at accessing data lake content. The Denodo Lakehouse Accelerator cluster can be deployed following the instructions in the Denodo Lakehouse Accelerator Guide. The guide also explains how to create the special data source “embedded_mpp” in Denodo.
Note
This feature requires the Denodo Lakehouse Accelerator, which is only available with the subscription bundle Enterprise Plus. To find out which bundle you have, open the About dialog of Design Studio. For more information, see the section Denodo Platform - Subscription Bundles.
Creating Base Views for Parquet, Iceberg, and Delta Lake Datasets
The following steps assume that the Denodo Lakehouse Accelerator is active and that Virtual DataPort can access it via the predefined data source embedded_mpp.
Among other capabilities, this data source allows you to connect to object storage graphically, explore its data, and create base views for open table format files. To do so, open the data source, navigate to the Read & Write tab and locate the Object storage configuration section.
Currently, Denodo supports the following open table formats: Apache Parquet, Apache Iceberg, and Delta Lake.
Configuration steps
Select the file system you wish to access and provide the required credentials. Then, add the storage routes you want to explore.
File system: Choose the system you want to access. While the graphical interface lists S3, ADLS, and HDFS, Virtual DataPort also supports other compatible services:
S3-Compatible Storage: For services like Huawei Object Storage Service (OBS).
Hadoop API-Compatible Storage: For services like Google Cloud Storage (GCS).
For configuration details on these and other compatible systems, see Support for Hadoop-compatible storage.
Authentication:
The available authentication methods are detailed in Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS.
Kerberos: To access HDFS via Kerberos, follow the steps in How to Connect to a Kerberized HDFS From the Denodo Lakehouse Accelerator Data Source.
Using different credentials for different teams: If different teams require specific privileges for different buckets, configure the base credentials for all users first. Then, create a copy of this data source for each team to define their specific credentials. See How to configure the Lakehouse accelerator to allow different Denodo developers to access different storage routes and catalogs.
Object storage routes: Add the routes to the buckets and folders that contain the datasets you want to access through base views in Virtual DataPort.
Create base views
Once you have provided the necessary credentials and routes, save your changes. You can then create base views using one of the following methods:
Automatically: Use the DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE stored procedure.
Graphically: Browse the object storage and select the desired tables directly through the interface.
To browse graphically, go to ‘Create Base View’, expand the configured routes, and select the datasets to import. Denodo automatically detects the folders that correspond to tables in an open table format.
Note
If Denodo returns an error or times out when accessing an object storage, see the section Troubleshooting problems accessing an object storage.
Each table format is identified with the logo of the corresponding technology.
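As background on how folders in different table formats can be told apart, the following Python sketch classifies local folders by the marker files that the formats themselves conventionally use: a `_delta_log` directory for Delta Lake, a `metadata` directory containing `*.metadata.json` files for Iceberg, and plain `.parquet` files otherwise. This is only an illustration under those assumptions. The function name `detect_table_format` is hypothetical, the sketch works on a local file system rather than an object storage, and it does not describe how Denodo performs its detection.

```python
import tempfile
from pathlib import Path
from typing import Optional


def detect_table_format(folder: Path) -> Optional[str]:
    """Guess the table format of a folder from conventional marker files.

    Hypothetical helper for illustration; not Denodo's detection logic.
    """
    # Delta Lake keeps its transaction log in a "_delta_log" directory.
    if (folder / "_delta_log").is_dir():
        return "delta"
    # Iceberg conventionally keeps table metadata in "metadata/*.metadata.json".
    metadata = folder / "metadata"
    if metadata.is_dir() and any(metadata.glob("*.metadata.json")):
        return "iceberg"
    # Checked last: Delta Lake and Iceberg tables also store Parquet data files.
    if any(folder.rglob("*.parquet")):
        return "parquet"
    return None


# Build a small throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "orders" / "_delta_log").mkdir(parents=True)
    (root / "events" / "metadata").mkdir(parents=True)
    (root / "events" / "metadata" / "v1.metadata.json").touch()
    (root / "logs" / "country=us").mkdir(parents=True)
    (root / "logs" / "country=us" / "part-0.parquet").touch()
    detected = {p.name: detect_table_format(p) for p in sorted(root.iterdir())}

print(detected)
```

The order of the checks matters in this sketch: because Delta Lake and Iceberg tables store their data as Parquet files, the Parquet check only runs when neither of the other markers is present.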
Select the tables you wish to import and click Create Selected. Virtual DataPort will generate the base view and a corresponding table in the Lakehouse Accelerator to facilitate data access. The system creates the table within the Hive, Delta or Iceberg catalog (depending on the table format) and the schema of your choice. You can select a schema from the Target schema dropdown menu located at the bottom of the Create Base View dialog.
Note
For scenarios that require creating the Lakehouse Accelerator table in a catalog other than the default one (Hive, Delta or Iceberg), see section How to configure the Lakehouse accelerator to allow different Denodo developers to access different storage routes and catalogs.
Create base views from Parquet tables using Hive-style partitioning
Virtual DataPort will automatically obtain the column type from the Parquet files (as defined in Parquet Logical Types) except for partition columns in tables using Hive-style partitioning. Hive-style partitioning physically separates data in folders (for example, country=us/… or year=2025/month=01/day=26/…). In these cases, Virtual DataPort will infer the column type from the folder name. It is possible to modify the default mapping for these partition columns using the following properties:
“com.denodo.embeddedmpp.parquetTimestampColumnsToVarcharEnabled”: Timestamp columns will be treated as VARCHAR if it is set to true. Default value: false.
“com.denodo.embeddedmpp.parquetBooleanColumnsToVarcharEnabled”: Boolean columns will be treated as VARCHAR if it is set to true. Default value: false.
“com.denodo.embeddedmpp.parquetLongColumnsToDecimalEnabled”: Long columns will be treated as Decimal if it is set to true. Default value: false.
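To make the Hive-style layout concrete, the following Python sketch extracts partition columns from a path such as year=2025/month=01/day=26/… and guesses a type for each value. It is a minimal illustration, not Denodo's inference logic: the helper names `parse_partitions` and `guess_type`, the example path, and the type names returned are all assumptions chosen for this example.

```python
import re
from datetime import date
from typing import Dict


def parse_partitions(path: str) -> Dict[str, str]:
    """Return partition column -> raw text value for each key=value segment."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:  # keep only segments of the form key=value
            partitions[key] = value
    return partitions


def guess_type(value: str) -> str:
    """Guess a SQL-like type name from a raw partition value (illustrative)."""
    if re.fullmatch(r"-?\d+", value):
        return "BIGINT"
    if value.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        date.fromisoformat(value)
        return "DATE"
    except ValueError:
        return "VARCHAR"


# Hypothetical path; only the key=value folder segments become columns.
path = "sales/year=2025/month=01/day=26/part-0000.parquet"
columns = parse_partitions(path)
types = {name: guess_type(value) for name, value in columns.items()}

print(columns)
print(types)
```

Because a folder name carries only text, any such guess can be ambiguous. In this sketch, the value 01 is classified as an integer, which would drop the leading zero if it were parsed as a number. Ambiguities of this kind are one reason why a configurable type mapping for partition columns can be useful.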
Note
You can create a new schema in the Denodo Lakehouse Accelerator using the stored procedure CREATE_SCHEMA_ON_SOURCE and then click the refresh icon to select the new schema.
Note
If you have base views over partitioned tables, use the procedure REFRESH_EMBEDDED_MPP_TABLES_METADATA to periodically update the partition information.
Accessing data in Delta Lake format
Denodo offers different ways to access data in Delta Lake format:
Using Databricks as a data source.
Using the Denodo Lakehouse Accelerator to read the data in the storage directly.
Using the Denodo Lakehouse Accelerator to access a Unity Catalog.
Note: Options using the Denodo Lakehouse Accelerator are only supported when it is configured to use the Java engine, rather than the Presto on Velox (C++) engine. Let’s examine each option in more detail.
Accessing Delta Lake tables using Databricks
If your Delta Lake tables correspond to tables managed in Databricks, you can create a Databricks data source and generate the corresponding base views. For queries accessing these views, Virtual DataPort delegates processing to Databricks, which then manages the access to the physical data in storage.
Accessing Delta Lake tables via Object Storage with the Lakehouse Accelerator
You can create base views directly from storage as described in previous sections, either by using the DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE or by browsing graphically. Virtual DataPort will create a table within the Denodo Lakehouse Accelerator to access that data. By default, this table is placed in the catalog named “delta” within the Denodo Lakehouse Accelerator. For queries involving these base views, Virtual DataPort delegates the processing to the Denodo Lakehouse Accelerator.
When you promote these views between Denodo environments, the name of the dataset is not considered environment-specific by default. While this is typically appropriate, Delta Lake tables may have different names across different environments. In such cases, you can enable environment-specific values by setting the following property in the Virtual DataPort VQL Shell:
SET 'com.denodo.embeddedmpp.wrapperExternalPathEnvironmentSpecificEnabled' = 'true'
For further details on promotions, refer to section Promoting the Denodo Lakehouse Accelerator.
Accessing Delta Lake tables via Unity Catalog with the Lakehouse Accelerator
Additionally, you can register the Unity Catalog in the Lakehouse Accelerator and import the views from the External Catalogs tab. As with the previous option, Virtual DataPort delegates query processing to the Denodo Lakehouse Accelerator. Please note that this approach has specific limitations, which are described in the Delta Lake section of the Denodo Lakehouse Accelerator Guide.
For further details regarding Delta Lake support, including advanced configurations, please refer to the Delta Lake section of the Denodo Lakehouse Accelerator Guide.
