Object Storage Data in Open Table Formats
Note
This feature is only available with the subscription bundle Enterprise Plus. To find out which bundle you have, open the About dialog of Design Studio. See more about this in the section Denodo Platform - Subscription Bundles.
Denodo includes embedded Massively Parallel Processing (MPP) capabilities to improve performance in environments that contain data in an object storage. For this purpose, Denodo embeds a customized version of Presto, an open source parallel SQL query engine that excels at accessing data lake content. You can deploy the Denodo Embedded MPP cluster by following the instructions in the Embedded MPP Guide, which explains how to create a special data source in Denodo called “embedded_mpp”.
Among other things, this special data source allows you to connect to an object storage graphically, explore its data and create base views on top of files in open table formats. To do so, open the data source, click the Read & Write tab and configure the Object storage configuration section.
Currently, Denodo supports the following open table formats:
Parquet, including files that use Hive-style partitioning
Delta
Iceberg
UniForm
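To illustrate what Hive-style partitioning means for Parquet files, the following minimal Python sketch extracts the partition columns encoded as key=value folder names in an object path. The function name and the example path are hypothetical and are not part of Denodo; the sketch only shows the naming convention, not how the Embedded MPP reads partitions.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(object_path: str) -> dict[str, str]:
    """Return the key=value partition segments found in an object path.

    Illustrative helper only (hypothetical name, not a Denodo API).
    """
    partitions = {}
    # The last path component is the data file itself, so skip it.
    for segment in PurePosixPath(object_path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            partitions[key] = value
    return partitions


# Hypothetical object path inside a bucket.
example = "sales/year=2024/month=03/part-0000.parquet"
print(parse_hive_partitions(example))  # {'year': '2024', 'month': '03'}
```

Each key=value folder level corresponds to one partition column, so a query that filters on those columns only needs to read the matching folders.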
You must select the file system you want to access and provide the credential information. The file systems available graphically are S3, ADLS and HDFS. You can also use other systems, such as Google Cloud Storage and Huawei Object Storage Service (OBS). Google Cloud Storage, for example, is compatible with the Hadoop API. To use one of these systems, select HDFS and provide the necessary Hadoop properties (see section Support for Hadoop-compatible storage). The available authentication methods are the same as those described in section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS. To access an HDFS using Kerberos, follow the instructions in section How to Connect to a Kerberized HDFS From Embedded MPP Data Sources.
Note
If different teams have different privileges on different buckets, you can configure the minimum credentials common to all of them, then create a copy of this data source for each team and configure that team's specific credentials on it.
Then, you can add the routes that you want to explore in that object storage. Once you have saved the necessary credentials and routes, there are two options to create base views that access data in that storage:
Using procedure DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE
Exploring and selecting the desired tables graphically.
For the graphical exploration, click ‘Create Base View’ to explore these routes and select the ones you want to import. Denodo automatically detects the folders that correspond to tables in an open table format.
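As a rough illustration of how a folder can be recognized as a table in one of these formats, the sketch below classifies a folder from the object keys it contains. It is an assumption based on the commonly documented on-disk layouts (a _delta_log folder for Delta, a metadata folder with *.metadata.json files for Iceberg, both for UniForm); it is not Denodo's actual detection logic, and the function name is hypothetical.

```python
from typing import Iterable, Optional


def guess_table_format(relative_keys: Iterable[str]) -> Optional[str]:
    """Guess the open table format of a folder from its object keys.

    Keys are paths relative to the candidate table folder.
    Heuristic sketch only; not how Denodo detects tables.
    """
    keys = list(relative_keys)
    # Delta tables keep their transaction log as JSON files under _delta_log/.
    has_delta_log = any(
        k.startswith("_delta_log/") and k.endswith(".json") for k in keys
    )
    # Iceberg tables keep table metadata files under metadata/.
    has_iceberg_metadata = any(
        k.startswith("metadata/") and k.endswith(".metadata.json") for k in keys
    )
    has_parquet = any(k.endswith(".parquet") for k in keys)

    if has_delta_log and has_iceberg_metadata:
        return "UniForm"  # Delta table that also exposes Iceberg metadata
    if has_delta_log:
        return "Delta"
    if has_iceberg_metadata:
        return "Iceberg"
    if has_parquet:
        return "Parquet"
    return None


# Hypothetical listing of one folder in a bucket.
listing = ["_delta_log/00000000000000000000.json", "part-0000.parquet"]
print(guess_table_format(listing))  # Delta
```

The order of the checks matters: Delta and Iceberg tables also contain Parquet data files, so the plain Parquet case is only considered when no table metadata is present.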
Note
If Denodo returns an error or a timeout when trying to access an object storage, see section Troubleshooting problems accessing an object storage.
Each table format is identified with the logo of the corresponding technology.
Select the tables to import and click ‘Create selected’ to create the base views. For each selected table, Denodo creates the base view and a table in the embedded data source to access the data. Denodo creates the table in the catalog Hive and in the schema of your choice. You can select the schema from the ones available in the dropdown ‘Target schema’ at the bottom of the “Create Base View” dialog (see image above).
Note
You can create a new schema in the Denodo Embedded MPP using the stored procedure CREATE_SCHEMA_ON_SOURCE and then click the refresh icon to select the new schema.
Note
If there are base views over partitioned tables, use the procedure REFRESH_EMBEDDED_MPP_TABLES_METADATA to periodically update the partition information.
