Object Storage data in Parquet Format¶
This feature is only available with the subscription bundle Enterprise Plus. To find out the bundle you have, open the About dialog of the Design Studio or the Administration Tool. See more about this in the section Denodo Platform - Subscription Bundles.
Since the update 8.0u20230301, Denodo includes embedded Massive Parallel Processing (MPP) capabilities to improve performance on environments containing data in an object storage. For this purpose, Denodo now embeds a customized version of Presto, which is an open source parallel SQL query engine that excels in accessing data lake content. The Denodo Embedded MPP cluster can be deployed following the instructions in the Presto cluster on Kubernetes user manual. Versions of that utility newer than 20221018 include a final step that creates a new special data source in Denodo called “embedded_mpp”.
Among other things, this special data source allows one to connect an object storage graphically, explore its data and create base views on top of parquet files. To do so, open the data source, click the tab Read & Write and configure the section Object storage configuration. You must select the file system you want to access and provide the credential information. The file systems available graphically are S3 and HDFS. You can use other systems like Azure Data Lake Storage that are compatible with the Hadoop API. To do this, select HDFS and provide the necessary Hadoop properties (see section Support for Hadoop-compatible routes). The authentication methods available are the same described in section Bulk Data Load on a Distributed Object Storage like HDFS or S3.
If there are different teams with different privileges on different buckets, you can configure the minimum credentials for all and then create a copy of this data source for each team configuring the specific credentials on each one.
Then, you can add routes that you want to explore from that object storage. Once you have saved the necessary credentials and routes, you can click on ‘Create Base View’ to explore these routes and select the ones you want to import. Denodo automatically detects those folders corresponding to tables in Parquet format (including the ones using the Hive style partitioning).
Select the tables to import and click on ‘Create selected’ to create the base view. Denodo will create the base view and a table in the embedded data source to access the data. Denodo will create the table in catalog Hive and the schema of your choice. You can select the schema from the ones available from the dropdown ‘Target schema’ at the bottom of the “Create Base View” dialog (see image above).
You can create a new schema in the Denodo Embedded MPP using the stored procedure CREATE_SCHEMA_ON_SOURCE and then click the refresh icon to select the new schema.
From Object Storage vs From MPP Catalogs¶
The Create Base View dialog for the Embedded MPP data source offers two options: From Object Storage and From MPP Catalogs:
From Object Storage is the default option to use. It allows to explore the different routes defined on the Read & Write tab and create base views for the desired folders.
From MPP Catalogs is an advanced option that allows to access catalogs that have been defined on the Denodo Embedded MPP cluster. For example, the system catalog is an internal catalog that provides information and metrics about the currently running cluster.
If the column names and types of the data can change over time, Virtual DataPort will analyze multiple files and create a view containing all the columns found in this analysis. To enable this behavior you must enable the checkbox Consider schema evolution on the left corner of the introspection panel before creating the views.
When creating views with the schema evolution enabled, a new panel will be shown before the introspection starts. This panel will let you decide over certain introspection parameters. A table is shown, with one row for each selected Parquet table. The table format is as follows:
PARQUET table: shows the selected table name.
Last modification date: shows the timestamp when the table was lately modified (in UTC).
Number of files to read: establishes the maximum number of files (file limit) to analyze when introspecting the table. Increase this value to obtain compatibility with more changing columns, or lower it in order to get a more narrower schema.
Read after this date (optional): in this field you can specify the time instant (in UTC time format) from which the files will be analyzed. The files will be analyzed from oldest to newest, starting on the specified instant until the file limit is reached. If this field is empty, Virtual DataPort will analyze the files from newest to oldest until the file limit is reached starting on the Last modification date of the table.
Click on OK to start the introspection and table creation process. Click on Cancel to cancel introspection and return to the previous panel. As introspecting big parquet directories can be time consuming, the introspection status is informed and the process can be cancelled at any time:
Source Refresh on views from Object Storage¶
If the base view access data from Object Storage using the embedded MPP, the Source Refresh has additional options to deal with the Schema Evolution. A new panel will be prompted before executing the source refresh, where you can tune up the introspection process:
This panel is analog to the one presented in the previous section:
Maximum number of files to analyze: establishes the maximum number of files (file limit) to analyze when refreshing the table.
Consider files modified after this date (optional): in this field you can specify the time instant (in UTC time format) from which the files will be analyzed. The files will be analyzed from oldest to newest, starting on the specified instant until the file limit is reached. If this field is empty, Virtual DataPort will analyze the files from newest to oldest until the file limit is reached.
Manage Views Created from Parquet Files¶
Views created from Parquet files in an object storage are different from other views in some aspects:
Inserts: it is not currently possible to insert data into these kinds of views using Denodo.
Export: the VQL of the view includes an id to identify the route defined on the data source for the external object storage. It also includes the relative path from that base route.
Lineage: if the base route of the data changes or the schema of the files (columns and types) differs from the schema in the view, the view will turn into an invalid state. In order for the view to be valid again you can recreate it again from the data source with the new schema changes or fix the route to point to the right files.