Object Storage data in Parquet format
This feature is only available with the subscription bundle Enterprise Plus. To find out which bundle you have, open the About dialog of the Design Studio or the Administration Tool. See more about this in the section Denodo Platform - Subscription Bundles.
Since the update 8.0u20230301, Denodo includes embedded parallel processing (MPP) capabilities to improve performance on environments containing data in an object storage. For this purpose, Denodo now embeds a customized version of Presto, which is an open source parallel SQL query engine that excels in accessing data lake content. A Presto cluster can be deployed following the instructions in the Presto cluster on Kubernetes user manual. Versions of that utility newer than 20221018 include a final step that creates a new special data source in Denodo called “embedded_mpp”.
Among other things, this special data source lets you connect to an object storage graphically, explore its data, and create base views on top of Parquet files. To do so, open the data source, click the Read & Write tab, and configure the Object storage configuration section. Select the file system you want to access and provide the credentials. The file systems available graphically are S3 and HDFS. You can also use other systems that are compatible with the Hadoop API, such as Azure Data Lake Storage: select HDFS and provide the necessary Hadoop properties (see section Support for Hadoop-compatible routes). The available authentication methods are the same as for bulk load to PrestoDB and Impala.
If different teams have different privileges on different buckets, you can configure this data source with the minimum credentials common to all of them, and then create a copy of the data source for each team, configuring that team's specific credentials in its copy.
Then, you can add routes that you want to explore from that object storage. Once you have saved the necessary credentials and routes, you can click on ‘Create Base View’ to explore these routes and select the ones you want to import. Denodo automatically detects those folders corresponding to tables in Parquet format (including the ones using the Hive style partitioning).
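As an illustration of what Hive style partitioning looks like on disk, the following minimal Python sketch extracts the partition columns from a file path. It only illustrates the general `key=value` folder convention; it is not how Denodo performs the detection, and the function name and the example path are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


def parse_hive_partitions(relative_path: str) -> Dict[str, str]:
    """Return the key=value partition segments found in a Hive-style path.

    Hypothetical helper for illustration only; not part of Denodo.
    """
    partitions: Dict[str, str] = {}
    # Only the folder segments can carry partition values;
    # the last segment is the Parquet file itself.
    for segment in PurePosixPath(relative_path).parent.parts:
        key, separator, value = segment.partition("=")
        if separator and key:
            partitions[key] = value
    return partitions


# Hypothetical layout of a table folder partitioned by year and month.
example = parse_hive_partitions("sales/year=2023/month=03/part-0000.parquet")
print(example)
```

In this layout, the folder `sales` would correspond to one table, and `year` and `month` would be its partition columns.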
Select the tables to import and click on ‘Create selected’ to create the base view. Denodo creates the base view and, to access the data, a table in the embedded Presto data source. The table is created in the catalog Hive and in the schema of your choice, which you select from the ‘Target schema’ drop-down at the bottom of the “Create Base View” dialog (see image above).
You can create a new schema in the Presto MPP using the stored procedure CREATE_SCHEMA_ON_SOURCE and then click the refresh icon to select the new schema.
Manage Views Created from Parquet Files
Views created from Parquet files in an object storage are different from other views in some aspects:
View statistics: to gather the statistics of the view, use the procedure COMPUTE_SOURCE_TABLE_STATS first.
Source refresh: the Source Refresh option is not currently available, but you can use the procedure REFRESH_SOURCE_TABLE_METADATA to update the partition information.
Inserts: it is not currently possible to insert data into these kinds of views using Denodo.
Export: the VQL of the view includes an id to identify the route defined on the data source for the external object storage. It also includes the relative path from that base route.
Lineage: if the base route of the data changes, or the schema of the files (columns and types) differs from the schema in the view, the view becomes invalid. To make the view valid again, either recreate it from the data source with the new schema or fix the route so that it points to the right files.
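To make concrete what a schema difference in columns and types means, here is a minimal Python sketch that compares two schemas expressed as column-to-type mappings. It is an assumption-laden illustration, not Denodo's validation logic: the function name, the type names, and the example columns are all hypothetical.

```python
from typing import Dict, List


def schema_differences(view_schema: Dict[str, str],
                       file_schema: Dict[str, str]) -> List[str]:
    """List the differences between two {column: type} mappings.

    Hypothetical helper for illustration only; not part of Denodo.
    """
    differences: List[str] = []
    for column, view_type in view_schema.items():
        if column not in file_schema:
            differences.append(f"missing in files: {column}")
        elif file_schema[column] != view_type:
            differences.append(
                f"type changed: {column} ({view_type} -> {file_schema[column]})"
            )
    for column in file_schema:
        if column not in view_schema:
            differences.append(f"new in files: {column}")
    return differences


# Hypothetical schemas: the view was created before the files changed.
view = {"id": "bigint", "amount": "double", "region": "varchar"}
files = {"id": "bigint", "amount": "decimal", "country": "varchar"}
for difference in schema_differences(view, files):
    print(difference)
```

An empty result would mean the columns and types still match; any entry in the list is the kind of change that calls for recreating the view or fixing the route.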