Object Storage data in Parquet and Delta Format¶
Note
This feature is only available with the subscription bundle Enterprise Plus. To find out which bundle you have, open the About dialog of the Design Studio or the Administration Tool. See more about this in the section Denodo Platform - Subscription Bundles.
Denodo includes embedded Massively Parallel Processing (MPP) capabilities to improve performance in environments containing data in an object storage. For this purpose, Denodo embeds a customized version of Presto, an open source parallel SQL query engine that excels at accessing data lake content. The Denodo Embedded MPP cluster can be deployed following the instructions in the Embedded MPP Guide. That guide explains how to create a special data source in Denodo called “embedded_mpp”.
Among other things, this special data source allows you to connect to an object storage graphically, explore its data and create base views on top of Parquet files. To do so, open the data source, click the tab Read & Write and configure the section Object storage configuration. You must select the file system you want to access and provide the credential information. The file systems available graphically are S3, ADLS and HDFS. You can use other systems compatible with the Hadoop API, like Google Cloud Storage: to do so, select HDFS and provide the necessary Hadoop properties (see the section Support for Hadoop-compatible storage). The authentication methods available are the same as those described in the section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS.
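For instance, to reach a Google Cloud Storage bucket through the HDFS option you could provide the properties of the GCS Hadoop connector. The following is only a sketch: the key file path is a placeholder, and the exact property set depends on the connector version deployed with your cluster.

fs.gs.impl                                       com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
fs.AbstractFileSystem.gs.impl                    com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
google.cloud.auth.service.account.enable         true
google.cloud.auth.service.account.json.keyfile   /path/to/service-account.json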
Note
If there are different teams with different privileges on different buckets, you can configure this data source with the minimum credentials common to all of them, and then create a copy of the data source for each team, configuring the team-specific credentials on each copy.
Then, you can add the routes of that object storage that you want to explore. Once you have saved the necessary credentials and routes, you can click ‘Create Base View’ to explore these routes and select the ones you want to import. Denodo automatically detects the folders corresponding to tables in Parquet format (including the ones using Hive-style partitioning) and in Delta format.
Each table format is identified with the logo of the corresponding technology.
Select the tables to import and click ‘Create selected’ to create the base views. Denodo will create each base view and a table in the embedded data source to access the data. Denodo will create the table in the catalog Hive and the schema of your choice. You can select the schema from the ones available in the dropdown ‘Target schema’ at the bottom of the ‘Create Base View’ dialog.
Note
You can create a new schema in the Denodo Embedded MPP using the stored procedure CREATE_SCHEMA_ON_SOURCE and then click the refresh icon to select the new schema.
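For instance, a sketch of the call from the VQL Shell could look like the line below. This assumes CREATE_SCHEMA_ON_SOURCE receives the database of the data source, the data source name and the new schema name; ‘admin’, ‘embedded_mpp’ and ‘sales’ are placeholder values, so check the procedure signature in the Stored Procedures section of the VQL Guide before running it.

-- Hypothetical arguments; verify the signature of CREATE_SCHEMA_ON_SOURCE before use.
CALL CREATE_SCHEMA_ON_SOURCE('admin', 'embedded_mpp', 'sales');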
How to Connect to a Kerberized HDFS from Embedded MPP Data Sources¶
The connection to a Kerberized HDFS is slightly different for Embedded MPP data sources than for regular ones. The following configuration applies to both the Object Storage and the Bulk Load configuration. To configure the connection, start by setting the target file system and the target URIs.
Then, some Hadoop properties must be configured:
Name                                        Value
hadoop.security.authorization               true
hadoop.security.authentication              kerberos
fs.hdfs.impl                                org.apache.hadoop.hdfs.DistributedFileSystem
fs.defaultFS                                <HDFS connection URI>
dfs.namenode.kerberos.principal.pattern     hdfs/*@<krb_domain>
dfs.datanode.kerberos.principal             hdfs/<host>@<krb_domain>
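As an illustration, for a hypothetical cluster whose NameNode listens at hdfs://namenode.example.com:8020 in the Kerberos realm EXAMPLE.COM, the values would look like this (all host and realm names below are placeholders):

hadoop.security.authorization               true
hadoop.security.authentication              kerberos
fs.hdfs.impl                                org.apache.hadoop.hdfs.DistributedFileSystem
fs.defaultFS                                hdfs://namenode.example.com:8020
dfs.namenode.kerberos.principal.pattern     hdfs/*@EXAMPLE.COM
dfs.datanode.kerberos.principal             hdfs/datanode.example.com@EXAMPLE.COM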
Three different authentication mechanisms are supported:
Client principal & password: connects to the HDFS with the specified client principal and password. In order to configure it, execute the following commands:
SET 'com.denodo.vdb.util.hdfs.kerberos.client.principal'='<client-principal>@<kerberos-domain>'
SET 'com.denodo.vdb.util.hdfs.kerberos.password.secret'='<client password>'
SET 'com.denodo.vdb.util.hdfs.kerberos.keytab'=NULL
SET 'com.denodo.vdb.util.hdfs.kerberos.passthrough'='false'
Client principal & keytab: connects to the HDFS with the specified client principal and keytab. In order to configure it, execute the following commands:
SET 'com.denodo.vdb.util.hdfs.kerberos.client.principal'='<client-principal>@<kerberos-domain>'
SET 'com.denodo.vdb.util.hdfs.kerberos.password.secret'=NULL
SET 'com.denodo.vdb.util.hdfs.kerberos.keytab'='<path to keytab file>'
SET 'com.denodo.vdb.util.hdfs.kerberos.passthrough'='false'
Session credentials passthrough: connects to the HDFS with the same credentials (client principal and password/keytab) as the ones used to log in to the VDP server. In order to use this mechanism, Kerberos authentication has to be configured in VDP (see Setting-Up the Kerberos Authentication in the Virtual DataPort Server).
SET 'com.denodo.vdb.util.hdfs.kerberos.client.principal'=NULL
SET 'com.denodo.vdb.util.hdfs.kerberos.password.secret'=NULL
SET 'com.denodo.vdb.util.hdfs.kerberos.keytab'=NULL
SET 'com.denodo.vdb.util.hdfs.kerberos.passthrough'='true'
The previous configuration applies to all Embedded MPP data sources. To particularize it for a certain Embedded MPP data source, you can parameterize the given properties with the database name (<db-name>) and the data source name (<ds-name>), as follows:
SET 'com.denodo.vdb.util.hdfs.kerberos.client.principal.<db-name>.<ds-name>'='<client-principal>@<kerberos-domain>'
SET 'com.denodo.vdb.util.hdfs.kerberos.password.<db-name>.<ds-name>.secret'=NULL
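For example, for a hypothetical Embedded MPP data source named embedded_mpp in a database named admin, the client principal & password mechanism would be parameterized like this (the principal, realm and password are placeholders):

SET 'com.denodo.vdb.util.hdfs.kerberos.client.principal.admin.embedded_mpp'='presto-client@EXAMPLE.COM'
SET 'com.denodo.vdb.util.hdfs.kerberos.password.admin.embedded_mpp.secret'='<client password>'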
Note
To be able to use the test bulk load, these properties have to be configured globally (as a fallback).
The Embedded MPP has some limitations when dealing with Kerberos-secured catalogs:
It is not possible to access two different HDFS file systems if one is secured using Kerberos and the other is not. In this scenario, it is not possible to access the HDFS that does not use Kerberos.
It is not possible to create a base view from data in Delta format stored in HDFS.
Kerberos Constrained Delegation is not supported.
From Object Storage vs From MPP Catalogs¶
The Create Base View dialog for the Embedded MPP data source offers two options: From Object Storage and From MPP Catalogs:
From Object Storage is the default option. It allows you to explore the different routes defined on the Read & Write tab and create base views for the desired folders.
From MPP Catalogs is an advanced option that allows you to access catalogs that have been defined on the Denodo Embedded MPP cluster. For example, the system catalog is an internal catalog that provides information and metrics about the currently running cluster.
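As an illustration, the query below is a sketch against Presto’s standard system.runtime.nodes table; the availability of tables and columns may vary with the Presto version embedded in your cluster.

-- Lists each Embedded MPP node, whether it is the coordinator, and its state.
SELECT node_id, http_uri, node_version, coordinator, state
FROM system.runtime.nodes;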
Schema Evolution¶
If the column names and types of the data can change over time, Virtual DataPort will analyze multiple files and create a view containing all the columns found in this analysis. To enable this behavior, select the check box Consider schema evolution in the left corner of the introspection panel before creating the views.
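To make the behavior concrete, suppose a route contains two Parquet files written at different times (the file names and columns below are made up for illustration):

-- sales_2023.parquet columns: (id, amount)
-- sales_2024.parquet columns: (id, amount, currency)
-- With Consider schema evolution enabled and a file limit covering both files,
-- the base view exposes the union of the columns: (id, amount, currency).
-- Rows coming from sales_2023.parquet would typically return NULL for currency.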
When creating views with schema evolution enabled, if any Parquet table is selected, a new panel will be shown before the introspection starts. This panel lets you adjust certain introspection parameters. It shows a table with one row for each selected table and the following columns:
Table: shows the selected table name.
Last modification date: shows the timestamp when the table was last modified (in UTC).
Number of files to read: establishes the maximum number of files (file limit) to analyze when introspecting the table. Increase this value to detect columns that change across more files, or lower it to obtain a narrower schema.
Read after this date (optional): in this field you can specify the time instant (in UTC) from which the files will be analyzed. The files will be analyzed from oldest to newest, starting at the specified instant, until the file limit is reached. If this field is empty, Virtual DataPort will analyze the files from newest to oldest, starting at the Last modification date of the table, until the file limit is reached.
Note
For Delta tables, the schema information is automatically obtained from the Delta logs, so the schema evolution configuration will not apply to them.
Click OK to start the introspection and table creation process, or Cancel to cancel the introspection and return to the previous panel. As introspecting big Parquet directories can be time consuming, the introspection status is reported and the process can be cancelled at any time.
Source Refresh on Views from Object Storage¶
If the base view accesses data in Parquet format from an object storage using the embedded MPP, the Source Refresh has additional options to deal with schema evolution. A new panel will be prompted before executing the source refresh, where you can tune the introspection process.
This panel is analogous to the one presented in the previous section:
Maximum number of files to analyze: establishes the maximum number of files (file limit) to analyze when refreshing the table.
Consider files modified after this date (optional): in this field you can specify the time instant (in UTC) from which the files will be analyzed. The files will be analyzed from oldest to newest, starting at the specified instant, until the file limit is reached. If this field is empty, Virtual DataPort will analyze the files from newest to oldest until the file limit is reached.
Manage Views Created from Object Storage¶
Views created from files in an object storage differ from other views in some aspects:
Inserts: it is not currently possible to insert data into these kinds of views using Denodo.
Export: the VQL of the view includes an id to identify the route defined on the data source for the external object storage. It also includes the relative path from that base route.
Lineage: if the base route of the data changes, or the schema of the files (columns and types) differs from the schema of the view, the view will turn into an invalid state. To make the view valid again, recreate it from the data source with the new schema changes, or fix the route to point to the right files.