Metadata Discovery Tool¶
The Metadata Discovery Tool scans the distributed storage, starting from a folder given by the user, and for each Parquet dataset it finds, in a folder or subfolder, creates the corresponding Hive table in the Denodo Embedded MPP and the corresponding base view in Denodo VDP.
The Metadata Discovery Tool can create Hive tables from partitioned datasets that follow the Hive naming convention for folders, i.e.: columnName=value, elb_name=abc
.
.../weblogs/elb_name=abc/elb_response_code=200/…
.../weblogs/elb_name=abc/elb_response_code=404/…
.../weblogs/elb_name=xyz/elb_response_code=200/…
The Metadata Discovery Tool invokes sync_partition_metadata
automatically, after each partition table creation.
The tool is distributed as a Denodo stored procedure and as a script.
Warning
The Metadata Discovery Tool is designed to be used with a standard JDBC data source connected to the Denodo Embedded MPP.
Does not work with the Denodo Embedded MPP data source, throwing the error:
It is not allowed to create views from the selected catalog of the embedded MPP data source.
Stored Procedure DISCOVER_PRESTO_METADATA¶
The DISCOVER_PRESTO_METADATA
stored procedure is distributed in the folder metadata-discovery/sp/denodo-presto-k8scluster-{version}-jar-with-dependencies.jar
.
In order to use this store procedure in Denodo, first, you have to import the file denodo-presto-k8scluster-{version}-jar-with-dependencies.jar
using the menu option
File > Extension management
in VDP Administration.
Secondly, you have to create the stored procedure by clicking the menu option File > New > Stored procedure
in the VDP Administration, and provide the following values:
Name: a name
Class name:
com.denodo.connect.presto.MetadataDiscoveryStoredProcedure
Select Jars:
denodo-discover-presto-metadata
Finally, you can execute the stored procedure:
Using the Execute button in the dialog that displays the schema of the stored procedure. Denodo will display a dialog where you have to enter the input values.
Using the VQL Shell:
SELECT * FROM discoverprestometadata() WHERE dir_to_explore='s3a://bucket/dir/' and credentials_path='/absolute/path/to/core-site.xml' and datasource_name='the_presto_datasource' and duplicate_strategy='NEVER_REPLACE' and included_pattern=NULL and baseview_folder='/01 - Connectivity/02 - Base Views' and baseview_prefix='bv_' and analyze_tables=true
The input parameters required by the stored procedure are:
dir_to_explore
: directory of the distributed storage path (s3a, wasb, abfs
), to start the metadata discovery process.datasource_name
: the name of a standard JDBC data source connected to the Denodo Embedded MPP (for more information see Connect from Denodo section). The base views created by this stored procedure will belong to this data source.
The following input parameters are optional:
credentials_path
: absolute path tocore-site.xml
with the distributed storage credentials.duplicate_strategy
: chooses what to do with the metadata that already exists in the Denodo Embedded MPP and in Denodo VDP. Possible values areNEVER_REPLACE
,REPLACE_IF_NEW_COLUMNS
,ALWAYS_REPLACE
. If no value is provided for the parameterNEVER_REPLACE
is selected.NEVER_REPLACE
: It will not replace the Hive tables in the Embedded MPP and base views at Denodo VDP if they already exist.REPLACE_IF_NEW_COLUMNS
: It will replace the Hive tables in the Embedded MPP, and their corresponding base views in Denodo VDP, if new columns are detected for the existing Hive tables. This may affect existing Denodo base views that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in the views that derive from these base views.ALWAYS_REPLACE
: It will replace the Hive tables in the Embedded MPP, and their corresponding base views in VDP, if any kind of change is detected in the existing Hive tables: new columns, deleted columns, different types of columns, etc. This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.
included_pattern
: a regular expression that will select only certain files or folders in the scanning process. For example:(.*)invoices(.*)
baseview_folder
: all base views created by the stored procedure will be placed in this folder when this parameter is specified. The folder will be created by the stored procedure if necessary.baseview_prefix
: all base views created by the stored procedure will be prefixed with this value when this parameter is specified. When this parameter is not set and a base view with an invalid name is to be created, the prefix"bv"
will be added.analyze_tables
: whether or not to execute theANALYZE
command after each table creation. This command collects statistical information about the data that the Embedded MPP optimizer uses to plan the query, based on cost strategies.
The output of this stored procedure is EXECUTED_COMMAND
, which shows the DDL statements executed successfully in the Denodo Embedded MPP and in Denodo VDP:
the SQL of each tables created in the Embedded MPP
the VQL of each wrapper and base view created in Denodo
Script for Metadata Discovery¶
The metadata discovery script is distributed in the metadata-discovery/bin
folder. A configuration file is required as input:
./denodo-presto-metadata-discovery.sh ../conf/configuration.properties
Configuration File¶
dirToExplore=s3a://example-bucket/dir/
credentials.xml.path=../conf/core-site.xml
#includedPattern=(.*)blahblah(.*)
# Denodo connection properties
denodo.jdbcUrl=jdbc:vdb://localhost:9999/presto_tests
denodo.username=admin
denodo.password=admin
presto.jdbcUrl=jdbc:presto://localhost:8443/hive/default?SSL=true\&SSLTrustStorePath=truststore-path&SSLTrustStorePassword=truststore-pwd
presto.username=presto
presto.password=pr3st%
# Default values
denodo.datasource.name=ds_presto_k8s
denodo.datasource.folder=/01 - Connectivity/01 - Data Sources
denodo.baseview.folder=/01 - Connectivity/02 - Base Views
denodo.baseview.prefix=bv_
presto.analyze.tables=true
# Possible values: NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, REPLACE_ALWAYS
metadata.duplicated.strategy=NEVER_REPLACE
dirToExplore
: directory of the distributed storage path (s3a, adl, wasb, abfs
), to start the metadata discovery process.credentials.xml.path
:core-site.xml
with the distributed storage credentials.includedPattern
: a regular expression that will select only certain files or folders in the scanning process. For example:(.*)invoices(.*)
.Denodo connection properties
: JDBC URI, user and password.Presto Connection properties
: JDBC URI, user and password.denodo.datasource.name
: the name of the standard JDBC data source connected to the Denodo Embedded MPP that will be created in Denodo by this script.denodo.datasource.folder
: the JDBC data source created in Denodo by the script will be placed in this folder when this parameter is specified. The folder will be created by the script if needed.denodo.baseview.folder
: all base views created by the script will be placed in this folder when this parameter is specified. The folder will be created by the script if necessary.denodo.baseview.prefix
: all base views created by the script will be prefixed with this value when this parameter is specified. When this parameter is not set and a base view with an invalid name is to be created, the prefix"bv"
will be added.presto.analyze.tables
: whether or not to execute the ANALYZE command after each table creation. This command collects statistical information about the data that the Denodo Embedded MPP optimizer uses to plan the query, based on cost strategies. Default istrue
.metadata.duplicated.strategy
: chooses what to do with the metadata that already exists in the Denodo Embedded MPP and in Denodo VDP. Possible values areNEVER_REPLACE
,REPLACE_IF_NEW_COLUMNS
,ALWAYS_REPLACE
. If no value is provided for the parameterNEVER_REPLACE
is selected.NEVER_REPLACE
: It will not replace the Hive tables in the Denodo Embedded MPP and data source and base views at Denodo VDP if they already exist.REPLACE_IF_NEW_COLUMNS
: It will replace the Hive tables in the Denodo Embedded MPP, and their corresponding base views in Denodo VDP, if new columns are detected for the existing Hive tables. This may affect existing Denodo base views that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in the views that derive from these base views.ALWAYS_REPLACE
: It will replace the Hive tables in the Denodo Embedded MPP, and their corresponding base views in Denodo VDP, if any kind of change is detected in the existing Hive tables: new columns, deleted columns, different types of columns, etc. This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.
Output of this Script¶
The output of the process would be:
the list of Hive tables created in the Denodo Embedded MPP
the JDBC data source created in Denodo, with the
presto.jdbcUrl
configured as theDatabase URI
of the data source in Denodo.the base views created in Denodo
METADATA DISCOVERY RESULTS
**************************
Commands executed at Denodo Embedded MPP:
-----------------------------------------
CREATE TABLE "customer" (
ANALYZE "customer"
CREATE TABLE "store" (
ANALYZE TABLE "store"
CREATE TABLE "store_sales" (
ANALYZE "store_sales"
CREATE TABLE "date_dim" (
ANALYZE "date_dim"
-----------------------------------------
Commands executed at Denodo VDP (data source):
----------------------------------------------
CREATE FOLDER '/01 - Connectivity';
CREATE FOLDER '/01 - Connectivity/01 - Data Sources';
CREATE DATASOURCE JDBC "ds_presto_k8s"
----------------------------------------------
Commands executed at Denodo VDP (base views):
---------------------------------------------
CREATE OR REPLACE FOLDER '/01 - Connectivity/02 - Base Views';
CREATE OR REPLACE WRAPPER JDBC bv_customer
CREATE OR REPLACE TABLE bv_customer I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_date_dim
CREATE OR REPLACE TABLE bv_date_dim I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_store
CREATE OR REPLACE TABLE bv_store I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_store_sales
CREATE OR REPLACE TABLE bv_store_sales I18N us_pst (