USER MANUALS

Metadata Discovery Tool

The metadata discovery tool scans the distributed storage, starting from a folder given by the user, and for each Parquet dataset it finds, in a folder or subfolder, creates the corresponding Hive table in Presto and the corresponding base view in Denodo.

The metadata discovery tool can create Presto tables from partitioned datasets that follow the Hive naming convention for folders, i.e.: columnName=value, elb_name=abc.

.../weblogs/elb_name=abc/elb_response_code=200/…
.../weblogs/elb_name=abc/elb_response_code=404/…
.../weblogs/elb_name=xyz/elb_response_code=200/…

The metadata discovery tool invokes sync_partition_metadata automatically, after each partition table creation.

The tool is distributed as a Denodo stored procedure and as a script.

Stored Procedure DISCOVER_PRESTO_METADATA

The DISCOVER_PRESTO_METADATA stored procedure is distributed in the folder metadata-discovery/sp/denodo-presto-k8scluster-{version}-jar-with-dependencies.jar.

In order to use this store procedure in Denodo, first, you have to import the file denodo-presto-k8scluster-{version}-jar-with-dependencies.jar using the menu option File > Extension management in VDP Administration.

Secondly, you have to create the stored procedure by clicking the menu option File > New > Stored procedure in the VDP Administration, and provide the following values:

  • Name: a name

  • Class name: com.denodo.connect.presto.MetadataDiscoveryStoredProcedure

  • Select Jars: denodo-discover-presto-metadata

Metadata Discovery Stored Procedure

Metadata Discovery Stored Procedure

Finally, you can execute the stored procedure:

  • Using the Execute button in the dialog that displays the schema of the stored procedure. Denodo will display a dialog where you have to enter the input values.

  • Using the VQL Shell:

    SELECT * FROM discoverprestometadata()
    WHERE dir_to_explore='s3a://bucket/dir/'
      and credentials_path='/absolute/path/to/core-site.xml'
      and datasource_name='the_presto_datasource'
      and duplicate_strategy='NEVER_REPLACE'
      and included_pattern=NULL
      and baseview_folder='/01 - Connectivity/02 - Base Views'
      and baseview_prefix='bv_'
      and analyze_tables=true
    

The input parameters required by the stored procedure are:

  • dir_to_explore: directory of the distributed storage path (s3a, adl, wasb, abfs), to start the metadata discovery process.

  • datasource_name: the name of a Presto data source already created in Denodo (for more information see Connect from Denodo section). The base views created by this stored procedure will belong to this Presto data source.

The following input parameters are optional:

  • credentials_path: absolute path to core-site.xml with the distributed storage credentials.

  • duplicate_strategy: chooses what to do with the metadata that already exists in Presto and Denodo. Possible values are NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, ALWAYS_REPLACE. If no value is provided for the parameter NEVER_REPLACE is selected.

    • NEVER_REPLACE: It will not replace the tables in Presto and base views at Denodo if they already exist.

    • REPLACE_IF_NEW_COLUMNS: It will replace the tables in Presto, and their corresponding base views in Denodo, if new columns are detected for the existing Presto tables. This may affect existing Denodo base views that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in the views that derive from these base views.

    • ALWAYS_REPLACE: It will replace the tables in Presto, and their corresponding base views in Denodo, if any kind of change is detected in the existing tables in Presto: new columns, deleted columns, different types of columns, etc. This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.

  • included_pattern: a regular expression that will select only certain files or folders in the scanning process. For example: (.*)invoices(.*)

  • baseview_folder: all base views created by the stored procedure will be placed in this folder when this parameter is specified. The folder will be created by the stored procedure if necessary.

  • baseview_prefix: all base views created by the stored procedure will be prefixed with this value when this parameter is specified. When this parameter is not set and a base view with an invalid name is to be created, the prefix "bv" will be added.

  • analyze_tables: whether or not to execute the ANALYZE command after each table creation. This command collects statistical information about the data that the Presto optimizer uses to plan the query, based on cost strategies.

The output of this stored procedure is EXECUTED_COMMAND, which shows the DDL statements executed successfully in Presto and Denodo: - the SQL of each tables created in Presto - the VQL of each Presto wrapper and base view created in Denodo

Script for Metadata Discovery

The metadata discovery script is distributed in the metadata-discovery/bin folder. A configuration file is required as input:

./denodo-presto-metadata-discovery.sh ../conf/configuration.properties

Configuration File

dirToExplore=s3a://example-bucket/dir/

credentials.xml.path=../conf/core-site.xml
#includedPattern=(.*)blahblah(.*)

# Denodo connection properties
denodo.jdbcUrl=jdbc:vdb://localhost:9999/presto_tests
denodo.username=admin
denodo.password=admin

presto.jdbcUrl=jdbc:presto://localhost:8443/hive/default?SSL=true\&SSLTrustStorePath=truststore-path&SSLTrustStorePassword=truststore-pwd
presto.username=presto
presto.password=pr3st%

# Default values
denodo.datasource.name=ds_presto_k8s
denodo.datasource.folder=/01 - Connectivity/01 - Data Sources
denodo.baseview.folder=/01 - Connectivity/02 - Base Views
denodo.baseview.prefix=bv_

presto.analyze.tables=true

# Possible values: NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, REPLACE_ALWAYS
metadata.duplicated.strategy=NEVER_REPLACE
  • dirToExplore: directory of the distributed storage path (s3a, adl, wasb, abfs), to start the metadata discovery process.

  • credentials.xml.path: core-site.xml with the distributed storage credentials.

  • includedPattern: a regular expression that will select only certain files or folders in the scanning process. For example: (.*)invoices(.*).

  • Denodo connection properties: JDBC URI, user and password.

  • Presto Connection properties: JDBC URI, user and password.

  • denodo.datasource.name: the name of the Presto data source that will be created in Denodo by this script.

  • denodo.datasource.folder: the Presto data source created in Denodo by the script will be placed in this folder when this parameter is specified. The folder will be created by the script if needed.

  • denodo.baseview.folder: all base views created by the stored procedure will be placed in this folder when this parameter is specified. The folder will be created by the script if necessary.

  • denodo.baseview.prefix: all base views created by the stored procedure will be prefixed with this value when this parameter is specified. When this parameter is not set and a base view with an invalid name is to be created, the prefix "bv" will be added.

  • presto.analyze.tables: whether or not to execute the ANALYZE command after each table creation. This command collects statistical information about the data that the Presto optimizer uses to plan the query, based on cost strategies. Default is true.

  • metadata.duplicated.strategy: chooses what to do with the metadata that already exists in Presto and Denodo. Possible values are NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, ALWAYS_REPLACE. If no value is provided for the parameter NEVER_REPLACE is selected.

    • NEVER_REPLACE: It will not replace the tables in Presto and data source and base views at Denodo if they already exist.

    • REPLACE_IF_NEW_COLUMNS: It will replace the tables in Presto, and their corresponding base views in Denodo, if new columns are detected for the existing Presto tables. This may affect existing Denodo base views that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in the views that derive from these base views.

    • ALWAYS_REPLACE: It will replace the tables in Presto, and their corresponding base views in Denodo, if any kind of change is detected in the existing tables in Presto: new columns, deleted columns, different types of columns, etc. This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.

Output of this Stored Procedure

The output of the process would be:

  • the list of tables created in Presto

  • the Presto data source created in Denodo, with the presto.jdbcUrl configured as the Database URI of the data source in Denodo.

  • the Presto base views created in Denodo

METADATA DISCOVERY RESULTS
**************************

Commands executed at Presto:
----------------------------
CREATE TABLE "customer" (
ANALYZE "customer"
CREATE TABLE "store" (
ANALYZE TABLE "store"
CREATE TABLE "store_sales" (
ANALYZE "store_sales"
CREATE TABLE "date_dim" (
ANALYZE "date_dim"
----------------------------

Commands executed at Denodo (data source):
------------------------------------------
CREATE FOLDER '/01 - Connectivity';
CREATE FOLDER '/01 - Connectivity/01 - Data Sources';
CREATE DATASOURCE JDBC "ds_presto_k8s"
------------------------------------------

Commands executed at Denodo (base views):
-----------------------------------------
CREATE OR REPLACE FOLDER '/01 - Connectivity/02 - Base Views';
CREATE OR REPLACE WRAPPER JDBC bv_customer
CREATE OR REPLACE TABLE bv_customer I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_date_dim
CREATE OR REPLACE TABLE bv_date_dim I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_store
CREATE OR REPLACE TABLE bv_store I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_store_sales
CREATE OR REPLACE TABLE bv_store_sales I18N us_pst (
Add feedback