Denodo Embedded MPP - User Manual

Download original document


Overview

Since the Update 8.0u20230301, Denodo has embedded its own MPP engine, the Denodo Embedded MPP, based on Presto.

Presto is a high performance, distributed SQL query engine for big data. It was originally created by Facebook to provide self-service analytics on top of their massive data sets, and now is open source.

The main goal of the Denodo Embedded MPP is to provide efficient access to data lake content: Parquet files, Delta Lake tables and Apache Iceberg tables, in an easy way, using only SQL.

Also, with the Denodo Embedded MPP there is no need for additional external engines. The use of an MPP engine and Object Storage brings out-of-the-box options for those Denodo capabilities that require storage, such as, caching, query acceleration, remote tables, summaries, etc.

The main steps to use the Denodo Embedded MPP are:

  1. Store your dataset in the Object Storage.

The Denodo Embedded MPP can read data from many distributed storage systems, such as:

  • Hadoop Distributed File System (HDFS)
  • Amazon S3
  • Azure Blob Storage
  • Azure Data Lake Storage Gen1 and Gen2
  • Google Cloud Storage
  • S3-compatible storage

  1. Deploy the Denodo Embedded MPP on Kubernetes, registering as a data source in the Denodo Platform.

  1. Graphically explore your storage from Denodo to introspect the structure of the Parquet files and its data types. Create the corresponding Hive tables in Presto and the base views in Denodo.

  1. Query your data lake data in Denodo using MPP acceleration.

  1. Load data in your data lake using Denodo capabilities like caching, remote tables or summaries.

The Denodo Embedded MPP requires the subscription bundle Enterprise Plus and the Denodo 8.0 Update 8.0u20230301.

And if you have a cluster of Denodo servers it needs to be configured to store its metadata in an external database to take full advantage of the Denodo Embedded MPP functionalities.

Architecture

                                        Denodo Embedded MPP architecture

Denodo Embedded MPP is shipped with an embedded Hive Metastore that acts as a repository of metadata mapping Object Storage (S3/ADLS/GCS/HDFS) Parquet files to Hive tables. This embedded Hive Metastore stores the metadata in an embedded PostgreSQL. Users can also choose to use an alternative external database (PostgreSQL, MySQL, SQL Server or Oracle) to work with the embedded Hive Metastore.

When using the embedded PostgreSQL, the cluster uses a Kubernetes Persistent Volume, to ensure the persistence of the metadata. This way, all the cluster services: Presto, Hive Metastore and PostgreSQL, can be temporarily down and we will not lose any of the table definitions that the Denodo Embedded MPP relies upon.

Kubernetes Infrastructure

Denodo Embedded MPP is designed to run on Kubernetes 1.23+ with:

  • Amazon Elastic Container Service for Kubernetes (EKS)

  • Azure Kubernetes Service (AKS)

  • Red Hat OpenShift

  • Google Kubernetes Engine (GKE)

The main steps for deploying it on a Kubernetes cluster are:

  1. Create the Kubernetes cluster.

It is recommended to create a cluster with N + 2 nodes, with no other applications running on the nodes:

  • One single node for the Embedded MPP coordinator.
  • One single node for the embedded Hive Metastore and its embedded PostgreSQL.
  • N nodes for the Embedded MPP workers: one single node for each MPP worker.

Since the Denodo Embedded MPP requires a certain amount of CPU and memory to process queries, it is also recommended to start with nodes with at least 16 cores and 64GB of memory.

See Sizing recommendations for the Embedded MPP for more details.

  1. Decide how the Denodo Embedded MPP will be exposed to Denodo VDP, based on the Kubernetes service type. The options are:
  • LoadBalancer. It is the option configured by default.
  • ClusterIP and Ingress. For this, an ingress controller, such as the NGINX Ingress Controller, must be installed on the cluster.
  • ClusterIP and Route. This option is only available in OpenShift, since routes are specific to OpenShift.

  1. When using the embedded PostgreSQL make sure the cluster has configured Kubernetes Storage Classes to provision Kubernetes Persistent Volumes. Because in this case, the cluster will use a Persistent Volume to ensure the persistence of metadata.

  1. The Denodo Embedded MPP Helm chart and the container images are available at the Denodo's Harbor Registry and also included in the distribution.

  1. Configure the Denodo Embedded MPP through the Helm chart values.yaml file. See Configuration section below.

  1. Deploy the Denodo Embedded MPP using the cluster.sh script in the distribution. See Deployment section below.

kubectl, Helm v3 and Java are required.

  1. Autoscaling is optional but recommended.

For autoscaling in AWS EKS see Configuring an Autoscaling Denodo Embedded MPP Cluster in EKS.

For autoscaling in Azure AKS see Configuring an Autoscaling Denodo Embedded MPP Cluster in AKS.

Configuration

Before deploying the Denodo Embedded MPP you should configure it through the Helm chart prestocluster/values.yaml file.

Container Registry

image.repository:

Container registry containing the Denodo Embedded MPP images: Presto, Hive Metastore and PostgreSQL.

The Denodo's Harbor Registry is configured by default:

repository: "harbor.open.denodo.com/denodo-connects-8.0/images",

but you can use a private container registry of your own. In this latter case, you will have to pull the Embedded MPP images from the Denodo's Harbor Registry or load them from the distribution, tag them appropriately and push them to your private Container registry.

image.pullSecret:

References the secret with the credentials that will be used to download images from the Container registry. If your Container registry doesn't need one, leave it empty:

pullSecret: ""

You can create the secret like this:

kubectl create secret docker-registry denodo-registry-secret --docker-server=harbor.open.denodo.com --docker-username=<denodo_account_username> --docker-password=<registry_profile_secret>

In case you are using the Denodo's Harbor Registry, you can find your denodo_account_username and registry_profile_secret at the Denodo's Harbor Registry. Open the “User Profile” menu and click on “Generate Secret”. This CLI secret is the registry_profile_secret and the Username is the denodo_account_username.

Once the secret is created you need to include the name denodo-registry-secret in the image.pullSecret key.

There is an alternative to creating the pullSecret, that is to add the Container registry credentials to the image.pullCredentials section and let the Helm chart handle the secret creation itself.

image.pullCredentials:

Section to configure the Container registry credentials, so the container images can be downloaded in the cluster deployment process.

  • enabled: Whether to use the credentials in this section to download the Container registry images.

  • registry: The Container registry. The Denodo's Harbor Registry is configured by default: "harbor.open.denodo.com"
  • username: User to access the Container registry.
  • pwd: Credentials to access the Container registry.

Exposing Denodo Embedded MPP

presto.service.type:

The Kubernetes service type allows users to decide how to expose the Denodo Embedded MPP to the Denodo Platform.

Possible values are:

  • LoadBalancer. It is the option configured by default.

  • ClusterIP. This is usually combined with:
  • Ingress. See Ingress section for details on how to configure an Ingress.

  • Route. This option is only available in OpenShift, since routes are specific to OpenShift. You will have to create the route manually.

  • NodePort

presto.service.annotations:

Presto service annotations. A typical use case for these annotations if the presto.service.type is LoadBalancer, is to configure the service to use a static IP address instead of an ephemeral IP address assigned by the load balancer.

presto.service.loadBalancerIP:

An alternative way to presto.service.annotations to specify a static IP address for services of type LoadBalancer. But note that loadBalancerIP is deprecated since Kubernetes v1.24 and users are encouraged to use implementation-specific annotations (Azure AKS, Amazon EKS, etc.) through presto.service.annotations.

presto.service.loadBalancerSourceRanges:

List of IPs to restrict traffic through the load balancer. Only applies to presto.service.type: LoadBalancer.

Ingress

ingress.enabled:

Whether to use an Ingress to expose HTTP(S) routes from outside the Denodo Embedded MPP. An Ingress is an alternative to creating a dedicated load balancer in front of Kubernetes services, or manually exposing services within a node.

The Ingress is usually combined with ClusterIP as the service type: presto.service.type property. This way, only internal network connections between services within the cluster are allowed, making it more secure.

The Ingress accepts TLS connections and forwards them to the Embedded MPP on the HTTP port 8080:

In order for the Ingress to work, the cluster must have an ingress controller running, such as NGINX Ingress Controller.

ingress.host:

The hostname of the Denodo Embedded MPP.

ingress.classname:

The Ingress class to be used. If the default Ingress class is used, this parameter can be omitted.

ingress.tls:

Whether to secure the external traffic to the Ingress or not. Enabled by default.

  • secretName: the name of an existent secret containing a TLS private key and certificate.

        OR

  • create: if true a secret will be created containing a TLS private key and a certificate. In this case, the file names should be configured in certFile and keyFile and the files must be placed in the prestocluster/presto directory for the secret to be created.

SSL/TLS

The Denodo Embedded MPP is shipped with a Java Keystore, prestocluster/presto/secrets/presto.jks, that contains a self-signed certificate, certs/certificate.crt, which is distributed for testing purposes ONLY.

This self-signed certificate accepts presto-denodo as the hostname of the Denodo Embedded MPP. Therefore, you will  have to add an entry in the hosts file where the Denodo server is running, with presto-denodo and the IP that appears as the EXTERNAL-IP of the Presto service.

It is strongly recommended to use a certificate issued by a CA in production replacing the actual Java Keystore: prestocluster/presto/secrets/presto.jks before deploying the cluster. See how to create a Keystore in Send a Certificate Request to a Certificate Authority (CA) and Create a Keystore with the Reply.

In this case you should maintain the location of the keystore: prestocluster/presto/secrets. But if you do not want to maintain the name, presto.jks, and the password: sslpassphrase, you will have to edit the following properties in the prestocluster/presto/conf/config.properties.coordinator file:

http-server.https.keystore.key=sslpassphrase

http-server.https.keystore.path=/opt/secrets/presto.jks

Also, make sure you are using a certificate with a single element in the SubjectAlternativeName (SAN) field, as shown below:

SubjectAlternativeName [

    DNSName: *.acme.com

]

Finally, take into account that if the certificate used by the Denodo Embedded MPP is signed by a private authority, or it is self-signed, you have to import this certificate into the truststore of the Denodo servers:

.<DENODO_HOME>/jre/bin/keytool -importcert -alias denodo-mpp -file <DENODO_MPP_HOME>/certs/certificate.crt -cacerts -storepass "changeit" -noprompt

Workers, CPU and Memory

presto.numWorkers: 

The number of Workers in the cluster. Make sure that each Worker is placed on one single cluster node, with no other applications running on the nodes.

Since the recommendation is to create a cluster with N+2 nodes: one for the Coordinator, one for the Embedded Hive Metastore and the Embedded PostgreSQL and N Workers,  presto.numWorkers should be N.

See also Sizing recommendations for the Embedded MPP.

presto.cpusPerNode:

The number of cores assigned to each Worker. Since the recommendation is to place each Worker on a cluster node, this should be the number of cores of the node.

The Denodo Embedded MPP requires a certain amount of CPU to process the data for a given query. Therefore, it is recommended to start with nodes with at least 16 cores. And, in general, if you double the CPU in the cluster, keeping the same memory, the query will take half the time.

As a generalization, more CPUs mean shorter queries.

presto.memoryPerNode:

The total memory in GB assigned to each Worker. Since the recommendation is to place each Worker on a cluster node, this should be the total memory of the node. This setting will determine the JVM max memory for the nodes running the Coordinator and the Workers, since

JVM max memory = memoryPerNode * 0.8.

The Denodo Embedded MPP requires a certain amount of memory to process queries with JOIN, GROUP BY, and ORDER BY operations, it is therefore recommended to start with big nodes with at least 64GB of memory.

resources:

  • presto.coordinator.resources: Resources definition for the Coordinator
  • presto.worker.resources: Resources definition for the Workers

Kubernetes schedules pods across nodes based on the resource requests and limits of each container pod. If a container pod requests certain CPU and/or memory values, Kubernetes will only schedule it on a node that can guarantee those resources. Limits, on the other hand, ensure that a container pod never exceeds a certain value.

Note that you cannot set requests that are larger than the resources provided by your nodes.

Note also that resources are commented out, as we leave this setting as a choice for the cluster administrator.

Memory Settings

It is also important to adjust memory settings for query performance, finding a balance between maximum memory per query and the maximum number of concurrent queries that can be run in the Denodo Embedded MPP.

The max amount of user memory a query can use on a worker, the query.max-memory-per-node, has as default value the JVM max memory * 0.1. The JVM max memory that the Embedded MPP allocates is the 80% of the memory machine (the memoryPerNode configured in the values.yaml). So, by default the memory available for executing queries in a node is memoryPerNode * 0.8 * 0.1. But with very large queries the Embedded MPP could throw Query exceeded per-node user memory limit of xxx.xx, meaning that it needs more memory to handle queries. In this case you can configure the memory settings in the additionalConfig entries of the values.yaml:

presto:

  coordinator:

    additionalConfig: [

      query.max-memory-per-node=xGB,

      query.max-total-memory-per-node=yGB,

      query.max-memory=zGB

    ]

presto:

  worker:

    additionalConfig: [

      query.max-memory-per-node=xGB,

      query.max-total-memory-per-node=yGB,

      query.max-memory=zGB

    ]

Most important memory properties are:

  • The query.max-memory-per-node property is the maximum amount of user memory a query can use on a Worker. 

The default value is JVM max memory * 0.1.


If this is not enough, our recommendation as a starting point is to set query.max-memory-per-node = JVM max memory * 0.5.

Take into account that increasing the default query.max-memory-per-node can improve the performance of large queries, but also may reduce the available memory for other queries in highly concurrent scenarios.

  • The query.max-total-memory-per-node property is the maximum amount of user and system memory a query can use on a Worker.

The default values is query.max-memory-per-node * 2.

If this is not enough, our recommendation as a starting point is to set query.max-total-memory-per-node = JVM max memory * 0.6.

Take into account that increasing the default query.max-total-memory-per-node can improve the performance of large queries, but also may reduce the available memory for other queries in highly concurrent scenarios.

  • The query.max-memory property is the maximum amount of user memory that a query can use across all workers in the cluster. For example, memory used by hash tables built during execution, memory used during sorting, etc.

The default value is 20GB.

But if the cluster needs to handle bigger queries, you should make query.max-memory bigger, our recommendation is query.max-memory = query.max-memory-per-node * numWorkers (from the values.yam).

If the recommendation does not work well in your scenario, you can run the EXPLAIN ANALYZE with these big queries and see what is the max memory peak examining the query plan.

In addition to adjusting the memory settings, sometimes, the only solution to handle large queries is to use instances with more memory or adding more nodes to the cluster.

Denodo Server

presto.denodoConnector:

This section configures the connection details to Denodo. Denodo will use Presto as its embedded Massively Parallel Processing (MPP) to accelerate queries.

This feature requires the Denodo subscription bundle Enterprise Plus and the Denodo 8.0 20230301 Update.

And if you have a cluster of Denodo servers it needs to be configured to store its metadata in an external database to take full advantage of the Denodo Embedded MPP functionalities.

server: 

Denodo server uri with the format: "//<ip_or_hostname>:<port>/admin_denodo_mpp".

Make sure that the Denodo server is accessible from the Denodo Embedded MPP.

user: 

Denodo user, denodo_mpp_user by default.

password:

Denodo user password, password for denodo_mpp_user by default.

Starting with the update 8.0u20230712-beta, password must be compliant with Denodo password policies.

It is recommended to specify the password encrypted to avoid entering it as plain text. If the password is encrypted, prefix it with encrypted:, e.g.

"encrypted:<encrypted_password>".

You can encrypt the password using the <DENODO_HOME>/bin/encrypt_password.sh script.

ssl: 

Whether SSL is enabled in the Denodo server.

trustStore: 

You only need to configure this trustStore parameter if the certificate used by the Denodo server is signed by a private authority or it is self-signed. In this case, copy the trust store file containing the Denodo SSL certificate, to the prestocluster/presto directory and set the file name here in the trustStore parameter.

trustStorePassword: 

Password of the trust store, only if the trustStore property is configured.

Catalogs

Catalogs in the Denodo Embedded MPP are the equivalent to data sources in Denodo Virtual Dataport.

The Denodo Embedded MPP is shipped with:

  • a hive catalog for accessing Hive tables over Parquet files, connected to the Embedded Hive Metastore.

  • a delta catalog for accessing Delta Lake tables, connected to the Embedded Hive Metastore.

  • an iceberg catalog for accessing Iceberg tables, connected to the Embedded Hive Metastore.

You can define new catalogs creating the catalog properties file in prestocluster/presto/conf/catalog/, e.g., prestocluster/presto/conf/catalog/glue_hive.properties. The file name, glue_hive in this case, would be the catalog name. Then you have to add the type of catalog to the connector.name property -hive-hadoop2, delta and iceberg are supported- and any other properties required by the catalog type.

connector.name=hive-hadoop2

hive.metastore=glue

# AWS region of the Glue Catalog

hive.metastore.glue.region=

# The ID of the Glue Catalog in which the metadata database resides

hive.metastore.glue.catalogid=

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

# For Bulk Data load

hive.allow-drop-table=true

hive.non-managed-table-writes-enabled=true

hive.parquet.use-column-names=true

                Catalog for Hive tables in AWS Glue Data Catalog

Hive for Parquet files

The catalog type hive-hadoop2 is used to access Parquet datasets stored in any Object Storage. Datasets are mapped to Hive tables in a Metastore, that can be the Embedded Hive Metastore or any external Metastore.

connector.name=hive-hadoop2

# Embedded Hive Metastore

hive.metastore.uri=thrift://hive-metastore:9083

                                        hive.properties

To query Parquet datasets you have to create Hive tables in the Denodo Embedded MPP, but you can use the embedded data source in Denodo to graphically explore Parquet datasets (including those using the Hive style partitioning), create tables and base views in Denodo.

This feature requires the Denodo subscription bundle Enterprise Plus and the Denodo 8.0 20230301 Update.

Explore Parquet files

Features

The Denodo Embedded MPP provides the following features when treating with Parquet files:

Limitations

  • Inserts: it is not possible to insert data into views created from Parquet files using the From Object Storage tab of the Embedded MPP data source.

For more information see Manage Views Created from Parquet Files.

Delta Lake

Delta Lake is open source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

The Denodo Embedded MPP allows reading data stored in Delta Lake tables. For this, it needs a Metastore as a metadata catalog, that can be the Embedded Hive Metastore or an external Metastore. And a catalog of type delta.

connector.name=delta

# Embedded Hive Metastore

hive.metastore.uri=thrift://hive-metastore:9083

                                        delta.properties

To query Delta Lake tables you have to manually register those tables in the Embedded MPP's Metastore through the CREATE TABLE sentence. Since the schema and the data file list are located in the Delta Log at the table's location, you need to provide a dummy column as the schema of the Delta Lake table, to avoid the no columns error in the Metastore:

CREATE TABLE delta.default.orders (

   dummy bigint

) WITH (

     external_location = 's3a://my_bucket/path/to/folder/',

     format = 'PARQUET'

);

The WITH clause of the CREATE TABLE can also be used to set other properties on the table. See Delta Tables Properties.

Once the Delta Lake table is registered, you can use the embedded data source in Denodo to create a Denodo base view on top of the table using the From MPP Catalogs tab.

                                        Explore Delta Lake tables

Query Delta Lakes tables directly

Another option is to query the table directly using the table location as the table name without registering it in the Metastore.

SELECT * FROM delta."$path$"."s3a://my_bucket/path/to/folder/";

Features

The Denodo Embedded MPP provides the following features when treating with Delta Lake tables:

  • Create base views over existing Delta Lake tables in the Metastore (embedded or external)
  • Querying
  • Embedded MPP acceleration

Limitations

  • Create base views over data stored in Delta Lake format
  • Bulk data load
  • Caching: full cache mode
  • Remote tables

Iceberg

Apache Iceberg is a high-performance table format for large analytic datasets. Iceberg tables allow schema evolution, partition evolution, and table version rollback, without the need to rewrite or migrate tables.

Denodo Embedded MPP allows querying data stored in Iceberg tables. For this, it needs a Metastore as a metadata catalog, that can be the Embedded Hive Metastore or an external Metastore. And a catalog of type iceberg.

connector.name=iceberg

# Embedded Hive Metastore

hive.metastore.uri=thrift://hive-metastore:9083

                                        iceberg.properties

Before creating Iceberg tables you need to create a new schema that sets the location where the tables -- their Parquet files and metadata files -- will be placed.

For this you can use your favorite SQL client:

CREATE SCHEMA iceberg.<schema_name> WITH (location = '<filesystem_schema>://<host>/<folders>');

or the Denodo stored procedure CREATE_SCHEMA_ON_SOURCE:

CALL CREATE_SCHEMA_ON_SOURCE('admin_denodo_mpp', 'embedded_mpp', 'iceberg', '<schema_name>', '<filesystem_schema>://<host>/<folders>');

The CREATE TABLE sentence is required to create tables:

CREATE TABLE orders (

    orderkey bigint,

    custkey bigint,

    orderstatus varchar,

    totalprice double,

    orderpriority varchar,

    clerk varchar,

    shippriority integer,

    comment varchar

) WITH (

     location = 's3a://my_bucket/path/to/folder/',

     format = 'PARQUET'

);

The WITH clause of the CREATE TABLE can also be used to set other properties on the table. See Iceberg Tables Properties.

The CREATE TABLE sentence creates an empty table even if there is data already present in the S3 folder location. The creation of Iceberg tables does not work in the same way as the creation of Hive tables since, in the latter case, existing data in the S3 bucket compatible with the table schema would be considered as the contents of the table. But for an Iceberg table, additional metadata is required for it.

Therefore, to access Parquet datasets using Iceberg tables you must create Hive tables for those Parquet datasets and then use the CREATE TABLE AS SELECT (CTAS) statement to create the new Iceberg tables from those Hive tables.

CREATE TABLE IF NOT EXISTS iceberg.schema.ctas_orders

AS (SELECT * FROM hive.default.orders);

# partition tables

CREATE TABLE IF NOT EXISTS iceberg.schema.ctas_weblog  

WITH (

        partitioning = ARRAY['elb_name', 'elb_response_code']

)

AS (SELECT * FROM hive.default.weblog)

The drawback of this method is that it will temporarily duplicate the storage of the dataset, since it will store both the data for the Hive table and the data for the new Iceberg table.

The alternative is to use the migrate Spark procedure, which replaces the existing Hive table with an Iceberg table using the same data files, as there is no Presto support for migrating Hive tables to Iceberg.

There is also no support for registering existing Iceberg tables in Presto, but the register_table Spark procedure can be used as an alternative.

Once the Iceberg table is registered you can use the embedded data source in Denodo to create a Denodo base view on top of the table using the From MPP Catalogs tab.

Explore Iceberg tables

Features

The Denodo Embedded MPP provides the following features when treating with Iceberg tables:

Limitations

  • Create base views over data stored in Iceberg format
  • Bulk data load
  • Caching: full cache mode
  • Remote tables

Embedded Hive Metastore

Denodo Embedded MPP is shipped with an embedded Hive Metastore that acts as a repository of metadata, storing it in an embedded PostgreSQL. But you can also choose to use an alternative external database (PostgreSQL, MySQL, SQL Server or Oracle) to work with the embedded Hive Metastore. You can configure it using the metastore.connectionUrl and metastore.connectionDriverName parameters and setting postgresql.enabled to false.

metastore.connectionUrl:

JDBC connection string for the database of the embedded Hive Metastore, which can be:

  • the embedded PostgreSQL, the default one: jdbc:postgresql://postgresql:5432/metastore
  • an external MySQL
  • an external Oracle
  • an external PostgreSQL
  • an external SQL Server

metastore.connectionDriverName: 

JDBC Driver class name to connect to the database of the embedded Hive Metastore, which can be:

  • org.postgresql.Driver for PostgreSQL, the default one
  • org.mariadb.jdbc.Driver for MySQL
  • com.microsoft.sqlserver.jdbc.SQLServerDriver for SQL Server
  • oracle.jdbc.OracleDriver for Oracle

In addition, there is an initialization script for the external database: MySQL, Oracle, PostgreSQL or SQL Server, included in the prestocluster/hive-metastore/scripts that must be run on that external database before deploying the cluster.

Supported Databases for Embedded Hive Metastore

Database

Minimum supported version

MySQL

5.6.17

Postgres

9.1.13

Oracle

11g

           MS SQL Server

            2008 R2

       

External Metastore

Denodo Embedded MPP can connect to external Metastores to access already created tables -Hive, Delta Lake and Iceberg- instead of starting from scratch.

External Hive Metastore

In case that you already have a Hive Metastore containing table definitions you want to access from the Denodo Embedded MPP, you could use that Hive Metastore as an external Metastore for the Embedded MPP.

For this, you can manually define a new catalog creating the catalog properties file in prestocluster/presto/conf/catalog/, e.g., prestocluster/presto/conf/catalog/external_hive.properties. The file name, external_hive in this case, would be the catalog name.

For creating the new catalog properties file you can copy one of the catalogs shipped by default, depending on the files or tables that we want to read:

  • copy the hive.properties for accessing Hive tables over Parquet files.
  • copy the delta.properties for accessing Delta Lake tables.
  • copy the iceberg.properties for accessing Iceberg tables.

Then, edit the hive.metastore.uri property with the connection parameters of the external Hive Metastore:

hive.metastore.uri=thrift://acme.com:9083

Typically, you will have to add the hdfs-site.xml and the core-site.xml files to prestocluster/presto/conf/catalog and reference them in the hive.config.resources property:

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml,/opt/presto-server/etc/catalog/hdfs-site.xml

If the external Hive Metastore uses kerberos authentication, you will need to configure additional properties in the catalog file:

hive.metastore.authentication.type=KERBEROS

hive.metastore.service.principal=hive/_HOST@REALM

hive.metastore.client.principal=primary@REALM

hive.metastore.client.keytab=/opt/secrets/xxx.keytab

Placing the xxx.keytab file in the prestocluster/presto/secrets folder.

You also need to place the krb5.conf in the prestocluster/presto/conf/catalog/ folder. And add to the values.yaml:

  additionalJVMConfig: [

   -Djava.security.krb5.conf=/opt/presto-server/etc/catalog/krb5.conf

  ]

If additionally, the Embedded MPP must authenticate to HDFS using Kerberos, you will need to configure additional properties in the catalog file:

hive.hdfs.authentication.type=KERBEROS

hive.hdfs.presto.principal=primary@REALM

hive.hdfs.presto.keytab=/opt/secrets/xxx.keytab

You can find more information in Hive Security Configuration — Presto Documentation.

Don't forget to disable the Embedded PostgreSQL if you don't need it, in the prestocluster/values.yaml, postgresql.enabled=false, so that the Embedded PostgreSQL is not deployed.

AWS Glue Data Catalog

In case that you already have a AWS Glue Data Catalog containing table definitions you want to access from the Denodo Embedded MPP, you could use the AWS Glue Data Catalog as an external Metastore for the Embedded MPP.

For this, you can manually define a new catalog creating the catalog properties file in prestocluster/presto/conf/catalog/, e.g., prestocluster/presto/conf/catalog/glue_hive.properties. The file name, glue_hive in this case, would be the catalog name.

For Hive tables on top of Parquet files:

connector.name=hive-hadoop2

hive.metastore=glue

# AWS region of the Glue Catalog

hive.metastore.glue.region=

# The ID of the Glue Catalog in which the metadata database resides

hive.metastore.glue.catalogid=

# Access Key and Secret Key for Glue

# Credentials and core-site.xml are not required when the cluster

# runs in EC2 because it will use the instance profile credentials

hive.metastore.glue.aws-access-key=

hive.metastore.glue.aws-secret-key=

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

# For Bulk Data load

hive.allow-drop-table=true

hive.non-managed-table-writes-enabled=true

hive.parquet.use-column-names=true

                                        glue_hive.properties

For Delta Lake tables:

connector.name=delta

hive.metastore=glue

# AWS region of the Glue Catalog

hive.metastore.glue.region=

# The ID of the Glue Catalog in which the metadata database resides

hive.metastore.glue.catalogid=

# Access Key and Secret Key for Glue

# Credentials and core-site.xml are not required when the cluster

# runs in EC2 because it will use the instance profile credentials

hive.metastore.glue.aws-access-key=

hive.metastore.glue.aws-secret-key=

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

#

hive.parquet.use-column-names=true

                                        glue_deltalake.properties

For Iceberg tables:

connector.name=iceberg

iceberg.catalog.type=HIVE

hive.metastore=glue

# AWS region of the Glue Catalog

hive.metastore.glue.region=

# The ID of the Glue Catalog in which the metadata database resides

hive.metastore.glue.catalogid=

# Access Key and Secret Key for Glue

# Credentials and core-site.xml are not required when the cluster

# runs in EC2 because it will use the instance profile credentials

hive.metastore.glue.aws-access-key=

hive.metastore.glue.aws-secret-key=

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

#

hive.parquet.use-column-names=true

                                        glue_iceberg.properties

These are the AWS privileges required by the Embedded MPP when accessing the AWS Glue Data Catalog:

  • Reading from AWS Glue:
  • glue:GetDatabases
  • glue:GetDatabase
  • glue:GetTables
  • glue:GetTable
  • glue:GetPartitions
  • glue:GetPartition
  • glue:BatchGetPartition

  • Writing to AWS Glue. Same as for reading and also:
  • glue:CreateTable
  • glue:DeleteTable
  • glue:UpdateTable
  • glue:BatchCreatePartition
  • glue:UpdatePartition
  • glue:DeletePartition

Embedded PostgreSQL

postgresql.enabled:

Whether to use the embedded PostgreSQL or an external database for the embedded Hive Metastore. true by default.

The externally-managed database option has the advantage of keeping the metadata outside the cluster lifecycle. And in some cases, it is the only option, when there are policies restricting the type of RDBMS that can be installed, backups, maintenance, etc.

Persistent Storage

When using the embedded PostgreSQL, the Denodo Embedded MPP uses a Kubernetes Persistent Volume, to ensure the persistence of metadata.

But the Denodo Embedded MPP deployment does not include a Persistent Volume object, as the user instantiating it may not have permission to create Persistent Volumes. It includes a PersistentVolumeClaim, that used in conjunction with Storage Class, dynamically requests the Persistent Volume. At least one Storage Class has to be defined in your cluster.

Therefore, to configure the Denodo Embedded MPP storage, there are two options:

  1. Use the actual definition that causes a Persistent Volume to be automatically provisioned for the cluster with the default Storage Class. Many cluster environments have a default Storage Class installed, or the administrators can create their own default Storage Class.

  1. Provide a Storage Class name into the postgresql.pvClaim.storageClassName field in the values.yaml file.

You can use kubectl to check for StorageClass objects:

kubectl get sc

sc is an acronym for StorageClass. The output is similar to this:

NAME                         PROVISIONER               AGE

standard (default)           kubernetes.io/gce-pd      1d

gold                         kubernetes.io/gce-pd      1d

The default StorageClass is marked by (default).

postgresql.pvClaim.storage:

Storage size requested in case of using the embedded PostgreSQL.

Default size is 5Gi, but it should be configured according to your scenario and the volumes of data, as it is related to the metadata, which you plan to use.

postgresql.pvClaim.annotations:

Annotations for persistence storage. A typical use case for these annotations is if you want to preserve the Denodo Embedded MPP metadata after cluster removal you can add the following annotation:

"helm.sh/resource-policy": keep

Pod Scaling

When you deploy the Denodo Embedded MPP, you define in prestocluster/values.yaml how many Workers (presto.numWorkers) you want to run. But you can scale the Denodo Embedded MPP later, increasing or decreasing the number of Workers.

It is recommended to use pod-based scaling together with node-based scaling to coordinate scalability of the pods with the behavior of the nodes in the cluster. For example, when you need to scale up, the cluster autoscaler can add new nodes, and when scaling down, it can shut down unneeded nodes.

For autoscaling in AWS EKS see Configuring an Autoscaling Denodo Embedded MPP Cluster in EKS.

For autoscaling in Azure AKS see Configuring an Autoscaling Denodo Embedded MPP Cluster in AKS.

Manually

You can add or remove pod replicas manually with the kubectl scale command, changing the number of Workers (replicas) you want to run:

$ kubectl scale deployment/presto-worker --replicas=6

Needless to say, the cluster must have sufficient resources.

Autoscaling

presto.autoscaling:

A HorizontalPodAutoscaler template is distributed, but it is disabled by default in the presto.autoscaling.enabled parameter.

Horizontal scaling means that the response to an increase in load is to deploy more pods. If the load decreases, and the number of pods is greater than the configured minimum, HorizontalPodAutoscaler instructs the workload resource to scale down.

If you want autoscaling to be based on the utilization of a resource as a percentage, you must specify requests for that resource in the presto.worker.resources parameter.

For example, consider the value you want to set for your presto.worker.resources.requests.cpu. This represents the amount of CPU reserved for the pod. We recommend setting the value to be 80% of the total number of CPUs of a Worker as a starting point. So for 32 core nodes you can set the value to 25.6 or 25600m, each of which represents 25.6 cores:

This can then be adjusted as needed. The autoscaler will use this value along with the targetCPUUtilizationPercentage to determine if a new Worker is needed.

You can check the current status of the HorizontalPodAutoscaler running:

$ kubectl get hpa

The output is similar to:

NAME          REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS

presto-worker Deployment/presto-worker  22%/80%   2             6       2     

There are two requirements to ensure the HorizontalPodAutoscaler (HPA) works:

  1. A monitoring tool is installed for providing metrics to Kubernetes: Metrics Server, Prometheus, etc.

  1. The Worker has CPU resource requests, presto.worker.resources, defined in the prestocluster/values.yaml file. If the CPU utilization is not configured the autoscaler (HPA) will not take any action.

Note, that resources are not configured by default, as we leave this configuration as a choice for the cluster administrator.

Resource Groups

In an environment with multiple concurrent user sessions running queries, not all user sessions have the same importance. You may want to give more priority to some type of queries over another. Denodo Embedded MPP makes use of Resource Groups to organize how different workloads are prioritized.

Resource Groups manage quotas for two main resources: CPU and memory. In addition, granular resource constraints such as concurrency, time, and cpuTime can be specified.

Resource Group limits are only enforced during admission. Once the query starts executing, the Resource Group manager has no control over it. Instead, the concept of penalty is used for groups that exceed their resource specification, checking if a Resource Group has exceeded its limit before allowing it to start a new query.

To enable Resource Management in the Denodo Embedded MPP, set presto.coordinator.resourceGroups: true in the prestocluster/values.yaml file.

Example

In the example configuration shown below, the one that is distributed (prestocluster/presto/conf/resource_groups.json, which you should modify according to your needs), there are several Resource Groups and four selectors that define which queries are excecuted in which Resource Group:

  • The first selector matches queries from datascience and places them in the global.adhoc.datascience group.

  • The second selector matches queries from bi and places them in the global.adhoc.bi group.

  • The third selector matches queries from pipeline and places them in the global.pipeline group.

  • The fourth selector matches queries from admin and places them in the admin group.

  • The last selector is a catch-all for all queries that have not been matched into any group.

Together, these selectors implement the following policy:

  • The admin group can run up to 50 concurrent queries (hardConcurrencyLimit).

For all other groups:

  • No more than 100 total queries may run concurrently (hardConcurrencyLimit).

  • Ad-hoc queries, such as queries from BI tools can run up to 10 concurrent queries (hardConcurrencyLimit) and queries from Data Science tools can run up to 2 concurrent queries (hardConcurrencyLimit).

  • Non ad-hoc queries will run under the global.pipeline group, with a total concurrency of 45 (hardConcurrencyLimit). Queries are run in FIFO order.

  • All remaining queries are placed in the global group.

{

  "rootGroups": [

    {

      "name": "global",

      "softMemoryLimit": "80%",

      "hardConcurrencyLimit": 100,

      "maxQueued": 1000,

      "schedulingPolicy": "weighted",

      "jmxExport": true,

      "subGroups": [

        {

          "name": "adhoc",

          "softMemoryLimit": "10%",

          "hardConcurrencyLimit": 50,

          "maxQueued": 1,

          "schedulingWeight": 10,

          "subGroups": [

            {

              "name": "datascience",

              "softMemoryLimit": "10%",

              "hardConcurrencyLimit": 2,

              "maxQueued": 1,

              "schedulingWeight": 10,

              "schedulingPolicy": "weighted_fair"

            },

            {

              "name": "bi",

              "softMemoryLimit": "10%",

              "hardConcurrencyLimit": 10,

              "maxQueued": 100,

              "schedulingWeight": 10,

              "schedulingPolicy": "weighted_fair"

            }

          ]

        },

        {

          "name": "pipeline",

          "softMemoryLimit": "80%",

          "hardConcurrencyLimit": 45,

          "maxQueued": 100,

          "schedulingWeight": 1,

          "jmxExport": true

        }

      ]

    },

    {

      "name": "admin",

      "softMemoryLimit": "100%",

      "hardConcurrencyLimit": 50,

      "maxQueued": 100,

      "schedulingPolicy": "query_priority",

      "jmxExport": true

    }

  ],

  "selectors": [

    {

      "source": "datascience",

      "group": "global.adhoc.datascience"

    },

    {

      "source": "bi",

      "group": "global.adhoc.bi"

    },

    {

      "source": "pipeline",

      "group": "global.pipeline"

    },

    {

      "source": "admin",

      "group": "admin"

    },

    {

      "group": "global"

    }

  ],

  "cpuQuotaPeriod": "1h"

}

prestocluster/presto/conf/resource_groups.json

For Denodo to take advantage of the Denodo Embedded MPP's Resource Group mechanism, you could create different data sources in Denodo for the same Denodo Embedded MPP and select the corresponding Resource Group (source) using the applicationNamePrefix driver properties.

embedded_mpp dataSource driver properties

Deployment

The cluster.sh script, in the root folder of the distribution, automates the entire deployment process.

Running on Windows

  • You need a Bash compatible shell such as Cygwin or Git Bash installed or use Windows Subsystem for Linux (WSL).

  • Check if the environment variable HADOOP_HOME is set on this computer. Since Hadoop is required by cluster.sh to transparently manage the encryption of all user-provided credentials.

            If HADOOP_HOME is not set:

  1. Create a directory. For example, <DENODO_HOME>\hadoop_win_utils.

  1. Create a directory named bin inside the new directory. For example, <DENODO_HOME>\hadoop_win_utils\bin.

  1. Set the environment variable HADOOP_HOME to point to <DENODO_HOME>\hadoop_win_utils.

  1. Copy the content of the <DENODO_HOME>\dll\vdp\winutils directory to %HADOOP_HOME%\bin.

The cluster.sh script has these prerequisites:

  1. Create the secret to authenticate to the Denodo's Harbor Registry.

The alternative to creating the secret is to add the credentials to the Container registry to the image.pullCredentials section and let the Helm chart handle the creation of the secret itself.

See the Container Registry section.

  1. kubectl, Helm v3 and Java are required.

Let's see now how to deploy the Denodo Embedded MPP!

The Denodo Embedded MPP is distributed as a Helm chart. However, we still provide the cluster.sh script on top of Helm that transparently manages the encryption of all credentials supplied by the user. To do this, the Hadoop credential provider framework is used to create a keystore file, creds.jceks, to avoid using clear values (e.g. the S3 Access Key Id and the S3 Secret Access Key).

cluster.sh deploy [OPTIONS]

COMMON OPTIONS

--presto-password: This option sets the password for the presto user, instead of using the default: pr3st%

If this password is not specified in the command line, cluster.sh deploy will prompt for it, keeping passwords out of the bash history. Otherwise, the presto.coordinator.passwordAuth.prestoPassword supplied in the values.yaml file will be used.

--credstore-password: Password to protect the encrypted credentials file, creds.jceks, which will store all the credentials supplied by the user.

If this password is not specified in the command line, cluster.sh deploy will prompt for it, keeping passwords out of the bash history.

--metastore-password: Password for the database of the embedded Hive Metastore.

If this password is not specified in the command line, cluster.sh deploy will prompt for it, keeping passwords out of the bash history. Otherwise, the metastore.connectionPassword supplied in the values.yaml file will be used.

--register: With this option the deployment process includes a final step that creates a special data source in Denodo called ‘embedded_mpp’. It also configures the Denodo query optimizer to use the Denodo Embedded MPP to accelerate queries.

This feature requires the Denodo subscription bundle Enterprise Plus and the Denodo 8.0 20230301 Update.

Object Storage credentials

The Denodo Embedded MPP must have a set of credentials with sufficient privileges to access all Object Storage data that is planned to access from Denodo. This may require the creation of a new role, principal or set of credentials for the Embedded MPP.

This is not a security problem, as Denodo should be the only client with access to the Denodo Embedded MPP and the control to the different Object Storage locations is managed in Denodo.

Amazon S3 credentials

There are two options to deploy a Denodo Embedded MPP that will access S3 datasets:

  1. The recommended one: provide no AWS S3 credentials to the cluster.sh deploy command.

Used when the Denodo Embedded MPP will run in AWS EKS and will access S3 with one of these two methods:

  1.  IAM Roles for Service Accounts: When the Denodo Embedded MPP runs on Amazon EKS you can associate an IAM role with the serviceAccount through the serviceAccount.annotations in the values.yaml, e.g:

eks.amazonaws.com/role-arn: arn:aws:iam::<aws_account_id>:role/your-role

and the Denodo Embedded MPP will access S3, using the permissions configured in that IAM role.

  1.  IAM EC2 instance profile: No changes required in values.yaml.

cluster.sh deploy --credstore-password xxx

  1. Provide AWS S3 credentials to the cluster.sh script:

  • --s3-access-key: the Access Key Id

  • --s3-secret-access: the Secret Access Key

If this secret is not specified in the command line, cluster.sh deploy will prompt for it, keeping secrets out of the bash history

cluster.sh deploy --s3-access-key xxx --s3-secret-access yyy --credstore-password zzz

These are the AWS privileges required by the Embedded MPP when accessing the AWS S3 buckets:

  • Reading from AWS S3:
  • s3:GetObject
  • s3:ListBucket

  • Writing to AWS S3. Same as for reading and also:
  • s3:PutObject
  • s3:DeleteObject

AWS credentials provider

The Denodo Embedded MPP is shipped with a credentials provider chain configured by default: DenodoAWSCredentialsProviderChain.

This chain looks for AWS credentials in this order:

  • SimpleAWSCredentialsProvider: Loads credentials from fs.s3a.access.key and fs.s3a.secret.key properties in Hadoop configuration files.

  • EnvironmentVariableCredentialsProvider: Loads credentials from environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY.

  • SystemPropertiesCredentialsProvider: Loads credentials from Java system properties aws.accessKeyId and aws.secretKey.

  • WebIdentityTokenCredentialsProvider: Loads Web Identity Token credentials from the environment or container.

  • ProfileCredentialsProvider: Load credential profiles file at the default location ~/.aws/credentials.

  • EC2ContainerCredentialsProviderWrapper: Loads credentials from EC2, typically using the InstanceProfileCredentialsProvider.

If none of these providers fit your needs, you will need to change the credentials provider configured in:

  1. prestocluster/presto/conf/catalog/core-site.xml:
  1. Replace the value of the presto.s3.credentials-provider property with the AWS credentials provider of your choice.
  2. Include any other properties required by this credential provider.

  1. prestocluster/hive-metastore/conf/core-site.xml:
  1. Replace the value of the fs.s3a.aws.credentials.provider property with the AWS credentials provider of your choice.
  2. Include any other properties required by this credential provider.

S3-compatible storage

S3-compatible storage allows access to and management of the data it stores over an S3 compliant interface, e.g. MinIO, IBM Cloud Object Storage, Dell EMC ECS,...

In order to access datasets stored in S3-compatible storage, you have to configure the following parameters in prestocluster/values.yaml file:

  • presto.hive.s3Endpoint: The storage endpoint server. It can be found in the documentation of the S3-compatible storage.
  • presto.hive.s3PathStyleAccess: true. Enable S3 path-style access by eliminating the need to configure DNS for virtual hosting.

Then, you have to provide credentials to the cluster.sh script:

  • --s3-access-key: the Access Key Id

  • --s3-secret-access: the Secret Access Key

If this secret is not specified in the command line, cluster.sh deploy will prompt for it, keeping secrets out of the bash history

cluster.sh deploy --s3-access-key xxx --s3-secret-access yyy --credstore-password zzz

Azure Data Lake Gen 2 credentials

There are three options to deploy a Denodo Embedded MPP that will access Data Lake Storage gen2 datasets:

  1. The recommended one: provide no credentials to the cluster.sh deploy command.

Used when the Denodo Embedded MPP will run in Azure AKS and will access Data Lake Storage gen2 using Azure Managed Identities. For this you need to add the following properties to the presto/conf/catalog/core-site.xml and hive-metastore/conf/core-site.xml, before the Embedded MPP is deployed:

<property>

  <name>fs.azure.account.auth.type</name>

  <value>OAuth</value>

</property>

<property>

  <name>fs.azure.account.oauth.provider.type</name>

  <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>

</property>

<property>

  <name>fs.azure.account.oauth2.msi.tenant</name>

  <value>MSI Tenant ID</value>

</property>

<property>

  <name>fs.azure.account.oauth2.msi.endpoint</name>

  <value>http://169.254.169.254/metadata/identity/oauth2/token

</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.id</name>

  <value>Client ID</value>

</property>

  1. Provide the Azure OAuth2 client credentials. For this you need to add the following properties to the presto/conf/catalog/core-site.xml and hive-metastore/conf/core-site.xml, before the Embedded MPP is deployed:

<property>

  <name>fs.azure.account.auth.type</name>

  <value>OAuth</value>

</property>

<property>

  <name>fs.azure.account.oauth.provider.type</name>

  <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.endpoint</name>

 <value>https://login.microsoftonline.com/<directory_id>/oauth2/token</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.id</name>

  <value>Client ID</value>

</property>

<property>

  <name>fs.azure.account.oauth2.client.secret</name>

  <value>Secret</value>

</property>

  1. Provide the Azure credentials for the Shared Key authentication method to the cluster.sh script:

  • --abfs-storage-account: the name of the Storage Account

  • --abfs-storage-key: the access key that protects access to your Storage Account

If this access key is not specified in the command line, cluster.sh deploy will prompt for it, keeping access keys out of the bash history

cluster.sh deploy --abfs-storage-account xxx --abfs-storage-key yyy --credstore-password zzz

Azure Data Lake Gen 1 credentials

Legacy

On Feb 29, 2024 Azure Data Lake Storage Gen1 will be retired. If you use Azure Data Lake Storage Gen1, make sure to migrate to Azure Data Lake Storage Gen2 prior to that date.

There are two options to deploy a Denodo Embedded MPP that will access Data Lake Storage gen1 datasets:

  1. The recommended one: provide no credentials to the cluster.sh deploy command.

Used when the Denodo Embedded MPP will run in Azure AKS and will access Data Lake Storage gen1 using Azure Managed Identities. For this you need to add the following properties to the presto/conf/catalog/core-site.xml and hive-metastore/conf/core-site.xml, before the Embedded MPP is deployed:

<property>

  <name>fs.adl.oauth2.access.token.provider.type</name>

  <value>Msi</value>

</property>

<property>

  <name>fs.adl.oauth2.msi.port</name>

  <value>PORT NUMBER for the REST endpoint of the token service exposed to localhost by the identity extension in the VM, (if different from the default of 50342)</value>

</property>

  1. Provide the Azure OAuth2 client credentials to the cluster.sh script:

  • --adl-client-id: Client id

  • --adl-client-secret: OAuth2 refresh token from the Azure Active Directory service associated with the client id.

If this secret is not specified in the command line, cluster.sh deploy will prompt for it, keeping secrets out of the bash history

  • --adl-tenant-id: id of the application's Azure Active Directory tenant

cluster.sh deploy --adl-client-id www --adl-client-secret xxx --adl-tenant-id yyyy --credstore-password zzz

Azure Blob Storage credentials

Legacy

The legacy Windows Azure Storage Blob driver (WASB) has been deprecated. Azure Data Lake Storage Gen2 (ABFS) has numerous benefits over WASB that recommend migration.

Provide the Azure credentials to the cluster.sh script:

  • --wasb-storage-account: the name of the Storage Account

  • --wasb-storage-key: the access key that protects access to your Storage Account

If this access key is not specified in the command line, cluster.sh deploy will prompt for it, keeping access keys out of the bash history

cluster.sh deploy --wasb-storage-account xxx --wasb-storage-key yyy --credstore-password zzz

Google Cloud Storage credentials

For accessing the GS datasets you have to:

  • Place the key JSON file for the Google Cloud service account in 'prestocluster/presto/secrets' and in 'hive-metastore/secrets' directories

  • Place the key JSON file name in hive.gcsKeyFile in the values.yaml.

cluster.sh deploy --credstore-password zzz

Denodo Embedded MPP at Denodo

Since the Update 8.0u20230301, Denodo has customized Presto to behave as the Denodo Platform's embedded Massively Parallel Processing (MPP) to accelerate queries.

This feature requires the Denodo subscription bundle Enterprise Plus and the Denodo 8.0 20230301 Update.

And if you have a cluster of Denodo servers it needs to be configured to store its metadata in an external database to take full advantage of the Denodo Embedded MPP functionalities.

To use the Denodo Embedded MPP you have to register it in Denodo. The registration process can be executed:

  • within the deployment process: –register option in the cluster.sh deploy command.

  • after the deployment process: cluster.sh register --register-user --register-password [--presto-password] command.

The registration process consists of:

  1. Creation of a new database admin_denodo_mpp, a new user denodo_mpp_user and a special data source in Denodo called embedded_mpp:

embedded_mpp datasource

This embedded_mpp data source allows you to connect to an Object Storage (S3, HDFS) graphically, explore its data and create base views on Parquet files and the corresponding Hive tables in the Denodo Embedded MPP. A functionality similar to the Metadata Discovery Tool for Parquet files but now integrated in the Denodo Platform.

For the embedded_mpp to connect to the Object Storage you have to manually configure the Object storage configuration in the Read & Write tab. You must select the file system you want to access and provide the credential information. The file systems available graphically are S3 and HDFS. You can use other systems like Azure Data Lake Storage that are compatible with the Hadoop API. To do this, select HDFS and provide the required Hadoop properties (see Support for Hadoop-compatible routes section).

embedded_mpp Object Storage configuration

You can then add the paths you want to browse from that object storage. Once you have saved the necessary credentials and paths, you can click on Create Base View to browse these paths and select the ones you want to import. Denodo automatically detects the folders corresponding to tables in Parquet format (including those using the Hive style partitioning).

embedded_mpp base view creation

Select the tables to import and click on Create selected to create the base view. Denodo will create the base view and a table in the embedded_mpp data source to access the data. Denodo will create the table in the Hive catalog and the schema of your choice. You can select the schema from those available in the Target schema drop-down at the bottom of the Create Base View dialog.

For each table Denodo will automatically calculate its statistics by calling the stored procedure COMPUTE_SOURCE_TABLE_STATS. This stored procedure collects statistical information about the data that the Denodo Embedded MPP optimizer uses to plan the query based on cost strategies.

  1. Configuration of the Denodo query optimizer to consider this Denodo Embedded MPP for query acceleration.

Denodo Embedded MPP acceleration

This is useful in scenarios where a query combines large amounts of Parquet data stored in an Object Storage such as HDFS, S3 or Azure Data Lake with data in a different data source. In these cases, the Denodo query optimizer may decide to send the query to the Denodo Embedded MPP. The Denodo Embedded MPP can access the data in Object Storage using its own engine and can access the data outside the Object Storage in streaming through Denodo, without the need to create temporary tables or files.

In this way, the Denodo query engine can combine its powerful optimization techniques, federation and security capabilities with parallel processing on big data.

Requirements to connect to Denodo Embedded MPP from Denodo

Note that if the certificate used by the Denodo Embedded MPP is signed by a private authority, or it is self-signed, you have to import the Denodo Embedded MPP certificate into the Denodo server trust store.

Note: Since Denodo 8.0 Update 7 this import is not required when using the self-signed certificate and the Denodo Cloud Marketplace images or Denodo servers created through the Solution Manager Automated Mode with the images provided by Denodo, as the images generated with Denodo 8.0 Update 7 include the self-signed certificate, certs/certificate.crt distributed with the Denodo Embedded MPP.

The certs/certificate.crt is distributed for testing purposes ONLY. This certificate accepts presto-denodo as the Denodo Embedded MPP hostname. In this case, you have to add an entry in the hosts file where the Denodo server is running, with presto-denodo and the IP that appears as the EXTERNAL-IP of the presto service.

Kubernetes services status

/etc/hosts file

Bulk Data Load

The Denodo Embedded MPP data source can be used to perform bulk data loads from Denodo.

In this case, Denodo:

  1. First generates temporary files containing the data to insert in Parquet file format (Delta Lake and Iceberg table formats not supported),

  1. Then uploads those files to the specific path configured in this section.

  1. Finally, Denodo will make the necessary operations to make sure the database table takes the data from the path provided.

For more information see Bulk Data Load on a Distributed Object Storage like HDFS or S3 at Denodo.

Before setting up the Bulk Data Load in Denodo you have to create a new schema in the Denodo Embedded MPP that sets the location where the parquet files created by Denodo will be placed.

To do this you can use the Denodo stored procedure CREATE_SCHEMA_ON_SOURCE:

CALL CREATE_SCHEMA_ON_SOURCE('admin_denodo_mpp', 'embedded_mpp', 'hive', 'test', '<filesystem_schema>://<host>/<folders>');

To configure the Bulk Data Load, check Use Bulk Data Load APIs of the embedded_mpp data source in its Read & Write tab, and fill in the parameters: Server time zone, HDFS URI (adl://<adl_host>/<dirs> in the example), and the Catalog and Schema:

Then, depending on the chosen file system, you may need to add some Hadoop properties to configure authentication (see Support for Hadoop-compatible routes section).

In the example, properties to configure access to Azure Data Lake Storage are:

Finally, click the Test bulk load button to check that everything is working fine:

When using S3 storage, you must set the Denodo server property 'com.denodo.vdb.util.tablemanagement.sql.PrestoTableManager.useExternalTables' to 'false' if you receive the error: External location must be a directory.

Cost Based Optimization

The Denodo cost-based optimization process needs the statistics of at least all the base views involved in the query (the average size of a field, the maximum and minimum values, the number of distinct values, etc.) to estimate the cost of possible execution plans. Statistics allow Denodo to decide whether to apply optimizations such as join reordering, choosing a join execution method (merge, hash, nested,...), applying ‘aggregation pushdown’, or performing data movement automatically. It is therefore strongly recommended to gather statistics for each base view created in Denodo Embedded MPP.

Upgrade from previous versions

The process of upgrading to a new version of the Denodo Embedded MPP involves transferring the metadata from the installed version to the new one.

In case you were using the embedded PostgreSQL, it is necessary to save the metadata of the old cluster by executing a dump over the PostgreSQL database of the embedded Hive Metastore, before deleting the cluster. This dump generates a text file with SQL commands that, when fed back into the embedded Hive Metastore of the new cluster, will recreate the database in the same state as it was at the time of the dump: keeping all the metadata of the Denodo Embedded MPP, that is, the datasets exposed as tables.

Dump in the old cluster:

$ kubectl exec <old PostgreSQL Pod> -- bash -c "PGPASSWORD=hive pg_dump -c -U hive -h localhost metastore" > database.sql

Restore to the new cluster:

$ cat database.sql | kubectl exec -i <new PostgreSQL Pod> -- bash -c "PGPASSWORD=hive psql -U hive -h localhost -d metastore"

✸  Note that restoring the database is the first thing to do before creating new tables in the Denodo Embedded MPP.

Cleanup

Do not forget to run cluster.sh delete to delete all the resources you have created, when you no longer need the cluster.

In case you are using the embedded PostgreSQL, the delete command destroys the metadata persistence, that is, the table definitions.

If you want to save the cluster metadata for later deployment, you need to run a dump on the PostgreSQL database of the embedded Hive Metastore, before deleting the cluster. This dump generates a text file with SQL commands that, when fed back to the embedded Hive Metastore, will recreate the database in the same state as it was at the time of the dump.

Dump in the old cluster:

$ kubectl exec <PostgreSQL Pod> -- bash -c "PGPASSWORD=hive pg_dump -c -U hive -h localhost metastore" > database.sql

Restore to the new cluster:

$ cat database.sql | kubectl exec -i <PostgreSQL Pod> -- bash -c "PGPASSWORD=hive psql -U hive -h localhost -d metastore"

✸   Note that restoring the database is the first thing to do before creating new tables in the Denodo Embedded MPP.

Troubleshooting Tips

  • Query exceeded per-node user memory limit of xxx.xx.

Cause: The max amount of user memory a query can use on a worker, the query.max-memory-per-node, has as default value the JVM max memory * 0.1.  The JVM max memory that the Embedded MPP allocates is the 80% of the memory machine (the memoryPerNode configured in the values.yaml). But this error means that your Embedded MPP needs more memory than JVM max memory * 0.1 to handle queries.

Solution: You should increase the available memory by configuring the memory settings, query.max-memory-per-node, query.max-total-memory-per-node and   query.max-memory.

For more information on how to configure the memory settings see the Memory Settings section.

In addition to adjusting the memory settings, sometimes, the only solution to handle large queries is to use instances with more memory or adding more nodes to the cluster.

  • Abandoned queries: Query ... has not been accessed since…

Cause: This means that the client of the Embedded MPP, that is Denodo, is not processing the query results or is processing them slowly, so the Embedded MPP assumes that the client has left.

Solution: You can increase query.client.timeout for the MPP coordinator, default value is 5 minutes, 5.00m, in the values.yaml, element: presto.coordinator.additionalConfig, and restart the Embedded MPP:

  # -- Entries for extra config properties

  additionalConfig: [

    query.client.timeout=10.00m

  ]

But, in most cases, this is an indication that you should review your query to identify where the bottleneck is and take actions to improve your query performance: Detecting Bottlenecks in a Query.

  • PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

when connecting to the Embedded MPP from the Denodo server.

Cause: The server certificate you are trying to connect is missing from the truststore of the client's JVM. The problem is that the server certificate is self-signed (a CA did not sign it) or it is signed by a private authority that does not exist within the client's  truststore.

Solution: Make sure you have imported the certificate of the Denodo Embedded MPP into the Denodo server's truststore. See the instructions in the SSL/TLS section.

  • KeyStores with multiple certificates are not supported.

Cause: The error occurs when a certificate with multiple SubjectAlternativeName (SAN) elements is used:

    SubjectAlternativeName [

      DNSName: *.acme.com

      DNSName: acme.com

      DNSName: b.acme.com

    ]

Solution: Use a certificate with a single element in the SubjectAlternativeName (SAN) field, as shown below:

    SubjectAlternativeName [

      DNSName: *.acme.com

    ]

  • Unable to establish connection: Unrecognized connection property 'protocols'.

Cause: Denodo is loading an older version of the Presto driver.

Solution: Remove Presto driver backups from $DENODO_HOME/lib/extensions/jdbc-drivers/presto-0.1x, leaving only the presto-jdbc.jar.

  • helm install throws 'Cannot list resource "pods" in API group "" in the namespace "kube-system"'.

Cause: Helm v2 is being used.

Solution: Upgrade to Helm v3.

  • org.apache.parquet.io.PrimitiveColumnIO cannot be cast to class org.apache.parquet.io.GroupColumnIO reading Hive tables with complex/compound structures.

Cause: The Hive table schema is not compatible with the Parquet schema.

Solution: Check the schema in the Parquet files and compare it with the schema declared in the Hive table in the Embedded MPP.

There are multiple tools available to inspect the schema of a Parquet file. One of the most common is called parquet-tools.

Appendix A: Other uses of Denodo Embedded MPP

Since the Update 8.0u20230301, Denodo has customized Presto to behave as the Denodo Platform's embedded Massively Parallel Processing (MPP) to accelerate queries, that is, the Denodo Embedded MPP.

But this feature requires the Denodo subscription bundle Enterprise Plus and the Denodo 8.0 20230301 Update. If you do not have this Denodo subscription you can also deploy the Denodo Embedded MPP and connect to it from Denodo, but without benefiting from Denodo's optimization capabilities.

Connect from Denodo

To establish a connection from Denodo to the Denodo Embedded MPP you have to create a JDBC data source and fill in all the required fields:

  • Name
  • Database adapter: PrestoDB
  • Driver class path: presto-0.1x
  • Driver class: com.facebook.presto.jdbc.PrestoDriver
  • Database URI: jdbc:presto://presto-denodo:8443/hive/default?SSL=true&protocols=http11
  • Login: presto
  • Password: pr3st% (or the custom one if specified in the deployment process: cluster.sh deploy)

Requirements to connect to Denodo Embedded MPP from Denodo

Note that if the certificate used by the Denodo Embedded MPP is signed by a private authority, or it is self-signed, you have to import the Denodo Embedded MPP certificate into the Denodo server trust store.

Note: Since Denodo 8.0 Update 7 this import is not required when using the self-signed certificate and the Denodo Cloud Marketplace images or Denodo servers created through the Solution Manager Automated Mode with the images provided by Denodo, as the images generated with Denodo 8.0 Update 7 include the self-signed certificate, certs/certificate.crt distributed with the Denodo Embedded MPP.

The certs/certificate.crt is distributed for testing purposes ONLY. This certificate accepts presto-denodo as the Denodo Embedded MPP hostname. In this case, you have to add an entry in the hosts file where the Denodo server is running, with presto-denodo and the IP that appears as the EXTERNAL-IP of the presto service.

Kubernetes services status

/etc/hosts file

Presto datasource in Denodo

Once the data source is configured, click on Create base View to explore the Presto schemas with their tables and their columns.

Presto tables in Denodo

You can now run any query you want on your datasets.

Query results in Denodo

Bulk Data Load

The Presto data source can be used to perform bulk data loads from Denodo.

In this case, Denodo:

  1. First generates temporary files containing the data to insert in Parquet file format (Delta Lake and Iceberg formats not supported),

  1. Then uploads those files to the specific path configured in this section.

  1. Finally, Denodo will make the necessary operations to make sure the database table takes the data from the path provided.

For more information see Bulk Data Load Databases using Hadoop-compatible storage like in the MPP at Denodo.

Before setting up the Bulk Data Load in Denodo, you have to create a new schema in Presto that sets the location where the parquet files created by Denodo will be located.

To do this you can use the Denodo stored procedure CREATE_SCHEMA_ON_SOURCE:

CALL CREATE_SCHEMA_ON_SOURCE('admin_denodo_mpp', 'embedded_mpp', 'hive', 'test', '<filesystem_schema>://<host>/<folders>');

To configure the Bulk Data Load, check Use Bulk Data Load APIs of the Denodo Embedded MPP data source in its Read & Write tab, and fill in the parameters: Server time zone, HDFS URI (adl://<adl_host>/<dirs> in the example), and the Catalog and Schema:

Then, depending on the chosen file system, you may have to add some Hadoop properties to configure authentication (see Support for Hadoop-compatible routes section).

In the example, properties to configure access to Azure Data Lake Storage are:

Finally, click the Test bulk load button to check that everything is working fine:

When using S3 storage, you must set the Denodo property 'com.denodo.vdb.util.tablemanagement.sql.PrestoTableManager.useExternalTables' to 'false' if you receive the error: External location must be a directory.

Metadata Discovery Tool

The metadata discovery tool scans the distributed storage, starting from a folder given by the user, and for each Paquet dataset it finds, in a folder or subfolder, creates the corresponding Hive table in Presto and the corresponding base view in Denodo.

The metadata discovery tool can create Presto tables from partitioned datasets that follow the Hive naming convention for folders, i.e.: columnName=value, elb_name=abc.

.../weblogs/elb_name=abc/elb_response_code=200/…

.../weblogs/elb_name=abc/elb_response_code=404/…

.../weblogs/elb_name=xyz/elb_response_code=200/…

The metadata discovery tool invokes sync_partition_metadata automatically, after each partition table creation.

The tool is distributed as a Denodo stored procedure and as a script.

Stored Procedure

The DISCOVER_PRESTO_METADATA stored procedure is distributed in the folder metadata-discovery/sp/denodo-presto-k8scluster-{version}-jar-with-dependencies.jar.

In order to use this store procedure in Denodo, first, you have to import the file denodo-presto-k8scluster-{version}-jar-with-dependencies.jar using  the menu option File > Extension management in VDP Administration.

Secondly, you have to create the stored procedure by clicking the menu option File > New > Stored procedure in the VDP Administration, and provide the following values:

Name: a name

Class name: com.denodo.connect.presto.MetadataDiscoveryStoredProcedure

Select Jars: denodo-discover-presto-metadata

Finally, you can execute the stored procedure:

  • Using the Execute button in the dialog that displays the schema of the stored procedure. Denodo will display a dialog where you have to enter the input values.

  • Using the VQL Shell:

SELECT * FROM discoverprestometadata()

WHERE dir_to_explore='s3a://bucket/dir/'

and credentials_path='/absolute/path/to/core-site.xml'

and datasource_name='the_presto_datasource'

and duplicate_strategy='NEVER_REPLACE'

and included_pattern=NULL

and baseview_folder='/01 - Connectivity/02 - Base Views'

and baseview_prefix='bv_'

and analyze_tables=true

The input parameters required by the stored procedure are:

  • dir_to_explore: directory of the distributed storage path (s3a, adl, wasb, abfs), to start the metadata discovery process.

  • datasource_name: the name of a Presto data source already created in Denodo (for more information see Connect from Denodo section).

The base views created by this stored procedure will belong to this Presto data source.

The following input parameters are optional:

  • credentials_path: absolute path to core-site.xml with the distributed storage credentials.

  • duplicate_strategy: chooses what to do with the metadata that already exists in Presto and Denodo. Possible values are NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, ALWAYS_REPLACE. If no value is provided for the parameter  NEVER_REPLACE is selected.

  • NEVER_REPLACE: It will not replace the tables in Presto and base views at Denodo if they already exist.

  • REPLACE_IF_NEW_COLUMNS: It will replace the tables in Presto, and their corresponding base views in Denodo, if new columns are detected for the existing Presto tables.

This may affect existing Denodo base views that have been modified by the user, such as renamed fields, modifications in field type, cache configuration... since the base views will be replaced and this kind of changes will be lost.

Our recommendation is to always leave Denodo base views as they are and make changes only in the views that derive from these base views.

  • ALWAYS_REPLACE: It will replace the tables in Presto, and their corresponding base views in Denodo, if any kind of change is detected in the existing tables in Presto: new columns, deleted columns, different types of columns, etc.

This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration... since the base views will be replaced and this kind of changes will be lost.

Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.

  • included_pattern: a regular expression that will select only certain files or folders in the scanning process. For example: (.*)invoices(.*)

  • baseview_folder: all base views created by the stored procedure will be placed in this folder when this parameter is specified. The folder will be created by the stored procedure if necessary.

  • baseview_prefix: all base views created by the stored procedure will be prefixed with this value when this parameter is specified. When this parameter is not set and a base view with an invalid name is to be created, the prefix "bv" will be added.

  • analyze_tables: whether or not to execute the ANALYZE command after each table creation.

This command collects statistical information about the data that the Presto optimizer uses to plan the query, based on cost strategies.

The output of this stored procedure is EXECUTED_COMMAND, which shows the DDL statements executed successfully in Presto and Denodo:

  • the SQL of each tables created in Presto

  • the  VQL of each Presto wrapper and base view created in Denodo

Script

The metadata discovery script is distributed in the metadata-discovery/bin folder. A configuration file is required as input:

$ denodo-presto-metadata-discovery.sh ../conf/configuration.properties

Configuration file

dirToExplore=s3a://example-bucket/dir/

credentials.xml.path=../conf/core-site.xml

#includedPattern=(.*)blahblah(.*)

# Denodo connection properties

denodo.jdbcUrl=jdbc:vdb://localhost:9999/presto_tests

denodo.username=admin

denodo.password=admin

# Presto connection properties

#     Requirements for this code to connect to Presto:

#     If the certificate used by the Presto server is signed by a private authority:

#     - import certificate.crt in the TrustStore of the JVM

#       (the one that runs this code and the Denodo's JVM)

#     OR

#     - use the JDBC URI parameters (SSLTrustStorePath and

#       SSLTrustStorePassword) to provide the path

#       to a TrustStore with the certificate.crt

presto.jdbcUrl=jdbc:presto://localhost:8443/hive/default?SSL=true\

&SSLTrustStorePath=truststore-path&SSLTrustStorePassword=truststore-pwd

presto.username=presto

presto.password=pr3st%

# Default values

denodo.datasource.name=ds_presto_k8s

denodo.datasource.folder=/01 - Connectivity/01 - Data Sources

denodo.baseview.folder=/01 - Connectivity/02 - Base Views

denodo.baseview.prefix=bv_

presto.analyze.tables=true

# Possible values: NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, REPLACE_ALWAYS

metadata.duplicated.strategy=NEVER_REPLACE

configuration.properties

  • dirToExplore: directory of the distributed storage path (s3a, adl, wasb, abfs), to start the metadata discovery process.

  • credentials.xml.path: core-site.xml with the distributed storage credentials.

  • includedPattern: a regular expression that will select only certain files or folders in the scanning process. For example: (.*)invoices(.*).

  • Denodo connection properties: JDBC URI, user and password.

  • Presto Connection properties: JDBC URI, user and password.

  • denodo.datasource.name: the name of the Presto data source that will be created in Denodo by this script.

  • denodo.datasource.folder: the Presto data source created in Denodo by the script will be placed in this folder when this parameter is specified. The folder will be created by the script if needed.

  • denodo.baseview.folder: all base views created by the stored procedure will be placed in this folder when this parameter is specified. The folder will be created by the script if necessary.

  • denodo.baseview.prefix: all base views created by the stored procedure will be prefixed with this value when this parameter is specified. When this parameter is not set and a base view with an invalid name is to be created, the prefix "bv" will be added.

  • presto.analyze.tables: whether or not to execute the ANALYZE command after each table creation. This command collects statistical information about the data that the Presto optimizer uses to plan the query, based on cost strategies.  Default is true.

  • metadata.duplicated.strategy: chooses what to do with the metadata that already exists in Presto and Denodo. Possible values are NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, ALWAYS_REPLACE. If no value is provided for the parameter  NEVER_REPLACE is selected.

  • NEVER_REPLACE: It will not replace the tables in Presto and data source and base views at Denodo if they already exist.

  • REPLACE_IF_NEW_COLUMNS: It will replace the tables in Presto, and their corresponding base views in Denodo, if new columns are detected for the existing Presto tables.

This may affect existing Denodo base views that have been modified by the user, such as renamed fields, modifications in field type, cache configuration... since the base views will be replaced and this kind of changes will be lost.

Our recommendation is to always leave Denodo base views as they are and make changes only in the views that derive from these base views.

  • ALWAYS_REPLACE: It will replace the tables in Presto, and their corresponding base views in Denodo, if any kind of change is detected in the existing tables in Presto: new columns, deleted columns, different types of columns, etc.

This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration... since the base views will be replaced and this kind of changes will be lost.

Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.

Output

The output of the process would be:

  • the list of tables created in Presto

  • the Presto data source created in Denodo, with the presto.jdbcUrl configured as the Database URI of the data source in Denodo.

  • the Presto base views created in Denodo

METADATA DISCOVERY RESULTS

******************************************

Commands executed at Presto:

------------------------------------------

CREATE TABLE "customer" (

ANALYZE "customer"

CREATE TABLE "store" (

ANALYZE TABLE "store"

CREATE TABLE "store_sales" (

ANALYZE "store_sales"

CREATE TABLE "date_dim" (

ANALYZE "date_dim"

------------------------------------------

Commands executed at Denodo (data source):

--------------------------------------------------------------

CREATE FOLDER '/01 - Connectivity';

CREATE FOLDER '/01 - Connectivity/01 - Data Sources';

CREATE DATASOURCE JDBC "ds_presto_k8s"

------------------------------------------

Commands executed at Denodo (base views):

-------------------------------------------------------------

CREATE OR REPLACE FOLDER '/01 - Connectivity/02 - Base Views';

CREATE OR REPLACE WRAPPER JDBC bv_customer

CREATE OR REPLACE TABLE bv_customer I18N us_pst (

CREATE OR REPLACE WRAPPER JDBC bv_date_dim

CREATE OR REPLACE TABLE bv_date_dim I18N us_pst (

CREATE OR REPLACE WRAPPER JDBC bv_store

CREATE OR REPLACE TABLE bv_store I18N us_pst (

CREATE OR REPLACE WRAPPER JDBC bv_store_sales

CREATE OR REPLACE TABLE bv_store_sales I18N us_pst (

                        Output of the metadata discovery process

Cost Based Optimization

The Denodo cost-based optimization process needs the statistics of at least all the base views involved in the query (the average size of a field, the maximum and minimum values, the number of distinct values, etc.) to estimate the cost of possible execution plans. Statistics allow Denodo to decide whether to apply optimizations such as join reordering, choosing a join execution method (merge, hash, nested,...), applying ‘aggregation pushdown’, or performing data movement automatically. It is therefore strongly recommended that for each base view created in Presto:

  1. Calculate statistics by calling the Denodo stored procedure COMPUTE_SOURCE_TABLE_STATS

  1. Gather statistics in Denodo

Appendix B: Hive tables syntax and properties

To access Parquet datasets from Denodo Embedded MPP, you need to map the files from AWS S3, Azure Data Lake, etc.. to Hive tables in the Metastore. The mapping is done through the CREATE TABLE sentence, where you have to provide the data schema, file format, and data location.

CREATE TABLE orders (

    orderkey bigint,

    custkey bigint,

    orderstatus varchar,

    totalprice double,

    orderpriority varchar,

    clerk varchar,

    shippriority integer,

    comment varchar

) WITH (

     external_location = 's3a://my_bucket/path/to/folder/',

     format = 'PARQUET'

);

  • external_location: The data location must be a bucket name or a bucket name and one or more folders, not a specific file.

Do not forget to use a slash at the end, otherwise an error will occur: Can not create a Path from an empty string.

  • format: Supported file formats are:

  • ORC
  • PARQUET
  • AVRO
  • RCBINARY
  • RCTEXT
  • SEQUENCEFILE
  • JSON
  • TEXTFILE
  • CSV

The WITH clause of CREATE TABLE can also be used to set other properties on the table. The available table properties are:

  • avro_schema_url (varchar): URI pointing to Avro schema for the table.

  • bucket_count (integer): Number of buckets.

  • bucketed_by (array(varchar)): Bucketing columns.

  • csv_escape (varchar): CSV escape character.

  • csv_quote (varchar): CSV quote character.

  • csv_separator (varchar): CSV separator character.

  • dwrf_encryption_algorithm (varchar): Algorithm used for encryption data in DWRF.

  • dwrf_encryption_provider (varchar): Provider for encryption keys in provider.

  • encrypt_columns (array(varchar)): List of key references and columns being encrypted. Example: ARRAY['key1:col1,col2', 'key2:col3,col4'].

  • encrypt_table (varchar): Key reference for encrypting the whole table.

  • external_location (varchar): File system location URI for external table.

  • format (varchar): Hive storage format for the table. Possible values: [ORC, PARQUET, AVRO, RCBINARY, RCTEXT, SEQUENCEFILE, JSON, TEXTFILE, CSV].

  • orc_bloom_filter_columns (array(varchar)): ORC Bloom filter index columns.

  • orc_bloom_filter_fpp (double): ORC Bloom filter false positive probability.

  • partitioned_by (array(varchar)): Partition columns.

  • preferred_ordering_columns (array(varchar)): Preferred ordering columns for unbucketed table.

  • sorted_by (array(varchar)): Bucket sorting columns.

Partition Tables

Hive tables can be partitioned, as this can improve query performance, especially when the columns being queried are partitioned columns. This means that, for each column value of the partitioned column, there will be a separate folder under the dataset location.

The Denodo Embedded MPP can create tables from partitioned Parquet datasets that follow the Hive naming convention for folders, i.e.: columnName=value, elb_name=abc.

.../weblogs/elb_name=abc/elb_response_code=200/…

.../weblogs/elb_name=abc/elb_response_code=404/…

.../weblogs/elb_name=xyz/elb_response_code=200/…

CREATE TABLE IF NOT EXISTS weblogs (

    request_timestamp varchar,

    request_ip varchar,

    request_port integer,

    backend_ip varchar,

    backend_port integer,

    request_processing_time double,

    backend_processing_time double,

    client_response_time double,

    received_bytes bigint,

    sent_bytes bigint,

    request_verb varchar,

    url varchar,

    protocol varchar,

    elb_name varchar,

    elb_response_code varchar)

WITH (

    format = 'parquet',

    partitioned_by = ARRAY['elb_name', 'elb_response_code'],

    external_location = s3://my_bucket/path/to/folder/'

);

Note that Hive tables require the partition columns to be the last columns of the CREATE TABLE statement.

However, if we query a partition table after creating it, we will see that it returns zero results. The Hive Metastore needs to discover what partitions exist by querying the underlying storage system. The Presto procedure system.sync_partition_metadata(schema_name, table_name, mode) is responsible for detecting the existence of partitions.

When inserting data into new partitions we need to invoke the sync_partition_metadata procedure again, to discover the new records.

call system.sync_partition_metadata('default', 'weblogs', 'FULL', false);

The embedded_mpp data source in Denodo calls sync_partition_metadata transparently, after each partition table creation.

Appendix C: Delta Lake Tables Properties

The optional WITH clause of the CREATE TABLE can be used to set properties on the newly created table or on single columns.

Available table properties are:

  • format (varchar): File format for the table.

  • external_location (varchar): File system location URI for the table.

Appendix D: Iceberg Tables Properties

The optional WITH clause of the CREATE TABLE can be used to set properties on the newly created table or on individual columns.

The available table properties are:

  • format (varchar): File format for the table.

  • format_version (varchar): Format version for the table.

  • location (varchar): File system location URI for the table.

  • partitioning (array(varchar)): Partition transforms.