Presto is a high-performance, distributed SQL query engine for big data. It was developed by Facebook but is now open source.
The main purpose of deploying Presto on Kubernetes is to access huge datasets in an efficient manner, using only SQL.
The steps to follow are:
This document focuses on datasets stored in AWS S3, as an example, but the Presto cluster can read data from many distributed storage systems, such as:
The architecture of the Presto cluster looks like this:
Presto cluster architecture
Presto uses its Hive Connector to access datasets stored in S3. This connector relies on the Hive Metastore to map S3 data files to schemas and tables. The Hive Metastore saves the results of this mapping in an RDBMS, which can be either external to the cluster and provided by the user, or a PostgreSQL instance internal to the cluster.
When the internal PostgreSQL is used, the cluster uses a Persistent Volume to ensure the persistence of the metadata. This way, all the services of the cluster (Presto, Hive Metastore and Hive's database) can go down temporarily without losing any of the table definitions that Presto relies upon.
The Denodo Presto cluster is designed to run on Kubernetes, either locally (on Docker Desktop or Minikube, for example) or in the cloud with:
You can run Kubernetes in Docker Desktop for testing purposes, as it includes a standalone Kubernetes server (single-node cluster) and a client. To enable it, launch Docker Desktop, go to Settings > Kubernetes and choose Enable Kubernetes.
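To verify that the single-node cluster is up, you can point kubectl at the Docker Desktop context and list the nodes; the context name docker-desktop is the usual default, but it may differ in your installation:

$ kubectl config use-context docker-desktop
$ kubectl get nodes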
With Docker Desktop you can also push your images to a local registry, just as you would with cloud providers such as Amazon, Azure, etc. To run this local registry, execute:
$ docker run -d -p 5000:5000 --name registry registry:2
The URI of this local registry is the default value of the DOCKER_IMAGE_PREFIX variable in the cluster.sh script of the distribution:
DOCKER_IMAGE_PREFIX=localhost:5000
Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service that allows you to run Kubernetes on AWS.
These are the steps for setting up an EKS cluster:
✸ Before the cluster.sh script can pull the images from Amazon ECR, you must create three repositories and push the distributed images (in the images folder) to them: postgresql (optional), hive-metastore and presto.
The Amazon Elastic Container Registry URI of the DOCKER_IMAGE_PREFIX variable in the cluster.sh script has the following format:
DOCKER_IMAGE_PREFIX=<aws_account_id>.dkr.ecr.<region>.amazonaws.com
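For example, assuming the AWS CLI is installed and configured, the three repositories mentioned in the note above could be created, and Docker authenticated against ECR, roughly like this (the account id and region are placeholders):

$ aws ecr create-repository --repository-name postgresql
$ aws ecr create-repository --repository-name hive-metastore
$ aws ecr create-repository --repository-name presto
$ aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com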
For detailed information see https://docs.aws.amazon.com/eks/latest/userguide/what-is-eks.html.
Azure Kubernetes Service (AKS) is a managed Kubernetes service that allows you to run Kubernetes on Microsoft Azure.
The steps for setting up an AKS cluster are the following:
The Azure Container Registry URI of the DOCKER_IMAGE_PREFIX variable in the cluster.sh script has the following format:
DOCKER_IMAGE_PREFIX=<registry>.azurecr.io
For detailed information see https://docs.microsoft.com/en-us/azure/aks/tutorial-kubernetes-prepare-app
Red Hat OpenShift is an enterprise-ready Kubernetes container platform.
The steps for setting up an OpenShift cluster are the following:
The OpenShift Container Registry URI of the DOCKER_IMAGE_PREFIX variable in the cluster.sh script has the following format:
DOCKER_IMAGE_PREFIX=default-route-openshift-image-registry.apps.
For detailed information see https://docs.openshift.com/
S3-compatible storage allows access to and management of the data it stores over an S3-compliant interface, e.g. MinIO, IBM Cloud Object Storage, Dell EMC ECS, etc.
In order to connect to S3-compatible storage from the Denodo Presto cluster, the following information is required:
Before building and deploying the Presto cluster you should modify the following configuration:
The Java Keystore contains a self-signed certificate, certs/certificate.crt, that is distributed ONLY for testing purposes. In production you are encouraged to use a certificate issued by a CA and either import it, with its private key, into the Java Keystore of the Presto cluster (presto/conf/presto.jks, password sslpassphrase) or replace that keystore entirely.
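For instance, assuming your CA-issued certificate and its private key are available as a PKCS#12 bundle (the file name below is a placeholder), the import into the distributed keystore could be done with keytool along these lines:

$ keytool -importkeystore -srckeystore my-ca-cert.p12 -srcstoretype PKCS12 \
    -destkeystore presto/conf/presto.jks -deststoretype JKS -deststorepass sslpassphrase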
The Presto Cluster does not include a Persistent Volume object, since the user instantiating it may not have permission to create Persistent Volumes.
Therefore, there are two options:
Many cluster environments have a default StorageClass installed, or the administrators can create their own default StorageClass.
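A quick way to check whether your cluster already has a default StorageClass (it is flagged as "(default)" in the output), and to mark one as default if needed, is sketched below; the StorageClass name is a placeholder:

$ kubectl get storageclass
$ kubectl patch storageclass <name> -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'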
This SQL file contains the DDL for the tables that will be created when the Presto cluster starts. The distributed schema.sql contains some CREATE TABLE statements commented out as examples, so no tables will be created unless you modify this file.
Alternatively, you can use your favorite SQL client to create any table you want when the cluster is running. Or, make use of the metadata discovery tool, explained later in this document.
If you want to use a PostgreSQL or MySQL database external to the cluster, you can configure it using the METASTORE_CONNECTION_URL and METASTORE_CONNECTION_DRIVER_NAME variables of the cluster.sh script.
In addition, there is an initialization script for the external RDBMS, MySQL or PostgreSQL, included in the hive-metastore/scripts directory that should be run on that RDBMS.
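As an illustration, for an external PostgreSQL the two variables in cluster.sh might be set like this (the host, port and database name are hypothetical and must be adjusted to your environment):

METASTORE_CONNECTION_URL=jdbc:postgresql://my-external-postgres:5432/metastore
METASTORE_CONNECTION_DRIVER_NAME=org.postgresql.Driver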
Let's see now how to deploy Presto on Kubernetes!
The script cluster.sh, in the root folder of the distribution, automates the whole process explained in the Long Version section.
Usage: cluster.sh delete | deploy OPTION | start | stop
OPTIONS:
  --abfs-storage-account   Azure Data Lake Gen 2 Storage Account
  --abfs-storage-key       Azure Data Lake Gen 2 Storage Key
  --adl-client-id          Azure Data Lake Gen 1 Client Id
  --adl-client-secret      Azure Data Lake Gen 1 Client Secret
  --adl-tenant-id          Azure Data Lake Gen 1 Tenant Id
  --s3-access-key          AWS Access Key ID
  --s3-secret-access       AWS Secret Access Key
  --wasb-storage-account   Azure Blob Storage Account
  --wasb-storage-key       Azure Blob Storage Key
  --credstore-password     Password to protect the credentials provider
  --metastore-password     Hive Metastore database password when using an external database
The cluster.sh script has three prerequisites:
$ docker load < postgresql-image-v<version>.tar.gz
$ docker tag postgresql:v<version> <repository>/postgresql:v<version>
$ docker push <repository>/postgresql:v<version>
$ docker load < hive-metastore-image-v<version>.tar.gz
$ docker tag hive-metastore:v<version> <repository>/hive-metastore:v<version>
$ docker push <repository>/hive-metastore:v<version>
$ docker load < presto-image-v<version>.tar.gz
$ docker tag presto:v<version> <repository>/presto:v<version>
$ docker push <repository>/presto:v<version>
And the following variables that you can tune:
By default, the PostgreSQL internal to the cluster will be used: jdbc:postgresql://postgresql:5432/metastore
org.postgresql.Driver for PostgreSQL and org.mariadb.jdbc.Driver for MySQL.
cluster.sh deploy [OPTION]: deploys a cluster that accesses datasets in S3, Azure Blob Storage, Azure Data Lake Gen 1 or Azure Data Lake Gen 2, depending on the credentials specified by the user using the options available.
S3 datasets
There are two ways to access the S3 datasets:
You should also check that the DOCKER_IMAGE_PREFIX variable in cluster.sh points to a valid container registry.
The Amazon Elastic Container Registry URI has the following format: "<aws_account_id>.dkr.ecr.<region>.amazonaws.com".
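For example, a deployment that reads S3 datasets using access keys might be launched like this (all values are placeholders):

$ ./cluster.sh deploy --s3-access-key <AWS_ACCESS_KEY_ID> \
    --s3-secret-access <AWS_SECRET_ACCESS_KEY> \
    --credstore-password <credentials-provider-password>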
Azure Blob Storage datasets
Supply these options to the cluster.sh script:
You should also check that the DOCKER_IMAGE_PREFIX variable in cluster.sh points to a valid container registry.
The Azure Container Registry URI has the following format: "<registry>.azurecr.io".
Azure Data Lake Gen 1 datasets
Supply these options to the cluster.sh script:
You should also check that the DOCKER_IMAGE_PREFIX variable in cluster.sh points to a valid container registry.
The Azure Container Registry URI has the following format: "<registry>.azurecr.io".
Azure Data Lake Gen 2 datasets
Supply these options to the cluster.sh script:
You should also check that the DOCKER_IMAGE_PREFIX variable in cluster.sh points to a valid container registry.
The Azure Container Registry URI has the following format: "<registry>.azurecr.io".
cluster.sh start: recreates a previously stopped cluster maintaining its Presto table definitions.
cluster.sh stop: deletes the cluster but keeps the table definitions that Presto relies upon.
cluster.sh delete: removes the cluster and all the configuration created for it, destroying the metadata persistence, that is, the Presto table definitions.
$ docker login ...
$ docker load < postgresql-image-v<version>.tar.gz
$ docker tag postgresql:v<version> <repository>/postgresql:v<version>
$ docker push <repository>/postgresql:v<version>
$ docker load < hive-metastore-image-v<version>.tar.gz
$ docker tag hive-metastore:v<version> <repository>/hive-metastore:v<version>
$ docker push <repository>/hive-metastore:v<version>
$ docker load < presto-image-v<version>.tar.gz
$ docker tag presto:v<version> <repository>/presto:v<version>
$ docker push <repository>/presto:v<version>
Then, create the secrets with the credentials file and the password protecting it:
$ kubectl create secret generic file-credentials --from-file="$CREDS_PATH"/creds.jceks
$ kubectl create secret generic hadoop-credentials --from-literal=HADOOP_CREDSTORE_PASSWORD="${HADOOP_CREDSTORE_PASSWORD}"
$ cd denodo-presto-k8scluster-<VERSION>
$ kubectl create configmap hive-metastore-bin --from-literal=METASTORE_CONNECTION_URL="${METASTORE_CONNECTION_URL}" \
    --from-literal=METASTORE_CONNECTION_DRIVER_NAME="${METASTORE_CONNECTION_DRIVER_NAME}"
$ kubectl create configmap hive-metastore-conf --from-file=hive-metastore/conf/
$ kubectl create configmap presto-conf --from-file=presto/conf/
$ cd denodo-presto-k8scluster-<VERSION>
$ kubectl apply -f hive-metastore-pvc.yaml

$ cd denodo-presto-k8scluster-<VERSION>
$ kubectl apply -f postgresql.yaml

$ cd denodo-presto-k8scluster-<VERSION>
$ kubectl apply -f hive-metastore.yaml

$ cd denodo-presto-k8scluster-<VERSION>
$ kubectl apply -f presto.yaml
Kubernetes pods status
Once deployed, Presto is accessible at External-IP:8443 of the presto service. Run:
$ kubectl get svc
to find out the external IP.
Kubernetes services status
Presto is accessible at:
Presto Web UI: https://51.105.112.6:8443/ui/
JDBC URI:      jdbc:presto://52.5.134.190:8443/hive/default?SSL=true
               ↓ (TLS certificate)
               jdbc:presto://presto-denodo:8443/hive/default?SSL=true
User:          presto
Password:      pr3st%
/etc/hosts file
Do not forget to delete all the resources you have created when you no longer need the cluster.
$ kubectl delete -f presto.yaml

$ kubectl delete -f hive-metastore.yaml

$ kubectl delete -f postgresql.yaml

$ kubectl delete -f hive-metastore-pvc.yaml
If you are using the PostgreSQL internal to the cluster and you want to delete the persistent volume but save the cluster metadata for a later deployment, you need to dump the PostgreSQL database of the Hive Metastore before deleting the cluster. This dump generates a text file with SQL commands that, when fed back to the Hive Metastore, will recreate the database in the same state it was in at the time of the dump.
Dump in the old cluster:
$ kubectl exec <PostgreSQL Pod> -- bash -c "PGPASSWORD=hive pg_dump -c -U hive -h localhost metastore" > database.sql
Restore in the new cluster:
$ cat database.sql | kubectl exec -i <PostgreSQL Pod> -- bash -c "PGPASSWORD=hive psql -U hive -h localhost -d metastore"
✸ Note that restoring the database is the first thing you should do, before creating any new tables in the Presto cluster.
When talking about Presto, fewer, bigger machines are better than more, smaller machines. It is recommended that you start with big, balanced machines, like these in AWS:
Presto requires a certain amount of CPU to process the data for a given query. Generally, if you double the CPU in the cluster while maintaining the same memory, the query will take half the time.
So, as a generalization, more CPUs mean shorter queries.
Memory is needed for the data structures required to run the query, mainly for JOIN, GROUP BY, and ORDER BY operations.
As a generalization, allocate 80% of the machine memory to the Java heap (the -Xmx option in presto/conf/jvm.config); this is done automatically by the cluster.sh script of this distribution.
Other memory-related properties you can configure are query.max-memory and query.max-memory-per-node. This distribution uses their default values, but they can be tuned in presto/conf/config.properties.coordinator and presto/conf/config.properties.worker.
If the cluster needs to handle bigger queries, you should increase query.max-memory. The recommendation is to run EXPLAIN ANALYZE with these big queries and examine the query plan to find the peak memory usage.
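As a minimal sketch, assuming the Presto CLI is installed and you have a trust store containing the cluster certificate, such an inspection could look like this (the host, trust store values and the query itself are placeholders; orders is the example table used later in this document):

$ presto --server https://presto-denodo:8443 --catalog hive --schema default \
    --user presto --truststore-path /path/to/truststore.jks --truststore-password <pwd> \
    --execute "EXPLAIN ANALYZE SELECT count(*) FROM orders"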
If you are suffering from limited network or I/O bandwidth, it is recommended that you compress your files:
Also, avoid small files: any file smaller than 8 MB is considered small, and hundreds of MB is the recommended size.
Do not use tiny row groups: the default is 128 MB, but some tools generate Parquet files with a typical size of 4k, causing bad I/O patterns in Presto.
When you deploy the Presto Cluster on Kubernetes, you define how many Presto workers (NUM_WORKERS) you would like to run. But you can scale the Presto cluster afterwards, increasing or decreasing the number of Presto workers.
You can scale the cluster manually with the kubectl scale command, changing the number of Presto workers (replicas) you want to run:
$ kubectl scale deployment/presto-worker --replicas=6
Needless to say, the cluster should have sufficient resources available.
You can scale the cluster in an automatic manner with the kubectl autoscale command, configuring a maximum and minimum number of replicas for your Presto workers, and a CPU utilization target. The autoscaler (HorizontalPodAutoscaler) will periodically adjust the number of replicas to match the average CPU utilization that you have specified:
$ kubectl autoscale deployment/presto-worker --min=2 --max=6 --cpu-percent=80
You can check the current status of the HorizontalPodAutoscaler running:
$ kubectl get hpa
The output is similar to:
NAME            REFERENCE                  TARGETS   MINPODS   MAXPODS   REPLICAS
presto-worker   Deployment/presto-worker   22%/80%   2         6         2
There are two requirements to ensure that the HorizontalPodAutoscaler (HPA), and therefore the kubectl autoscale command, works:
See https://github.com/kubernetes-sigs/metrics-server#installation for instructions on how to deploy the Metrics Server add-on. Metrics Server provides CPU/Memory based horizontal autoscaling.
See the image below for an example of how to configure the presto-worker deployment in the presto-template.yaml file:
Note that resources are not configured by default; we leave this setting as a conscious choice for the cluster administrator.
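If you prefer not to edit presto-template.yaml directly, an equivalent way to give the presto-worker deployment the CPU and memory requests that the HPA relies on is kubectl set resources; the values below are purely illustrative:

$ kubectl set resources deployment/presto-worker \
    --requests=cpu=2,memory=8Gi --limits=cpu=4,memory=16Gi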
To access datasets from Presto, you have to map the AWS S3 data files (or Azure Blob Storage, Azure Data Lake, etc.) to tables in the Hive Metastore. The mapping is done through the CREATE TABLE statement, where you have to provide the schema of the data, the file format, and the data location.
CREATE TABLE orders (
    orderkey bigint,
    custkey bigint,
    orderstatus varchar,
    totalprice double,
    orderpriority varchar,
    clerk varchar,
    shippriority integer,
    comment varchar
)
WITH (
    external_location = 's3a://my_bucket/path/to/folder/',
    format = 'ORC'
);
✸ Don't forget to use a trailing slash in external_location; otherwise Presto will throw the following error: Can not create a Path from an empty string.
The WITH clause of the CREATE TABLE can also be used to set other properties on the table. The available properties are described in the Appendix: Hive Table Properties.
The file presto/conf/schema.sql contains the CREATE TABLE statements that will be executed immediately after starting the Presto cluster. Alternatively, you can use your favorite SQL client to create any table you want while the cluster is running. Or execute the metadata discovery tool explained in the next section.
The metadata discovery tool explores the distributed storage, starting from a folder given by the user, and for every Parquet dataset it finds, in a folder or subfolder, it creates the corresponding table in Presto. Then it calls a Denodo stored procedure that introspects Presto and creates the Denodo base views for all the new tables created in Presto.
The metadata discovery tool can create Presto tables from partitioned datasets that follow the Hive naming convention for folders, that is, columnName=value, e.g. elb_name=abc:
.../weblogs/elb_name=abc/elb_response_code=200/…
.../weblogs/elb_name=abc/elb_response_code=404/…
.../weblogs/elb_name=xyz/elb_response_code=200/…
The metadata discovery tool invokes sync_partition_metadata automatically, after each partition table creation.
The tool is distributed both as a script and as a Denodo stored procedure.
The metadata discovery script is distributed in the folder metadata-discovery/bin. A configuration file is required as input:
$ denodo-presto-metadata-discovery.sh ../conf/configuration.properties
dirToExplore=s3a://example-bucket/dir/
credentials.xml.path=../conf/core-site.xml
#includedPattern=(.*)blahblah(.*)

# Denodo connection properties
denodo.jdbcUrl=jdbc:vdb://localhost:9999/presto_tests
denodo.username=admin
denodo.password=admin

# Presto connection properties
# Requisites for this code to be able to connect to Presto:
# - import certificate.crt in the TrustStore of the JVM (the one
#   executing this code and the Denodo's JVM)
# OR
# - use the JDBC URI parameters (SSLTrustStorePath and
#   SSLTrustStorePassword) to provide the path
#   to a TrustStore with the certificate.crt
presto.jdbcUrl=jdbc:presto://localhost:8443/hive/default?SSL=true\
&SSLTrustStorePath=truststore-path&SSLTrustStorePassword=truststore-pwd
presto.username=presto
presto.password=pr3st%

# Default values
denodo.datasource.name=ds_presto_k8s
denodo.datasource.folder=/01 - Connectivity/01 - Data Sources
denodo.baseview.folder=/01 - Connectivity/02 - Base Views
denodo.baseview.prefix=bv_
presto.analyze.tables=true
# Possible values: NEVER_REPLACE, REPLACE_IF_NEW_COLUMNS, REPLACE_ALWAYS
metadata.duplicated.strategy=NEVER_REPLACE
configuration.properties
Requisites to be able to connect to Presto: either import certificate.crt into the TrustStore of the JVM executing the tool (and of the Denodo server's JVM), OR use the JDBC URI parameters (SSLTrustStorePath and SSLTrustStorePassword) to provide the path to a TrustStore containing certificate.crt.
The certs/certificate.crt is distributed ONLY for testing purposes. This certificate accepts localhost and presto-denodo as Presto host names:
In this case, you have to add an entry to the hosts file of the machine where the Denodo server is running, mapping presto-denodo to the IP that appears as the EXTERNAL-IP of the presto service.

Kubernetes services status

/etc/hosts file

But you are encouraged to use a certificate issued by a CA in production. That certificate is the one you will have to import into the TrustStore of the client code and, with its private key, into the Java Keystore of the Presto cluster (presto/conf/presto.jks) before deploying it.
This can affect existing base views in Denodo that have modifications made by the user, like renamed fields, modified field types, cache configuration, etc., as those base views will be replaced and these kinds of changes will be lost.
Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.
The output of the process would be:
METADATA DISCOVERY RESULTS
******************************************
Commands executed at Presto:
------------------------------------------
CREATE TABLE "customer" (
ANALYZE "customer"
CREATE TABLE "store" (
ANALYZE TABLE "store"
CREATE TABLE "store_sales" (
ANALYZE "store_sales"
CREATE TABLE "date_dim" (
ANALYZE "date_dim"
------------------------------------------
Commands executed at Denodo (data source):
--------------------------------------------------------------
CREATE FOLDER '/01 - Connectivity';
CREATE FOLDER '/01 - Connectivity/01 - Data Sources';
CREATE DATASOURCE JDBC "ds_presto_k8s"
------------------------------------------
Commands executed at Denodo (base views):
-------------------------------------------------------------
CREATE OR REPLACE FOLDER '/01 - Connectivity/02 - Base Views';
CREATE OR REPLACE WRAPPER JDBC bv_customer
CREATE OR REPLACE TABLE bv_customer I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_date_dim
CREATE OR REPLACE TABLE bv_date_dim I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_store
CREATE OR REPLACE TABLE bv_store I18N us_pst (
CREATE OR REPLACE WRAPPER JDBC bv_store_sales
CREATE OR REPLACE TABLE bv_store_sales I18N us_pst (
Output of the metadata discovery process
The DISCOVER_PRESTO_METADATA stored procedure is distributed in the folder metadata-discovery/sp/denodo-presto-k8scluster-{version}-jar-with-dependencies.jar.
To be able to use this stored procedure in Denodo, you first need to import the denodo-presto-k8scluster-{version}-jar-with-dependencies.jar file using the File > Extension management menu option of the VDP Administration.
Second, you have to create the stored procedure by clicking on the menu File > New > Stored procedure option, in the VDP Administration, and provide the following values:
Name: a name
Class name: com.denodo.connect.presto.MetadataDiscoveryStoredProcedure
Select Jars: denodo-discover-presto-metadata-<version>
Finally, you can execute the stored procedure:
SELECT * FROM discoverprestometadata()
WHERE dir_to_explore='s3a://bucket/dir/'
    and credentials_path='/absolute/path/to/core-site.xml'
    and datasource_name='the_presto_datasource'
    and duplicate_strategy='NEVER_REPLACE'
    and included_pattern=NULL
    and baseview_folder='/01 - Connectivity/02 - Base Views'
    and baseview_prefix='bv_'
    and analyze_tables=true
The input parameters required by the stored procedure are:
The base views created by this stored procedure will belong to this Presto data source.
The following input parameters are optional:
This can affect existing base views in Denodo that have modifications made by the user, like renamed fields, modified field types, cache configuration, etc., as those base views will be replaced and these kinds of changes will be lost.
Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.
This command collects statistical information about the data that the Presto optimizer uses to plan the query, based on cost strategies.
The output of this stored procedure is EXECUTED_COMMAND, which shows the DDL statements executed successfully at Presto and at Denodo:
If your queries are complex and include joins of large data sets, running ANALYZE on tables/partitions may improve query performance. The metadata discovery tool runs ANALYZE automatically, by default, after each table creation.
This command collects statistical information about the data that the Presto optimizer uses to plan the query, based on cost strategies.
Once again, you can use your favorite SQL client to analyze the tables you want when the cluster is running.
ANALYZE store_sales10;
To inspect the collected statistics, run:

SHOW STATS FOR store_sales10;

Table statistics

The image above shows an example of the statistics collected by Presto after running ANALYZE on a table.
Tables can have partitions, which can improve query performance, especially when the columns being queried are the partition columns. This means that, for each value of a partition column, there will be a separate folder under the location of the dataset.
Presto supports tables from partitioned datasets that follow the Hive naming convention for folders, that is, columnName=value, e.g. elb_name=abc:
.../weblogs/elb_name=abc/elb_response_code=200/…
.../weblogs/elb_name=abc/elb_response_code=404/…
.../weblogs/elb_name=xyz/elb_response_code=200/…
CREATE TABLE IF NOT EXISTS weblogs (
    request_timestamp varchar,
    request_ip varchar,
    request_port integer,
    backend_ip varchar,
    backend_port integer,
    request_processing_time double,
    backend_processing_time double,
    client_response_time double,
    received_bytes bigint,
    sent_bytes bigint,
    request_verb varchar,
    url varchar,
    protocol varchar,
    elb_name varchar,
    elb_response_code varchar
)
WITH (
    format = 'parquet',
    partitioned_by = ARRAY['elb_name', 'elb_response_code'],
    external_location = 's3://my_bucket/path/to/folder/'
);
✸ Note that Hive requires the partition columns to be the last columns in the CREATE TABLE statement.
However, if we query a partitioned table right after creating it, we will find that it returns zero results. The Hive Metastore needs to discover which partitions exist by querying the underlying storage system. The Presto procedure system.sync_partition_metadata(schema_name, table_name, mode) is in charge of detecting the existence of partitions.
When data is inserted in new partitions we need to invoke the sync_partition_metadata procedure again, to discover the new records.
call system.sync_partition_metadata('default', 'weblogs', 'FULL');
The metadata discovery tool invokes sync_partition_metadata automatically, after each partition table creation.
To establish a connection to Presto you have to create a JDBC data source in Denodo and fill all the required fields:
Note that, if you do not specify an alternative Java TrustStore in the Database URI with the parameters SSLTrustStorePath and SSLTrustStorePassword (see all the URI parameters accepted), you have to import the server certificate (certs/certificate.crt) into the Denodo server trust store.
The certs/certificate.crt is distributed ONLY for testing purposes. This certificate accepts localhost and presto-denodo as Presto host names:
In this case, you have to add an entry to the hosts file of the machine where the Denodo server is running, mapping presto-denodo to the IP that appears as the EXTERNAL-IP of the presto service.
Kubernetes services status
/etc/hosts file
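As an illustration (the IP is hypothetical; use the EXTERNAL-IP reported by kubectl get svc for the presto service), the hosts file entry would look like:

52.5.134.190    presto-denodo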
But you are encouraged to use a certificate issued by a CA in production. That certificate is the one you will have to import into the TrustStore of the Denodo server and, with its private key, into the Java Keystore of the Presto cluster (presto/conf/presto.jks) before deploying it.
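Whichever certificate you use, a minimal sketch of importing it into the trust store of the Denodo server's JVM could be the following, shown here with the testing certificate; the cacerts path and the default store password changeit are assumptions that depend on your installation:

$ keytool -importcert -alias presto-denodo -file certs/certificate.crt \
    -keystore <DENODO_HOME>/jre/lib/security/cacerts -storepass changeit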
Presto datasource in Denodo
Also, it is recommended to configure these two properties in the Source Configuration tab of the data source:
as these values are used by the Denodo optimizer for its estimations.
Source Configuration tab in Denodo
Once the data source is configured, click on "Create base View" to explore Presto schemas with their tables and their columns.
Presto tables in Denodo
Finally, you have reached your goal! You can run any query you want on your S3 datasets.
Query results in Denodo
The Presto data source can be used to perform bulk data loads.
Before configuring the Bulk Data Load in Denodo, you have to create a new schema in Presto that sets the location where the Parquet files created by Denodo will be stored.
For this you can use your favorite SQL client:
CREATE SCHEMA hive.<schema_name> WITH (location = '<filesystem_schema>://<host>/<folders>');
To configure the Bulk Data Load, check Use Bulk Data Load APIs of the Presto data source in its Read & Write tab, and fill the parameters: server timezone, storage path (adl://<adl_host>/<dirs> in the example), and the catalog and schema:
Then, depending on the file system chosen, you may have to add some Hadoop properties to configure the authentication. In the example, the properties to configure the access to Azure Data Lake Storage are:
Finally, press the Test bulk load button to check that all is working fine:
✸ When using S3 storage, you should set the Virtual DataPort property 'com.denodo.vdb.util.tablemanagement.sql.PrestoTableManager.useExternalTables' to 'false' if you receive the error: External location must be a directory.
The process of upgrading to a new version of the Denodo Presto Cluster on Kubernetes involves transferring the settings and the metadata from the version installed to the new one.
Step 1. Transferring the settings.
The configuration of the cluster resides in hive-metastore/conf and presto/conf. You need to be aware of the modifications you made in those locations and make these changes in the new version files before executing the deployment of the new cluster.
Step 2. Transferring the metadata.
In case you were using the PostgreSQL internal to the cluster, you need to save the metadata from the old cluster by dumping the PostgreSQL database of the Hive Metastore before deleting the cluster. This dump generates a text file with SQL commands that, when fed back to the Hive Metastore of the new cluster, will recreate the database in the same state it was in at the time of the dump, keeping the whole Presto schema metadata, that is, your datasets exposed as tables.
Dump in the old cluster:
$ kubectl exec <old Hive Metastore Pod> -- bash -c "PGPASSWORD=hive pg_dump -c -U hive -h localhost metastore" > database.sql
Restore in the new cluster:
$ cat database.sql | kubectl exec -i <new Hive Metastore Pod> -- bash -c "PGPASSWORD=hive psql -U hive -h localhost -d metastore"
✸ Note that restoring the database is the first thing you should do, before creating any new tables in the Presto cluster.
The optional WITH clause of the CREATE TABLE can be used to set properties on the newly created table or on single columns.
Available table properties are: