Data Lake: an overview
To overcome big data challenges, organizations are exploring data lakes as consolidated repositories of massive volumes of raw, detailed data of various types and formats. But creating a physical data lake presents its own hurdles, one of which is the need to store the data twice, which can lead to governance challenges with regard to data access and quality. Also, data lakes can become data silos, since they are often built to target particular departments, such as Marketing, and must subsequently be combined with other enterprise data (e.g., CRM, ERP, or other data lakes) for analysis.
Data lakes also present the following challenges:
- Finding the right data in a sea of unstructured or semi-structured information
- Data quality problems caused by stale data
- Balancing security and access for sensitive data
- Ensuring that queries run quickly enough to deliver timely results
Overcoming the Limitations
The Denodo Platform provides a cost-effective approach to combining, governing, and managing data in data lakes, and to overcoming the inherent challenges presented by physical data lake silos. To efficiently process the large data volumes often managed in a data lake, Denodo embeds a Presto-based MPP engine which, combined with Denodo's Smart Query Accelerator features, provides fast query performance at scale.
Denodo offers multiple methods to access the files stored in distributed storage, such as:
- Denodo Distributed File System Custom Wrapper
- JDBC data sources (Hive, Databricks, SAP HANA, Presto, etc.)
- Denodo Lakehouse Accelerator
Denodo includes a Massively Parallel Processing (MPP) engine, the Denodo Lakehouse Accelerator, based on Presto:
- Provides an easy and performant mechanism to access data stored in object storage.
- Deployment is done on Kubernetes via Helm charts.
- Requires a Denodo Enterprise Plus license subscription.
Prerequisites
If you want to test the samples included in this tutorial, you need the following prerequisites:
- A Denodo Platform installation with a Denodo Enterprise Plus license.
- A Denodo Lakehouse Accelerator cluster deployed and registered in your Denodo Platform installation.
- The Denodo Lakehouse Accelerator has to be connected to an Object Storage (e.g. Amazon S3 or Azure Data Lake Storage).
- A Denodo Platform with access to a data source containing the TPC-DS data set (for example, the PostgreSQL data source included in the Denodo Community Lab environment).
- Sample Parquet Files available here that you must upload to the Object Storage. In our example, we will upload them to a denodo-training bucket.

Now that we have the object storage and Parquet files set up, let's move on to our use case. We'll access these Parquet files through the Denodo Lakehouse Accelerator engine to explore how it works.
This section briefly summarizes how to deploy the Denodo Lakehouse Accelerator engine in a Kubernetes environment.
- Follow the instructions in the User Guide to deploy the Denodo Lakehouse Accelerator engine in the Kubernetes environment using Helm charts.

- Once the cluster is running, register it in the Denodo Platform by running the following command:
$ ./register.sh --register-user <user> --register-password <password>

- The registration process consists of the creation of a new database admin_denodo_mpp, a new user denodo_mpp_user, and a special data source called embedded_mpp in Denodo Platform.

- Open the Web Design Studio (the default URL is http://localhost:9090/denodo-design-studio/ in a local installation), navigate to the database admin_denodo_mpp, and open the data source embedded_mpp. Then click on the Validate Denodo Lakehouse Accelerator License button to check that the registration has been done correctly.

The Denodo Lakehouse Accelerator data source lets you graphically explore an Object Storage such as Amazon S3, Azure Data Lake Storage, or HDFS, and create base views over data stored in Parquet file or Delta Lake table format. Let's see how to connect the embedded_mpp data source to the Denodo Lakehouse Accelerator.
- To do so, from the Design Studio, open the embedded_mpp data source and add the routes you want to browse from that object storage in the Read & Write tab of the data source.

- Once you have saved the necessary credentials and routes, you can click on Create Base View to browse these routes and select the ones you want to import.

- While creating the base view, Denodo automatically detects the folders corresponding to tables in Parquet format (including those using Hive-style partitioning), Iceberg format, and Delta Lake format.
- Denodo will create the table in the Hive, Iceberg or Delta catalog. You can select the schema with the Target schema drop-down at the bottom of the Create Base View dialog.
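
The folder detection described above can be pictured with a small, self-contained Python sketch. It is not Denodo's actual detection logic: it only assumes the common Hive convention in which partition folders are named key=value, and it groups Parquet files by the folder that precedes the first partition segment. The function name group_parquet_tables and the example object keys are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_parquet_tables(paths):
    """Group Parquet file paths by table folder and collect Hive-style partitions.

    A path segment of the form ``key=value`` is treated as a partition;
    everything before the first such segment is treated as the table folder.
    """
    tables = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix != ".parquet":
            continue  # ignore non-Parquet objects such as _SUCCESS markers
        table_parts, partition = [], {}
        for segment in path.parent.parts:
            if "=" in segment:
                key, _, value = segment.partition("=")
                partition[key] = value
            elif not partition:
                table_parts.append(segment)
        tables["/".join(table_parts)].append(partition)
    return dict(tables)


# Hypothetical object keys inside the denodo-training bucket.
example_paths = [
    "tpcds/customer/part-0000.parquet",
    "tpcds/store_sales/ss_sold_year=2001/part-0000.parquet",
    "tpcds/store_sales/ss_sold_year=2002/part-0000.parquet",
    "tpcds/store_sales/_SUCCESS",
]

detected = group_parquet_tables(example_paths)
for table, partitions in detected.items():
    print(table, partitions)
```

In this sketch an unpartitioned table such as customer yields a single empty partition entry, while store_sales yields one entry per partition folder.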

To access the data in the object storage, we can query the base views created in Denodo in the previous step.
For instance, to access the customer information stored in the object storage, you can execute the base view bv_customer, created on top of the customer Parquet file.
- To do this, open the base view in the Design Studio, click on the Query icon on the view summary page, and then click Execute in the query panel:

- Once you click on the Execute button, Denodo sends a query to the Denodo Lakehouse Accelerator (Presto cluster) to retrieve the data from the customer Hive table created on top of the Parquet file stored in the object storage.

- Click on the Execution Trace button and select the MPP Route Plan to see the SQL sentence that Denodo is sending to Presto to retrieve the data:

To benefit from the Denodo Lakehouse Accelerator engine, we will now configure the Denodo query optimizer to consider it for query acceleration. To do so, go to Administration > Server Configuration > Query Optimizer and enable the Parallel processing acceleration option.

What are the advantages?
- This is useful in scenarios where a query combines large amounts of Parquet data stored in an Object Storage such as S3, Azure Data Lake or HDFS with data in a different data source.
- In these cases, the Denodo query optimizer may decide to send the query to the Denodo Lakehouse Accelerator. The Denodo Lakehouse Accelerator can access the data in Object Storage using its own engine and can access the data outside the Object Storage in streaming through Denodo, without the need to create temporary tables or files.
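
As a rough illustration of what accessing data "in streaming" without temporary tables or files means, the following Python sketch joins a small in-memory customer table with sales rows that arrive one at a time from a generator. It is only a conceptual picture, not how Denodo or Presto implement the technique; the function names and the sample rows are hypothetical, and the column names are borrowed from the query used later in this tutorial.

```python
from collections import defaultdict


def stream_store_sales():
    """Stand-in for rows streamed from the external source (hypothetical data)."""
    yield {"ss_customer_sk": 1, "ss_sales_price": 10.0}
    yield {"ss_customer_sk": 2, "ss_sales_price": 5.5}
    yield {"ss_customer_sk": 1, "ss_sales_price": 2.5}
    yield {"ss_customer_sk": 3, "ss_sales_price": 7.0}  # no matching customer


def streaming_join_total(customers, sales_stream):
    """Join streamed sales against an in-memory customer table and sum prices.

    Each streamed row is consumed once and discarded, so the streamed side is
    never written to a temporary table or file.
    """
    by_key = {c["c_customer_sk"]: c for c in customers}  # build side
    totals = defaultdict(float)
    for sale in sales_stream:  # probe side, one row at a time
        customer = by_key.get(sale["ss_customer_sk"])
        if customer is not None:
            key = (customer["c_customer_sk"], customer["c_birth_country"])
            totals[key] += sale["ss_sales_price"]
    return dict(totals)


customers = [
    {"c_customer_sk": 1, "c_birth_country": "SPAIN"},
    {"c_customer_sk": 2, "c_birth_country": "CHILE"},
]
totals = streaming_join_total(customers, stream_store_sales())
print(totals)
```

The point of the sketch is that only the aggregated totals are kept in memory; the streamed rows themselves are never materialized.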
How does the Denodo Lakehouse Accelerator work?
Now, let's go through an example to see how Denodo Lakehouse Accelerator works by joining data from the Object Storage with data from a relational database (tpc_ds).
- Use this VQL to create a base view over the sample PostgreSQL data source. Download the VQL file, then in the Design Studio navigate to File > Import and select the downloaded file.
- To take advantage of the Denodo Lakehouse Accelerator technique, it is necessary to store the metadata of Virtual DataPort in an external database.
- Once the VQL has been imported, execute the query below in the VQL Shell and review its execution trace to see how the Denodo Lakehouse Accelerator seamlessly accesses the data outside of the object storage:
SELECT
c_customer_sk, c_birth_country,
sum(bv_store_sales.ss_sales_price) AS total_bought
FROM bv_store_sales INNER JOIN admin_denodo_mpp.bv_customer
ON admin_denodo_mpp.bv_customer.c_customer_sk = bv_store_sales.ss_customer_sk
GROUP BY
admin_denodo_mpp.bv_customer.c_customer_sk,
admin_denodo_mpp.bv_customer.c_birth_country
- Execution trace of the query with Denodo Lakehouse acceleration: the query has been completely delegated to our Denodo Lakehouse Accelerator cluster!

- In the Execution Trace, click on the MPP Route Plan. In the SQL sentence you can see that Denodo is sending an equivalent query to the Presto cluster, where vdp_table is the customer table in Hive (from Parquet) and the other view is a temporary view created in Denodo over the store_sales view in the database admin_denodo_mpp, which Presto will access using the connector.
- Now let's execute the query again, this time disabling the MPP acceleration:
SELECT
c_customer_sk, c_birth_country,
sum(bv_store_sales.ss_sales_price) AS total_bought
FROM bv_store_sales INNER JOIN admin_denodo_mpp.bv_customer
ON admin_denodo_mpp.bv_customer.c_customer_sk = bv_store_sales.ss_customer_sk
GROUP BY
admin_denodo_mpp.bv_customer.c_customer_sk,
admin_denodo_mpp.bv_customer.c_birth_country
CONTEXT('mpp_acceleration'='off')

Now the query is not completely delegated to the Presto cluster. As you can see, the join is executed in Denodo with the data obtained from both data sources:
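
If you want to run this comparison repeatedly from a script, the CONTEXT clause can be appended to the query text programmatically. The helper below is a minimal sketch that only builds the VQL strings; with_context is a hypothetical name, and sending the statements to Denodo (for example through a JDBC or ODBC client) is left out on purpose.

```python
def with_context(query, **options):
    """Append a VQL CONTEXT clause built from keyword options."""
    if not options:
        return query
    pairs = ", ".join(f"'{name}'='{value}'" for name, value in options.items())
    return f"{query.rstrip().rstrip(';')}\nCONTEXT({pairs})"


# The join query used in this tutorial, without any CONTEXT clause.
BASE_QUERY = """SELECT c_customer_sk, c_birth_country,
       sum(bv_store_sales.ss_sales_price) AS total_bought
FROM bv_store_sales INNER JOIN admin_denodo_mpp.bv_customer
  ON admin_denodo_mpp.bv_customer.c_customer_sk = bv_store_sales.ss_customer_sk
GROUP BY admin_denodo_mpp.bv_customer.c_customer_sk,
         admin_denodo_mpp.bv_customer.c_birth_country"""

accelerated = with_context(BASE_QUERY)
not_accelerated = with_context(BASE_QUERY, mpp_acceleration="off")
print(not_accelerated)
```

Running both variants and comparing their execution traces shows the difference between the delegated plan and the plan where Denodo performs the join itself.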

Now, let's try to store data in the Object Storage using the Remote Table functionality of Denodo. The Embedded MPP data source can be used to store data on a Distributed Object Storage such as HDFS, Amazon S3 or ADLS.
How does it work?
When you create a remote table using the Denodo Lakehouse Accelerator as target data source, Denodo automatically performs a bulk data load on the Object Storage in order to upload the data in the desired format.
The following steps are performed by Denodo in the background when you create a Remote table:
- First, it generates temporary files containing the data to insert, in Parquet file format or Iceberg table format.
- Then, it uploads those files to the specific path configured for the Remote table (S3 or HDFS URI).
- Finally, it performs the necessary operations so that the database table takes its data from the path provided.
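
The three background steps above can be mimicked locally with the Python sketch below. It is purely illustrative and makes several simplifying assumptions: CSV stands in for Parquet to avoid external dependencies, a local directory stands in for the S3 or HDFS path, and a plain dictionary stands in for the catalog. The names bulk_load, rt_customer_totals and the sample rows are hypothetical and are not part of Denodo.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def bulk_load(rows, columns, table_name, storage_root, catalog):
    """Mimic the three background steps with local files (illustrative only)."""
    # Step 1: write the rows to a temporary file.
    # CSV stands in for Parquet here to keep the sketch dependency-free.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_file = Path(tmp_dir) / "part-00000.csv"
        with tmp_file.open("w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(columns)
            writer.writerows(rows)

        # Step 2: "upload" the file to the path configured for the table.
        # A local directory stands in for the S3 or HDFS URI.
        target_dir = Path(storage_root) / table_name
        target_dir.mkdir(parents=True, exist_ok=True)
        uploaded = Path(shutil.copy(tmp_file, target_dir))

    # Step 3: point the table definition at the uploaded location.
    catalog[table_name] = {"location": str(target_dir), "columns": list(columns)}
    return uploaded


storage_root = tempfile.mkdtemp(prefix="fake-object-storage-")
catalog = {}
uploaded_file = bulk_load(
    rows=[(1, "SPAIN", 12.5), (2, "CHILE", 5.5)],
    columns=("c_customer_sk", "c_birth_country", "total_bought"),
    table_name="rt_customer_totals",
    storage_root=storage_root,
    catalog=catalog,
)
print(uploaded_file, catalog)
```

The temporary file disappears once the upload has finished, which mirrors the idea that only the copy in the object storage path backs the table.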
To create a remote table in Denodo, we can follow the below steps:
- First, we have to enable the Bulk Data Load in the Denodo Embedded MPP data source.
- To configure it, open the embedded_mpp data source and navigate to the Read & Write tab in the data source configuration.
- Under the Write settings, enable the Use Bulk Data Load APIs option and enter the Object Storage URI of the S3 bucket to which Virtual DataPort will upload the data.
- In addition to that, configure the authentication details for the object storage path specified:

Now it's time to save the configuration and click Test bulk load configuration:

Ok, let's create a remote table for the query used in the Denodo Lakehouse Accelerator section.
- To create a remote table, click on the three-dots icon next to the embedded_mpp data source and, in the New menu, click Replication (remote table).
- Fill in the new remote table form, selecting the target Catalog and Schema in the Object Storage and the VQL query that Denodo will execute to get the data to be moved to the remote table:

- Click on the Create button in the Remote table form to create it:
- Once the remote table is created, the Parquet file will be created in the specified bucket:


You can validate the remote table by executing a query against the base view created on top of it:

Congratulations 🥳, you have learned about the Denodo Lakehouse Accelerator and its features. You can try other tutorials to learn more about other Denodo functionalities!
