Parallel Processing¶
The Denodo optimizer provides native integration with several Massively Parallel Processing (MPP) systems to accelerate certain queries that require significant processing. In addition, Denodo includes embedded MPP capabilities.
Note
This feature is not available in Denodo Professional or Denodo Standard. See also Denodo Platform - Subscription Bundles.
The cost-based optimizer can detect operations that would benefit from using the configured MPP. Query processing is pushed down to the MPP engine when the query requires Denodo to process large amounts of data and that processing cannot be done in streaming mode. In that case, the processing would most likely require offloading part of the data to disk to avoid exhausting the server resources.
The optimizer decides when it is best to use the MPP on a per-query basis, based on the type of operations specified in the query, the view statistics and the server configuration. These are the types of operations that could benefit from its use:
Operations that can be parallelized by an MPP, like aggregations
Operations that, if post-processed by Denodo, would trigger swapping to disk:
Hash joins
GROUP BY
ORDER BY
DISTINCT
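For example, a query combining several of these operations could qualify for MPP acceleration. A minimal sketch, assuming two hypothetical federated views sales_ds1 and customers_ds2 holding large numbers of rows:
-- hash join, aggregation and sort over large federated inputs
SELECT c.region, SUM(s.amount) AS total_amount
FROM sales_ds1 s JOIN customers_ds2 c ON s.customer_id = c.id
GROUP BY c.region
ORDER BY total_amount DESC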
Take into account that the query optimizer will decide whether to apply this optimization based on the estimated query cost, which requires having access to up-to-date statistics on the following views:
Base views
Cached views
Flatten views
Visit When Should You Enable Automatic Cost-Based Optimization? for more information.
Finally, it is possible to disable this optimization for a specific query using the CONTEXT option CONTEXT('mpp_acceleration'='off').
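For instance, the following sketch (customer and sales are illustrative views, borrowed from the example later in this section) disables the optimization for one query:
SELECT c.id AS CustomerId, SUM(s.amount) AS TotalSales
FROM customer c JOIN sales s ON c.id = s.customer_id
GROUP BY c.id
CONTEXT('mpp_acceleration' = 'off')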
Data Movement vs MPP Movement vs Embedded MPP Acceleration¶
There are similarities among the Data Movement, MPP Movement and Embedded MPP Acceleration optimizations, as all of them consist of delegating the execution of a federated query to a data source, transferring some of the query data to it.
However, there are also major differences:
Data Movement: The query optimizer decides to delegate a federated JOIN, and potentially other operations on top of it, to one of the underlying data sources. This requires creating temporary tables on the target data source and transferring the content of the other views to it. The target data source can be any data source configured to enable data movements, including the one configured as the external MPP, if any.
MPP Movement: The query optimizer decides to delegate part of a query to an external MPP, including a federated join where none of the joined views belong to the MPP data source, or an operation that the optimizer estimates would require swapping data to disk.
Embedded MPP Acceleration: Applies to the same scenarios as Data Movement and MPP Movement, but in this case the MPP data source is the Embedded MPP, so the Virtual DataPort engine streams to the MPP engine the data that it cannot access directly. In addition, even if cost-based optimizations are disabled, the static optimizer will move the processing to the Embedded MPP when a join involves a branch that accesses it.
Due to their similarities, take into account the following:
Section Customize the Data Movement Table Creation Command applies to both Data Movement and external MPP Movement.
Sections Options of the CONTEXT Clause that Control a Data Movement and Restrict Data Movement Optimizations From Sensitive Data Sources apply to all three.
Finally, execute the following command to check the execution plan and optimizations the Virtual DataPort engine will choose for a certain query:
DESC QUERYPLAN <query>
The top node of the execution plan contains information about the optimizations applied to the query.
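For example, to inspect the plan chosen for the aggregation query used later in this section (customer and sales are illustrative view names):
DESC QUERYPLAN SELECT c.id AS CustomerId, SUM(s.amount) AS TotalSales FROM customer c JOIN sales s ON c.id = s.customer_id GROUP BY c.id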
Embedded MPP Acceleration¶
Denodo includes embedded Massively Parallel Processing (MPP) capabilities to improve performance in environments containing data in an object storage. For this purpose, Denodo embeds a customized version of Presto, an open source parallel SQL query engine that excels at accessing data lake content.
To deploy the Denodo Embedded MPP cluster, pull the Embedded MPP images from Denodo's Harbor Registry (https://harbor.open.denodo.com) or load them from the “Denodo Embedded MPP” Denodo Connect available at the Support Site. Then, follow the instructions in the Embedded MPP Guide.
The “Denodo Embedded MPP” Denodo Connect includes an MPP engine based on Presto that has been customized to interact with the Denodo Platform. Among other things, it configures the query optimizer to make use of this cluster for parallel processing (see section Denodo Embedded MPP for more information).
This configuration unlocks a new optimization technique that is not possible using an external MPP database. It is especially useful in scenarios where a query combines large amounts of data stored in an object storage like HDFS, S3 or ADLS with data in a different data source. In these cases, the query optimizer can decide to send the query to the MPP. The Denodo Embedded MPP can access the data in the object storage using its own engine, and it can access the data outside the object storage in streaming through Denodo, without the need to create temporary tables or files.
This way, the Denodo query engine combines its powerful optimization techniques, federation and security capabilities with parallel processing on big data.
Note
By default, this feature is only available for Virtual DataPort servers storing their metadata on an external database. This is necessary to make sure the Embedded MPP is able to access the data outside the object storage in an environment with several Virtual DataPort servers in a cluster. In environments with just one Virtual DataPort server, it is possible to avoid this restriction by executing:
SET 'queryOptimization.parallelProcessing.denodoConnector.enableUsingSharedMetadataOnly'='false';
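To restore the default behavior later, set the property back to 'true' (assumed to be the default, as the note above implies):
SET 'queryOptimization.parallelProcessing.denodoConnector.enableUsingSharedMetadataOnly'='true';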
Configure the Denodo Embedded MPP for query acceleration¶
In order to benefit from the Embedded MPP query acceleration capabilities, there are a few configuration steps to consider:
Check the connection from the Embedded MPP to Denodo: this is configured during the Embedded MPP cluster deployment. To verify it, check that the following command runs successfully using the SQL Client at https://presto-denodo:8443/ui/sql_client.html, the Presto CLI or any JDBC client:
SHOW SCHEMAS FROM denodo
Check the connection from Denodo to the Embedded MPP: Open the Embedded MPP data source and click on Test connection.
Validate the License requirements: Open the Embedded MPP data source and click on Validate MPP License.
Configure Bulk Data Load: Open the Embedded MPP data source and make sure Use bulk data load APIs is enabled and configured, and that the Test Bulk Load finishes successfully.
Configure the Query Optimizer: Navigate to Server configuration > Execution engine > Query optimizer and make sure:
‘Automatic simplification of queries’ and ‘Data movement’ are both on
Enable ‘Automatic cost-based optimization’ to maximize the benefits of Embedded MPP optimizations; otherwise, the optimizer will only use it in very limited scenarios. Visit When Should You Enable Automatic Cost-Based Optimization? for more information.
‘Parallel processing acceleration’ is enabled and shows On (Embedded MPP), as in the image from the previous section. If you do not see this option, but instead see the options from section Configure an External MPP for query acceleration, the Embedded MPP has not been correctly registered.
Note
If you have a cluster of Denodo servers, Embedded MPP Acceleration requires configuring them to store their metadata in an external database. In environments with just one Virtual DataPort server, it is possible to avoid this restriction by executing:
SET 'queryOptimization.parallelProcessing.denodoConnector.enableUsingSharedMetadataOnly'='false';
Parallel Processing Using an External MPP¶
Denodo supports integration with the following Hadoop-based MPP systems:
PrestoDB version 0.1x and 0.2x
Databricks
Impala version 2.3
Trino version 4xx
Spark version 1.5, 1.6, 2.x and 3.x
Let us consider the following example:
We have a view customer in the data source DS1 that contains 10 million rows.
We have a view sales, which is a union of two base views: recent_sales, which contains the sales of the most recent year, and old_sales, which contains the sales of previous years. The base view recent_sales is in the data source DS2 and contains one hundred million rows; old_sales is in the data source DS3 and contains one billion rows. Both base views have an attribute customer_id that indicates the customer a product was sold to in a particular sale. Therefore, on average, we have approximately 110 sales per customer (1.1 billion sales across 10 million customers).
The following query obtains the amount of sales of each customer:
SELECT c.id AS CustomerId, SUM(s.amount) AS TotalSales
FROM customer c JOIN sales s ON c.id = s.customer_id
GROUP BY c.id
The Denodo optimizer will apply the aggregation push down optimization. This will aggregate the sales in each data source, reducing the number of rows returned from the data sources.
With this optimization, approximately 30 million rows are transferred through the network: the 10 million customers plus 10 million aggregated rows from each sales database. But Denodo would still have to perform a join whose branches contain 10 and 20 million rows, respectively, and then aggregate the 20 million rows resulting from the join. The join and group by operations have to process a high volume of data, which requires swapping to disk to avoid excessive memory consumption.
Figure: Parallel Processing: aggregation push down
With the parallel processing optimization enabled, the query optimizer, after applying the aggregation push down, will move the customer data and the aggregated sales data to the MPP engine, executing the join and the aggregation leveraging the cluster resources.
Figure: Parallel Processing: aggregation push down and parallel processing
Configure an External MPP for query acceleration¶
In order to benefit from the MPP query acceleration capabilities, there are a few configuration steps to consider:
Configure Bulk Data Load: Open the data source and navigate to the Read & Write tab under Configuration:
Enable the option Use bulk data load APIs and follow the instructions in section Bulk Data Load on a Distributed Object Storage like HDFS, S3 or ADLS to configure it. Click on Test Bulk Load to validate it is working correctly.
In the Query optimizations settings section, select Allow creating temporary tables to allow parallel processing of any operation.
Configure the Query Optimizer:
Navigate to Server configuration > Execution engine > Query optimizer and make sure ‘Automatic simplification of queries’, ‘Automatic cost-based optimization’ and ‘Data movement’ are all on. In the Parallel processing acceleration section, select the data source.
Make sure Denodo has statistics for the views and they are up-to-date.
Use of an MPP Engine as Cache¶
The parallel processing capabilities of an MPP engine can be combined with the use of the same engine as the Denodo cache. This makes it possible to take advantage of cached data, which can be combined with the datasets moved to the MPP for query acceleration.
In some cases, a Denodo administrator might want to set up a different cache system for some databases, while at the same time allowing or disabling MPP query acceleration on those databases.
This section describes two examples of how to configure the server to achieve that behavior:
Server cache with MPP enabled, with some VDP databases using the same cache with MPP disabled.
Configure the server cache with the MPP engine, enabling parallel processing in the data source settings.
In the queries optimization section of the server configuration, enable parallel processing and select the cache data source.
Configure the databases with a custom cache data source, using the same connection parameters as the global cache, but without allowing MPP movements in the data source “Read & Write” configuration (do not select parallel processing in the query optimization settings).
Server cache with MPP enabled, with some databases using a different cache system, allowing also MPP movements to the global cache.
Create a data source configured with the connection parameters of the MPP, with parallel processing enabled.
In the server cache configuration, select that data source as the cache data source.
In the queries optimization section of the server configuration, enable parallel processing and select the same data source (with the option “Custom data source”).
Configure the databases with a different cache system using its own data source.
Force Movement of Views to the MPP Engine Using the CONTEXT Clause¶
It is possible to specify a CONTEXT clause when executing a query or creating a view, indicating the names of the views that should be moved to the MPP system. The syntax is similar to the clause used to specify data movements, with the difference that the name of the context token is MPPMOVEMENTPLAN.
The clause allows specifying one or several view names and, for each appearance of the view in the query, one of the following options:
(ON) to move the view to the MPP
(OFF) to disable movement of the view to the MPP
() to avoid indicating a preference for that occurrence of the view
SELECT c.id AS CustomerId, SUM(s.amount) AS TotalSales
FROM customer c JOIN sales s ON c.id = s.customer_id
GROUP BY c.id CONTEXT(MPPMOVEMENTPLAN = customer: (ON) sales: (ON))
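Conversely, a sketch based on the same syntax that prevents moving the customer view while leaving the decision about sales to the optimizer:
SELECT c.id AS CustomerId, SUM(s.amount) AS TotalSales
FROM customer c JOIN sales s ON c.id = s.customer_id
GROUP BY c.id CONTEXT(MPPMOVEMENTPLAN = customer: (OFF))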
