Summary Rewrite Optimization¶
This section describes how the Denodo query optimizer uses the summary views to rewrite queries and some best practices.
Let us start with a few examples to understand what is the summary rewrite and how it works. To illustrate these examples, we will use entities from the TPC-DS benchmark, which models the decision support functions of a retail product supplier. These entities include “store_sales” containing the sales information, “date_dim” representing the date dimension, “store”, and so on.
Let us consider we create a summary containing the total sales aggregated by the store id, sold date id and year as follows:
SELECT store_sales.ss_store_sk AS store_id
,store_sales.ss_sold_date_sk AS date_id
,date_dim.d_year
,sum(store_sales.ss_net_paid_inc_tax) AS total_sales
,count(*) AS count_sales
FROM store_sales
INNER JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
GROUP BY store_sales.ss_store_sk
,store_sales.ss_sold_date_sk
,date_dim.d_year
Then the query optimizer can decide to use that partial information to speed up different queries. Let’s see a couple of examples:
Total sales in 2019
SELECT sum(store_sales.ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales
FROM store_sales INNER JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
WHERE date_dim.d_year = 2019
For this query, the query optimizer will use the total sales by date and store that is “cached” in the summary and just apply the filter to get the year 2019 and the final aggregation.
Total sales by store division
SELECT s_division_name, sum(store_sales.ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales
FROM store_sales INNER JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
JOIN store ON(store.s_store_sk = store_sales.ss_store_sk)
GROUP BY s_division_name
For this query, the query optimizer will use the same summary. It just needs to join the partial sales by store that are materialized in the summary, join that information with the “store” data to obtain the store division and perform the final aggregation by the division name.
These are just two examples of possible queries that could take advantage of this feature but there are many more. As you can see it is a very powerful resource as it is very flexible and can be used in many situations.
Enable/Disable this Feature¶
Smart query acceleration is enabled by default and you can enable/disable it at three levels:
Global level: in the menu Administration > Server configuration; tab Queries optimization.
For a specific database: in the menu Administration > Database management; in the dialog Edit Database.
For a specific query: add the parameter
'summary_rewrite'='on'|'off'
to the CONTEXT clause of the query.
Disabling this feature on a specific query or a database can be useful in these cases:
When you know the summaries are outdated.
To compare the performance of a query when using summaries and when not using them.
Take into account that even if the feature is enabled the query optimizer will evaluate a specific summary if:
‘Enable for query rewrite’ option is selected.
the property ‘Valid for query rewrite’ is ‘Valid’. This property is only Valid if the table exists and it has been successfully loaded.
In addition, as summary views do not have time to live for their materialized data, it is important to configure regular loads to make sure the summary is up-to-date. For more information about the summary management visit section Managing Summaries.
Main Use Cases¶
The most important use case for summary views is the one requiring self-service queries on a logical data warehouse. This is a situation where users request data that do not require to be full real time and they will access canonical views. Canonical views are views that correspond with business entities like sales, orders, customer, not auxiliary cached views or cached reports.
In this environment, summary views increase the performance of the query and reduce the load of the data sources involved in the queries.
Summaries also have a big impact is on hybrid environments. For example, during an ongoing migration process to the cloud (see image below). In this scenario, some data have been already migrated to the cloud, but not all. Thanks to the abstraction layer that Denodo provides, client applications are able to operate the same way without knowing these changes are happening. In this hybrid environment, it is a good idea to create some summaries in the cloud database combining data stored in some of the data sources that have not been migrated yet. By doing this, Virtual DataPort will be able to run most of the queries without having to access and transfer data from the on-premise databases.
Another example of hybrid environment would be a multi-cloud system. If you have data in different cloud providers, it is a good idea to have also a Denodo server/s on each cloud so the data is processed as close to the data as possible and minimize data transfer. In this scenario, it is also useful to create summaries on each location, containing common combinations with data in other locations. This way you can avoid accessing the other cloud or the on-premise data.
Caching vs Summary Rewriting¶
Summaries have some advantages over the traditional cache:
Users need to explicitly reference the cached view in order to use the precomputed data. The query optimizer analyzes and rewrites the incoming queries to take advantage of the data in the summaries whenever possible. It does so transparently, without user intervention; the user does not need to specify that the query needs to use a summary.
Summaries are not restricted to the cache data source. This means that when you create a summary you can choose the best data source alternative for that particular data (for example, to maximize colocation). You can also store the same data in multiple data sources so the optimizer can choose the best option for each case.
On the other hand:
Caching offers more refresh and management options, such as: implicit caching mode, incremental queries, atomic or non-atomic loads, selective invalidation.
Caching does not require a pre-processing phase to decide whether to use it or not. So, in cases where you know you always want to access a certain data in the same way it is better to continue using traditional cache, to avoid any unnecessary overhead.
To decide what is best for each case, take a look at the different abstraction layers you find when you model a logical architecture (See image below):
Base or physical layer, containing the original data sets.
Integration layer, containing the intermediate transformations and combinations needed to create the views exposed to consumers.
Semantic or business layer, containing the canonical views of your model representing your business entities. The views in this layer are typically exposed to data consumers, who can perform ad-hoc queries over them.
Reporting layer, containing pre-canned reports with some calculated metrics and final views you want to publish to certain clients.
You can see how the differences between caching and summaries make them more suitable for some of these modeling layers and not others. Therefore, the general recommendations are:
Create summaries over the canonical views of your model (semantic and business layer) that are exposed to data consumers such as business users or data scientists, which need to execute ad-hoc queries joining and aggregating different views in a self-service way.
Continue using caching in these scenarios:
For final, pre-canned reports, with calculated metrics ready to be consumed by users (Reporting layer). In these cases there are no further transformations and combinations beyond applying projections and filters, so you don’t need the flexibility of the summaries.
For intermediate views which are not directly exposed to the consumers. Having to explicitly reference cached views is not a problem in this case since only the developers use them directly (Integration layer), and caching offers more refresh and management options than summaries.
Recommendations When Using Summaries¶
Consider the following recommendations when creating a summary:
Store the summary in a data source that is either:
Close to the data (for instance, the data source containing the facts table), or
close to the Denodo servers (for instance, the cache data source).
In the query used for the definition of the summary:
Always include count(*) in the SELECT clause, as it can be necessary to calculate queries using different levels of aggregations.
Avoid including HAVING conditions. They can prevent the use of the summary for more aggregated queries.
If you have derived views built as layers of projections/selections over other views with no joins, like a projections/selections stack, in the summary definition you should use the top level views that you wil expect in queries. This is because if a query consist of an aggregation over a projections/selections stack, or an aggregation over several joins among views defined as projections/selections stacks, the query optimizer will try to match those summaries referencing the top level views referenced in those joins and aggregations. For example, if you have the view customer on top of bv_customer, and your queries will use customer more often than bv_customer, you should use customer in your summary definition.
Make sure there are primary keys and referential constraints defined for the views referenced in the query when they apply. These information can allow the optimizer to use the summary in more situations. For example, in cases where the summary is equivalent but contains an additional join, if the FK-PK relationship ensures it does not alter the cardinality.
Gather the statistics for the summaries if cost-based optimizations are enabled:
The cost optimizer will decide between summaries and other alternatives using these statistics.
To gather the statistics of a summary, open the summary, go to the Options tab and then, to Statistics.
Consider creating indexes to ensure the best performance. See section Creating an Index in the Underlying Database of a Remote Table or a Summary for more details.
Keep the summary up to date by creating a VDPDataLoad using the Denodo Scheduler Module.
If your model follows the single view approach, in which there is a view that combines the facts table with all its dimensions, always create the summaries over that single view.
If possible, use the Summary Recommendations tool. It automatically recommends summaries based on your workload.