Summary Rewrite Optimization

This section describes how the Denodo query optimizer uses the summary views to rewrite queries and some best practices.

Let us start with a few examples to understand what is the summary rewrite and how it works. To illustrate these examples, we will use entities from the TPC-DS benchmark, which models the decision support functions of a retail product supplier. These entities include “store_sales” containing the sales information, “date_dim” representing the date dimension, “store”, and so on.

To keep things simple let us consider we create a summary containing the total sales aggregated by the store id and the date id as follows:

SELECT ss_sold_date_sk as sold_date_id, ss_store_sk as store_id, sum(ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales
FROM store_sales
GROUP by ss_sold_date_sk, ss_store_sk

Then the query optimizer can decide to use that partial information to speed up different queries. Let’s see a couple of examples:

Total sales in 2019

SELECT sum(store_sales.ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales
FROM store_sales INNER JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
WHERE date_dim.d_year = 2019
../../../../../_images/summary_rewrite_trace_example1.png

For this query, the query optimizer will use the total sales by date and store that is “cached” in the summary and just apply the filter to get the year 2019 and the final aggregation.

Total sales by store division

SELECT s_division_name,  sum(store_sales.ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales
FROM store_sales INNER JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
JOIN store ON(store.s_store_sk = store_sales.ss_store_sk)
GROUP BY s_division_name
../../../../../_images/summary_rewrite_trace_example2.png

For this query, the query optimizer will use the same summary. It just needs to join the partial sales by store that are materialized in the summary, join that information with the “store” data to obtain the store division and perform the final aggregation by the division name.

These are just two examples of possible queries that could take advantage of this feature but there are many more. As you can see it is a very powerful resource as it is very flexible and can be used in many situations.

Main Use Cases

The most important use case for summary views is the one requiring self-service queries on a logical data warehouse. This is a situation where users request data that do not require to be full real time and they will access canonical views. Canonical views are views that correspond with business entities like sales, orders, customer, not auxiliary cached views or cached reports.

In this environment, summary views increase the performance of the query and reduce the load of the data sources involved in the queries.

../../../../../_images/summary_rewrite_self_service_queries.png

Summaries also have a big impact is on hybrid environments. For example, during an ongoing migration process to the cloud (see image below). In this scenario, some data have been already migrated to the cloud, but not all. Thanks to the abstraction layer that Denodo provides, client applications are able to operate the same way without knowing these changes are happening. In this hybrid environment, it is a good idea to create some summaries in the cloud database combining data stored in some of the data sources that have not been migrated yet. By doing this, Virtual DataPort will be able to run most of the queries without having to access and transfer data from the on-premise databases.

../../../../../_images/summary_rewrite_migration_use_case.png

Another example of hybrid environment would be a multi-cloud system. If you have data in different cloud providers, it is a good idea to have also a Denodo server/s on each cloud so the data is processed as close to the data as possible and minimize data transfer. In this scenario, it is also useful to create summaries on each location, containing common combinations with data in other locations. This way you can avoid accessing the other cloud or the on-premise data.

../../../../../_images/summary_rewrite_multicloud.png

Caching vs Summary Rewriting

Summaries have some advantages over the traditional cache:

  • Users need to explicitly reference the cached view in order to use the precomputed data. The query optimizer analyzes and rewrites the incoming queries to take advantage of the data in the summaries whenever possible. It does so transparently, without user intervention; the user does not need to specify that the query needs to use a summary.

  • Summaries are not restricted to the cache data source. This means that when you create a summary you can choose the best data source alternative for that particular data (for example, to maximize colocation). You can also store the same data in multiple data sources so the optimizer can choose the best option for each case.

On the other hand:

  • Caching offers more refresh and management options, such as: implicit caching mode, incremental queries, atomic or non-atomic loads, selective invalidation.

  • Caching does not require a pre-processing phase to decide whether to use it or not. So, in cases where you know you always want to access a certain data in the same way it is better to continue using traditional cache, to avoid any unnecessary overhead.

To decide what is best for each case, take a look at the different abstraction layers you find when you model a logical architecture (See image below):

../../../../../_images/cache_vs_summaries.png
  • Base or physical layer, containing the original data sets.

  • Integration layer, containing the intermediate transformations and combinations needed to create the views exposed to consumers.

  • Semantic or business layer, containing the canonical views of your model representing your business entities. The views in this layer are typically exposed to data consumers, who can perform ad-hoc queries over them.

  • Reporting layer, containing pre-canned reports with some calculated metrics and final views you want to publish to certain clients.

You can see how the differences between caching and summaries make them more suitable for some of these modeling layers and not others. Therefore, the general recommendations are:

  1. Create summaries over the canonical views of your model (semantic and business layer) that are exposed to data consumers such as business users or data scientists, which need to execute ad-hoc queries joining and aggregating different views in a self-service way.

  2. Continue using caching in these scenarios:

    • For final, pre-canned reports, with calculated metrics ready to be consumed by users (Reporting layer). In these cases there are no further transformations and combinations beyond applying projections and filters, so you don’t need the flexibility of the summaries.

    • For intermediate views which are not directly exposed to the consumers. Having to explicitly reference cached views is not a problem in this case since only the developers use them directly (Integration layer), and caching offers more refresh and management options than summaries.

Recommendations When Using Summaries

Consider the following recommendations when creating a summary:

  • Store the summary in a data source that is either:

    1. Close to the data (for instance, the data source containing the facts table), or

    2. Or close to the Denodo servers (for instance, the cache data source).

  • In the query of the definition of the summary:

    • Always include count(*) in the SELECT, as it can be necessary to calculate queries using different levels of aggregations.

    • Avoid including HAVING conditions. They can prevent the use of the summary for more aggregated queries.

  • Gather the statistics for the summaries if cost-based optimizations are enabled:

    • The cost optimizer will decide between summaries and other alternatives using these statistics.

    • To gather the statistics of a summary, open the summary, go to the Options tab and then, to Statistics.

  • Consider creating indexes to ensure the best performance. See section Creating an Index in the Underlying Database of a Remote Table or a Summary for more details.

  • Keep the summary up to date by creating a Scheduler task and run it periodically.

  • If your model follows the single view approach, in which there is a view that combines the facts table with all its dimensions, always create the summaries over that single view.