Best Practices to Maximize Performance III: Caching

Applies to: Denodo 8.0 , Denodo 7.0
Last modified on: 31 Mar 2021
Tags: Optimization Performance

Download document

You can translate the document:

Overview

This document is the third part of a series of articles about how to achieve the best performance while working with the Denodo Platform.

These articles will provide best practices on a variety of different topics such as the way you should build your model, the different configuration settings you should use and the caching method that best suits your scenario. This guidance is especially important in analytic use cases which require integrating large volumes of data from different sources.

In the first document of this series you can find general guidelines on how to design views that are optimal from a performance point of view. The second document contains details regarding the meta-information and configuration settings connected to the Query Optimizer. This third article is aimed at Data Architects and Developers interested in taking advantage of the Denodo caching capabilities. It contains recommendations for different aspects of the Cache Module such as, how to choose the cache database, how to decide what views to cache, or what is the best cache mode and refresh strategy for each particular use case.

Apart from these caching capabilities, Denodo contains smart query acceleration techniques based on pre-stored data using an element called Summary. This technique is beyond the scope of this document but you can visit the section ‘Summaries vs Cache’ to understand the differences and help you decide between both depending on your scenario.

Caching

Caching in Denodo can be used for several purposes, such as enhancing performance, protecting data sources from costly queries, and/or reusing complex data combinations and transformations. For this, Denodo Virtual DataPort allows configuring a Cache Engine to store local copies of the data retrieved from the data sources. Denodo can use as a caching engine almost any relational database (you can check the list here). Although Denodo ships with its own caching database (based on Apache Derby) for testing purposes, it is not recommended to use it in production.

This document describes best practices and configuration guidelines for caching in Denodo.

You can find documentation about how to configure the Denodo Virtual DataPort Cache in the section ‘Configuring the Cache’ of the Virtual DataPort Administration Guide.

Selecting the Caching Engine Database

In order to choose the database caching engine, we can consider two types of environments: analytical and operational.

Analytical environments are built to support decision-making. Analysis in this type of environments is based, mainly, on historical data, which makes the use of a cache system useful in many cases.

Taking into account the complexity of analytical environments, it is very likely that the cache system will manage large volumes of data; therefore, we recommend the use of a Massive Parallel Processing (MPP) system as a caching engine. Virtual DataPort supports MPP databases like Netezza, Teradata, HP Vertica and SQL-on-Hadoop MPP query engines like Spark, Impala, Hive or Presto. If an MPP system is not available in your scenario, the recommendation is to use the most powerful system available.

Differently from analytical environments, operational environments focus on the day-to-day operations so queries do not usually involve massive data volumes. This means Massive Parallel Processing systems are typically not needed, and you can use a conventional database instead.

If your scenario involves both types of environments, analytical and operational, you can create different virtual databases in Denodo to group the views of each one. This will allow you to configure a different caching engine for each scenario.

Finally, as a general rule, to obtain the best performance it is critical to ensure a fast data transfer speed between the VDP server and the cache system. To achieve that, they should be in the same network segment.

Co-Locating the Cache with a Data Source 

If it is properly scaled to support the workload, you can consider using one of your main data sources as the Denodo caching engine. This will allow Denodo to maximize query pushdown when combining cached data with other datasets in the same data source. Maximizing query pushdown can greatly enhance performance by minimizing network traffic and leveraging the power of the underlying data sources. This idea is especially attractive if the data source is an MPP system containing large volumes of data.

For example, suppose we have information about 300M sales in an Impala cluster and information about 2M customers in an Oracle database. Moreover, we also have extra information about these customers stored in a SaaS application.

Suppose we want to obtain the total sales by each customer, including information about the customers that is stored in both systems, the Oracle database and the SaaS application. The following figure shows the execution trace for this query when caching is not used:

As you can see in the execution trace above, Denodo Virtual DataPort has to access the three systems in order to resolve the query. However, since accessing the SaaS application is slow (data needs to go through a WAN), the query can take significant time to execute. To avoid this, we can configure caching in the extended_customer view. If we use the Impala cluster as Denodo caching engine, then the optimizer can automatically reorder the join operation so the join between the sales view and the extended_customer view can be pushed down to Impala, as shown in the execution trace below.

This way, Denodo Virtual DataPort gets the query results by accessing only two sources: Impala, configured as a data source and as the caching engine, and Oracle.

To configure Denodo Virtual DataPort to use an existing data source as caching engine follow the steps below:

  1. Configure the cache following the steps registered in the section ‘Configuring the Cache’ of the Virtual DataPort Administration Guide, considering that you have to select the option ‘Specify an existing data source as cache’. Once you have checked this option, you have to select the data source you want to use as the caching engine, as well as the virtual database where you have created this data source.

  1. Go to the Read & Write tab of the data source, impala_data_source in our example, and select the option ‘Specify custom catalog and schema’. Then you must click the refresh button  and select the catalog and schema that the cache will use to create the tables and indexes. Depending on the database, the lists Catalog or Schema can be disabled if the database only has catalogs and not schemas or vice versa.

Make sure the user configured to connect to the selected data source has enough privileges to create and drop tables and indexes in the selected catalog/schema. This user is the one specified in the connection tab of the data source configuration dialog.

It is important to notice that the query optimizer can apply its query pushdown techniques in the same data source even when the data is in different catalogs or schemas.

Co-Locating the Cache with the MPP Acceleration Engine

The co-location of data to maximize query pushdown can be further improved if you have a SQL-on Hadoop system like Impala, Presto or Spark as one of your main data sources, and you allow the Denodo query optimizer to also use it as MPP acceleration system (see the ‘MPP acceleration’ section of the Best Practices to Maximize Performance II: Configuring the Query Optimizer document).

It is important to notice that if one of the data sources in your environment is one of the supported systems for MPP acceleration (Impala, Spark or Presto), you can unify this recommendation with the one from the previous section and use such data source both as caching engine and as MPP acceleration system. This requires two additional steps:

  • Edit the data source to allow the optimizer to create temporary tables for query acceleration (Read & Write tab).

  • Enable the MPP acceleration and select the cache data source as the one to use (Administration -> Server Configuration -> Queries optimization).

Note that the ‘Use of the cache data source’ option will not be the recommended choice if you want to configure a different data source as cache for some particular virtual databases. In this scenario you must use the ‘Use JDBC data source’ option to establish the same source that is configured as cache.

If now we execute again the example from the previous section, the optimizer can decide that the customer view should be also moved to a temporary table in the MPP acceleration engine, so the whole query can be pushed down to the Impala system, as shown in the figure below. Notice that the Denodo optimizer will decide if this data movement is worth on a query-by-query basis using cost estimations.

Estimating the size of the cache database

There are several factors involved in estimating the space required for the cache, and they can be highly dependent on the characteristics of the selected database system. Most DBMS vendors (e.g., Oracle, SQLServer) provide specific tools and guidelines that you can use to estimate the required space. In this section we provide some information about the internals of the Denodo caching system, that should be taken into account during this process:

  • First of all, decide which views you should cache taking into consideration your particular environment. See the What Views should I cache? section for more information regarding this decision.
  • Once you have decided what views you will need to cache, you can estimate the raw data volumes to be stored by using the view statistics collected by Denodo, as well as the information included in the document ‘Cache database size estimate’.
  • After this initial estimate you should take into account that:
  • As a general norm, the allocated space should be higher than twice the size of the calculated data volume. This is because, as explained in the Cache maintenance section, expired/invalidated rows are deleted from the cache by a maintenance process executed periodically. Therefore, expired/invalidated rows will continue to consume space until the next execution of this process.
  • When using the cache engine also for MPP acceleration, keep in mind that the query optimizer can create temporary tables to accelerate some queries so you must keep enough space for these tables (these tables will be automatically removed when query execution ends).

Selecting the Cache Mode

Denodo Virtual DataPort has two main cache modes: partial and full. The former allows the cache to store some of the tuples of the view, unlike the latter where all the data of the view is stored in the cache.

Full vs Partial

The ‘Full cache mode allows delegating to the cache database complex operations (joins, unions, group by …) involving several views (even from different data sources) as long as those views have the cache enabled or they belong to the cache data source. On the other hand, complex operations between views configured with partial cache cannot be pushed to the cache database and therefore the data will be combined in Denodo. Since most queries in analytic environments involve performing join and group by operations across large data volumes, in general it is recommended to use the ‘Full’ mode when caching a view in them.

In turn, in operational scenarios the number of rows processed and the complexity of the operations are usually lower. Besides, the incorporation of new data in the data sources is done with relatively high frequency. Therefore, the ‘Partial’ cache mode can be a more attractive option.

Another difference between the two modes is that, when using the ‘full’ mode, the queries on the view will always use the cached data and will never hit the data source; the data source will be only accessed for refreshing the cache.

In turn, when using the ‘partial’ mode, Denodo will check if the cache contains the data required to answer the query and, in other case, will access the data source directly Therefore, if you want to avoid introducing additional workload in a data source (e.g. a critical operational system), the ‘full’ mode is recommended.

Finally, the ‘partial’ mode does not force you to have all the tuples of the view in the cache. Therefore, it is also well suited when the query capabilities of the data source prevent you to obtain all the data. This happens frequently with web services and other API-based data sources which do not offer a SQL query interface.

Full vs Full Incremental

Furthermore, there is a subtype of the ‘Full’ mode, called ‘Incremental’. The ‘Incremental’ mode allows you to get the data from the cache and merge it at query time with the most recent data from the source. Data retrieval from the source is based on a condition like ‘last_modified > @LASTCACHEREFRESH’, where last_modified refers to a column in the base view representing the last time the row was updated, and @LASTCACHEREFRESH is a variable that Denodo will fill in at runtime with the last time the cache was refreshed.

The main benefit of this mode is that the queries will return more up to date results without needing to retrieve again the full result set from the data source, just the rows that were added/changed since the last cache refresh. This means that queries are still sent to the source in real-time but network traffic is minimized.

However, since the 'Incremental' mode implies joining the cached data and the new/updated rows from the original data source, this can prevent the optimizer from completely delegating complex operations such as joins or aggregations to the cache, losing one of the main benefits of the ‘full’ mode in analytic scenarios.  

Therefore, the incremental mode is usually a good solution to cache base views that are created over high latency sources, such as SaaS applications, when we also want to obtain up-to-date data, and we do not need to delegate complex operations to the cache database. Check the ‘Incremental Queries’ guide for additional information and examples of the ‘incremental queries’ option.

Finally, notice that when a query needs data from the incremental cache, Denodo will access both systems to merge the data but it will not load the cache. This means that you still need to schedule periodic cache loads (see section Refreshing the Cache).

Partial with “Explicit” Loads vs Partial with “Implicit” Loads

As explained previously, the “Partial” cache mode is typically used in operational scenarios so result sets are typically relatively small.

When using the 'explicit' option of this mode, you need to configure cache refresh processes. This option is a good fit when a relatively small subset of the data is queried much more frequently than the rest and, in addition, it is easy to predict which subset is the most frequently queried. Product databases constitute a typical example: in many cases, a few popular products are the subject of the majority of queries, reflecting the popular “80-20” rule. Explicitly preloading the data of these popular products can be very effective.

In turn if the ‘implicit’ option is selected, the cache will be automatically refreshed as the users execute queries, so no additional configuration is needed beyond setting a suitable Time to live (TTL) for the data. The 'implicit' option performs especially well when the applications show high temporal locality (recently queried information has a higher probability of being queried again). For instance, in many web applications it is frequent that the user visits the same page several times during the same session. Implicit loads also avoid the administrator having to decide in advance what data should be cached when it is not easy to predict which data items will be ‘popular’.

Keep also in mind that 'implicit' loads insert in the cache the results of all the queries that miss the cache. This can produce significant workload in the cache database, especially when the expiration times are short. At peak load periods, this can also increase the number of blockings in the cache database, and therefore increase cache read times. That is why, when both options are adequate for the business case, explicit loads are preferred.

For more details about how to configure cache modes in Denodo, you can also check the sections ‘Full Mode’, ‘Partial Mode’ and ‘Incremental Mode of the Virtual DataPort Administration Guide. 

Refreshing the Cache

Cached data is liable to become out of date; therefore, you may need to establish a data refresh plan. Since you will have to execute the refresh plan periodically, we recommend using the Denodo Scheduler.

Partial mode

As explained in the previous section, cache refresh processes are not needed when you are using the ‘implicit’ option, so you only need to configure them when using the 'explicit' option.  In that case, you can use the ‘VDP Cache’ tasks of  Denodo Scheduler to periodically refresh the cache content.

In both cases, you can also selectively invalidate cached data by using the ‘VDP Cache’ Scheduler tasks or the ‘cache_invalidate’ option of the CONTEXT clause (see the section ‘CONTEXT Clause’ of the Virtual DataPort Administration Guide).

Check the section ‘Partial Cache’ of the Virtual DataPort Administration Guide for more information.

Full mode

Using the ‘full’ mode, you always have to fill in the cache with explicit loads. After that, you need to refresh the data periodically.

Denodo provides several incremental cache refresh methods that you should use whenever possible. Nevertheless, these strategies are subject to certain restrictions described in the subsection below. For views which do not verify these restrictions, you can configure full refresh processes using Denodo Scheduler, invalidating all cached data and loading it again. Given that these operations can involve large datasets you can check the Optimizing Cache Load Processes section for tips about optimizing the performance of these processes.

Incremental strategies

Incremental cache refresh processes minimize the amount of data that needs to be moved from the data source to the cache. Nevertheless, they can only be applied when certain requirements are met. This section describes the different options available and the restrictions that apply to each one.

 

An incremental refresh can be accomplished by Denodo using one of these two methods:

  1. Invalidating the rows matching a WHERE condition and then adding the result of the query to the cache database (see the section ‘Loading the Cache Incrementally’ of the Virtual DataPort Administration Guide).
  • For instance, suppose a view ‘customer’ whose content is cached. This method can be used to update the cached data for some particular customers of the view.
  • This method is useful for incremental refresh processes based on content, where you refresh the cache every time data changes. 

  1. Using the Denodo Incremental Cache Load Stored Procedure. This component can be used to retrieve from the data source the new/updated rows since the last time the cache was refreshed, and to merge them with the cached content.
  • This component is available in the Downloads > DenodoConnects section of the Support Site, and includes a detailed user guide.
  • This method is useful for incremental refresh processes based on time, where you schedule regular cache loads to refresh the data that have changed in that period.

Incremental refresh based on content

This incremental method refreshes the cached data every time it changes in the original source. Therefore, it requires that you can receive notifications when content changes in the data source and that you can identify the altered data with a WHERE condition.  

For instance, if there exists a process sending a JMS message every time the information about a customer changes, you can use a JMS listener in Denodo to obtain the customer identifier, and then send a query to Denodo to update the changed data using a query like:

SELECT * FROM customer WHERE customer_id = '@JMSEXPRESSION'

CONTEXT (
'cache_preload' = 'true'
, 'cache_invalidate' = 'matching_rows'
, 'cache_wait_for_load' = 'true'
, 'cache_return_query_results' = 'false')

Where @JMSEXPRESSION is a special variable containing the JMS message (see  the section ‘Create a New JMS Listener’ of the Virtual DataPort Administration Guide).

This will work if the JMS message directly provides the identifier of the changed row. If the identifier of the message is encoded in some way, you may need to add an expression to the query to obtain the identifier. For instance, if the identifier is encoded inside an XML document, you can use the Denodo XPATH function to obtain it.

In the common case where the JMS message includes the complete changed record in XML or JSON format, we can use directly the received data. This would work as follows:

  1. Create a new XML or JSON base view using the ‘From Variable’ option to parse the message (check this document for a detailed example of how to use the ‘From Variable’ option). For instance, let’s assume this view is called ‘customer’ and receives the input message using the field called ‘xml_message’.
  2. The JMS Expression should then invoke this view, assigning the value '@JMSEXPRESSION' to the view field that receives the input message. For instance:

SELECT * FROM customer WHERE xml_message = '@JMSEXPRESSION'

CONTEXT (
'cache_preload' = 'true'
, 'cache_invalidate' = 'matching_rows'
, 'cache_wait_for_load' = 'true'
, 'cache_return_query_results' = 'false')

Notice that this strategy may result in multiple small cache loads, which can cause performance problems in the cache when changes are very frequent.

Incremental refresh based on time

This incremental method does not refresh the cache on every change. Instead, it schedules regular loads to refresh the data that have changed in that time window.

This type of refresh requires the view to include a column indicating when each row was last added/updated. Notice that this option can only be used with views where rows are never deleted. In other cases, the refresh process will add/update the changed rows but the cache will still contain the rows removed in the data source.

We can distinguish 3 different cases for time-based incremental refresh:

  • If the view contains a primary key and you are not using a Hadoop-based system (Hive, Impala, Spark, or Presto) as your caching engine, there are no additional restrictions. When configuring the Denodo Incremental Cache Load Stored Procedure, you can specify ‘last_modified > @LASTCACHEREFRESH’ as the condition for retrieving the new/updated rows from the data source, and the procedure will internally fill in the @LASTCACHEREFRESH variable with the last time the cache was refreshed for that view.

  • If the view contains a primary key but you are using a Hadoop-based system as the caching engine, then this method can only be used for views where rows are never updated in the data source. This is required because of the limitations to execute UPDATE statements in Hadoop-based systems. As long as these restrictions are satisfied (i.e. rows can be added but never updated/deleted), you can still use the Denodo Incremental Cache Load Stored Procedure in the same way as in the previous point

  • If the view does not contain a primary key then incremental refresh based on time is again only possible if rows in the view are never updated. This is because the primary key is needed to decide whether a row from the cache and a row from the source are the same. In this scenario you cannot use the Denodo Incremental Cache Load Stored Procedure, so you should execute a load query specifying a WHERE condition over the column indicating when the row was created.

Slowly changing dimensions

Views that are the result of modeling a slowly changing dimension deserve a special analysis:

  • If your queries only need to access the current data and you have followed the modeling recommendations in Denodo, you will have a derived view containing a WHERE condition to obtain only the current rows. In this case, you should cache this derived view and use the Denodo Incremental Cache Load Stored Procedure to carry out incremental refreshes over it.

  • If your queries also need to access the historical data, the best strategy is different depending on your approach to model slowly changing dimensions:
  • If your tables include columns like start_date and end_date to specify the period when a row was valid, then you can use the Denodo Incremental Cache Load Stored Procedure specifying a condition like ‘start_date > @LASTCACHEREFRESH OR end_date > @LASTCACHEREFRESH’. The first part of the condition will load the current data and the last part will load the modified data that has changed its status from current to historical.
  • If your tables only include a column like ‘current_flag’ which takes a value like ‘Y’ for the current value or ‘N’ for the old ones, you should follow the general recommendations of the ‘Full mode’ section.

Optimizing Cache Load Processes

In this section we review how to optimize performance when caching large datasets. This is especially important for initial cache loads and also for refreshing the cache when the incremental strategies described above are not applicable.  

Configuring Bulk Data Load

To store data in the cache database, Denodo uses by default the ‘batch insert’ capabilities offered by the JDBC driver of the database.  However, most of the databases also provide a proprietary interface, usually called bulk load API, to optimize loading big amounts of data. Note that typically these interfaces are worth using only when the number of rows to insert is in the range of tens of thousands or higher, since there is usually not performance improvement with a lower number of rows.

Although Virtual DataPort is capable of using bulk load APIs for the majority of databases, using this option may imply limitations on the atomicity of the cache loading operation (see section Deciding whether to use Atomic Cache Loads for more information).

You can see more information regarding bulk data load and how to configure it in the section ‘Bulk Data Load’ of the Virtual DataPort Administration Guide.

Deciding whether to use Atomic Cache Loads

By default, the following two tasks during cache loads are performed in a single database transaction: 1) invalidating existing data in the cache table and 2) changing the status of the new rows that have been inserted/updated to ‘valid’, so they become visible to client applications.

 

This ensures that client applications never see inconsistent data while cache loads are in progress: they see ‘old’ cache content until the new load process is completely finished and, at that point, they immediately see all the new cached content.

Nevertheless, this also comes with some disadvantages when caching large datasets:

  • Atomic loads can be slower. In atomic loads the process of invalidating previous rows and making visible the new ones needs to be done in a single transaction. If this operation involves large amounts of data, it can take significant time.  
  • Atomic loads can also require a significant amount of disk space for the transaction logs (UNDO and REDO logs) of the caching engine database, when the load/invalidation process involves large data volumes.

Therefore, Denodo also offers the possibility of executing non-atomic cache load processes (see section ‘Caching Very Large Data Sets’ of the Virtual DataPort Administration Guide).  This is the recommended option when you need to cache a large dataset and you can execute the cache load process at a moment where there will not be user queries.

Also notice that in caches configured over a Hadoop-based system it is necessary to use non-atomic loads because of limitations of the data source itself.

Using Parallel Cache Loads 

If the dataset to cache can be divided into disjoint subsets using query conditions, you can execute several cache loading processes in parallel to speed up the process.

For instance, suppose you want to cache a very large view which includes a column indicating the timestamp when the row was created.  Instead of performing the complete cache load in a single process, you can execute different cache loading processes in parallel, each process loading the rows created in a certain time period (e.g. one year).

Keep in mind that this strategy introduces additional workload on both the data sources and the caching engine. Therefore, an excessive degree of parallelism can negatively affect other applications.  This also means that parallelism will not always improve the performance of the cache loading process and may even degrade it.

Therefore, we recommend using parallelism only when strictly needed and using a low number of parallel processes (e.g. 2-4). You can use Denodo Scheduler to schedule and execute the different load processes

Summaries vs cache

Apart from the traditional cache, Denodo 8 includes smart query acceleration techniques using a new type of view called Summary. Summaries store common intermediate results that the query optimizer can then use as a starting point to accelerate analytical queries (visit section ‘Smart query acceleration using summaries’ at the end of the second document of these series to see some examples). Summaries have some advantages over the traditional cache:

  • Users need to explicitly reference the cached view in order to use the precomputed data. In turn, summaries are transparently used by the query optimizer, that analyzes and rewrites the incoming queries to take advantage of the data in the summary whenever possible.
  • Summaries are not restricted to the cache data source. This means that when you create a summary you can choose the best data source alternative for that particular data (to maximize colocation for example). You can also store the same data in multiple data sources so the optimizer can choose the best option for each case.

On the other hand:

  • Caching does not require a preprocessing phase to decide whether to use it or not. So in cases where you know you always want to access a certain data in the same way it is better to continue using tradicional cache, to avoid any unnecessary overhead.
  • Caching offers more refresh and management options, such as: implicit caching mode, incremental queries, atomic or non-atomic loads, selective invalidation.

In order to decide what is best for each case it is useful to take a look at the different abstraction layers you find when you model a logical architecture (See image below):

  • Base or physical layer, containing the original datasets
  • Integration layer, containing the intermediate transformations and combinations needed to create the views exposed to consumers
  • Semantic or business layer, containing the canonical views of your model representing your business entities. The views in this layer are typically exposed to data consumers, who can perform ad-hoc queries over them.
  • Reporting layer, containing pre-canned reports with some calculated metrics and final views you want to publish to certain clients.

You can see how the differences between caching and summaries make them more suitable for some of these modeling layers and not others. Therefore the general recommendation is to:

  • Continue using caching:
  • for final, pre-canned reports, with calculated metrics ready to be consumed by users (Reporting layer). In these cases there are no further transformations and combinations beyond applying projections and filters, so you don’t need the flexibility of the summaries.  
  • for intermediate views which are not directly exposed to the consumers. Having to explicitly reference cached views is not a problem in this case since only the developers use them directly (Integration layer), and caching offers more refresh and management options than summaries.
  • Create summaries over the canonical views of your model (semantic and business layer) exposed to data consumers such as business users or data scientists, which need to execute ad-hoc queries joining and aggregating different views in a self-service way.  

For further information about this smart query acceleration feature visit the ‘Summaries’ section in the Denodo Virtual DataPort Administration Guide.

What Views should I cache?

To improve the performance using the cache engine you need to assess when real time access and federation is not a good fit. A bad caching policy will not necessarily improve performance. For instance, if you cache data from a local database, you are simply copying data from a local system to another. In that case, queries will not be faster unless the caching system is more powerful than the original data source, or you increment the options for query pushdown by caching datasets from several data sources.

In general terms, it is recommended to cache:

  • Views that have been created over slow or high latency sources (e.g. SaaS data sources).
  • When you have a complex combination or transformation that you are going to use frequently or that it is going to be used by other views, you can create an intermediate view and cache its contents. For instance, let’s imagine you have customer information distributed in two tables. Let’s also suppose that in order to get the customer data in the desired format, you need to perform a join operation and a set of costly regular expressions in the key fields that take significant time. You may create an intermediate view ‘customer’ applying these operations and then cache it. All the subsequent requests where the view ‘customer’ is involved will hit the cache achieving a faster response. If a view is frequently combined with a dataset in the caching engine database (for instance because you are following the co-location recommendations in the section above), consider caching it to maximize query pushdown.
  • Views that have been created over sources with limited access or where you have to mitigate the impact of costly queries on them (e.g. operational sources).

Finally, take into account that since version 8.0 Denodo offers a new element for smart query acceleration called summaries. You can visit the previous section ‘Summaries vs cache’ to help you decide what is the best approach to use for each case.

Configuring Caching for Partitioned Unions

In partitioned UNION views, where the data for the facts table is located in several different systems, Denodo suggests the creation of an intermediate selection view over each partition where you provide information about the partition criterion, and then create the UNION of these intermediate selections. This way, Denodo will only access the necessary partitions based on the conditions of each query. This optimization is called “Branch pruning”.

Therefore, if you decide to cache the partitioned UNION, we recommend to cache the views under the intermediate selections instead of the union view so you can still take advantage of the automatic pruning when you access the cached data.

For example, in a retailer company you can have information about sales in two systems:

  • An enterprise data warehouse that contains the sales data from the current year
  • A data lake based on Hadoop that contains the sales data from previous years

In order to create a partitioned union with the unified sales information you must:

First create a selection view over the sales data from the current year, sales_new_selection over sales_new, and another selection view over the sales data from previous years, sales_old_selection over sales_old. In both cases you should add a WHERE condition with the partition criterion.

The next step is to create a canonical view, sales, as the union of sales_new_selection and sales_old_selection.

The canonical view, sales, containing all the sales information may be very big and if you cache it you will not benefit from the “Branch Pruning” optimization. If on the other hand, you cache the individual views (sales_new and sales_old in our example) depending on the query Denodo can: access just sales_new (for queries asking for data from this year), or access just sales_old (for queries asking for data from past years), or push the UNION to the cache system if it need to get data from both partitions.

This approach also allows different refresh rates for the views at each branch of the union. This can be useful in scenarios where you have historical and current data, since the latter may need a higher refresh rate.

However, management may be more complicated. For instance, you may need a different cache load process for each branch and when you want to create an index you may need to create it in all the branches.  

For more information about removing redundant branches check the section ‘Removing Redundant Branches of Queries (Partitioned Unions)’ of the Virtual DataPort Administration Guide.

Adding Indexes to the Cache

Indexes improve the speed of data retrieval operations on database tables. When enabling the cache for a view, the Cache Module creates a table in the cache database that will store the cached data of this view. You can define cache indexes for this view to take advantage of the indexes benefits (see ‘Cache indexes’ in the section ‘Configuring the Cache of a View’ of the Virtual DataPort Administration Guide).

Take into account that Denodo executes a CREATE INDEX statement to create the indexes in the cache database. This statement will create an index with the type considered by default for the database specified as cache. In most databases, this is a B-Tree index. The guidelines to define what indexes to create in the cache are the same as with any other table in a relational database.

Cache maintenance

When the content of a cached view is invalidated or reaches its time to live, expired/invalidated rows will not appear in query results. Nevertheless, Denodo does not immediately delete them from the physical tables of the database. In turn, Denodo provides a cache maintenance process that periodically removes expired/invalidated rows. It is important to properly configure the execution of this process to avoid performance degradation and excessive storage consumption in the database.

The general settings of the cache configuration allow you to activate the ‘Maintenance’ option. This option will execute a maintenance task periodically but, in production environments, we strongly recommend setting this option to ‘Off’ and, instead, programming the execution of the CLEAN_CACHE_DATABASE predefined stored procedure using the Denodo Scheduler. This has the advantage of allowing you to decide the exact time when the maintenance process is run. This  procedure deletes the invalid or expired rows from the tables that hold the cached data of each view, and also the tables created as temporary (see the section ‘Temporary Tables’ of the Virtual DataPort VQL Guide).

For performance reasons, the execution of a maintenance task should be at periods where the workload in the Server and the cache database is expected to be low. One daily execution is usually enough.

Questions

Ask a question
You must sign in to ask a question. If you do not have an account, you can register here

Featured content

DENODO TRAINING

Ready for more? Great! We offer a comprehensive set of training courses, taught by our technical instructors in small, private groups for getting a full, in-depth guided training in the usage of the Denodo Platform. Check out our training courses.

Training