This document is the third part of a series of articles about how to achieve the best performance while working with the Denodo Platform.
These articles will provide best practices on a variety of different topics such as the way you should build your model, the different configuration settings you should use and the caching method that best suits your scenario. This guidance is especially important in analytic use cases which require integrating large volumes of data from different sources.
In the first document of this series you can find general guidelines on how to design views that are optimal from a performance point of view. The second document contains details regarding the meta-information and configuration settings connected to the Query Optimizer. This third article is aimed at Data Architects and Developers interested in taking advantage of the Denodo caching capabilities. It contains recommendations for different aspects of the Cache Module, such as how to choose the cache database, how to decide which views to cache, and which cache mode and refresh strategy best fit each particular use case.
Apart from these caching capabilities, Denodo contains smart query acceleration techniques based on pre-stored data using an element called Summary. This technique is beyond the scope of this document but you can visit the section ‘Summaries vs Cache’ to understand the differences and help you decide between both depending on your scenario.
Caching in Denodo can be used for several purposes, such as enhancing performance, protecting data sources from costly queries, and/or reusing complex data combinations and transformations. For this, Denodo Virtual DataPort allows configuring a Cache Engine to store local copies of the data retrieved from the data sources. Denodo can use as a caching engine almost any relational database (you can check the list here). Although Denodo ships with its own caching database (based on Apache Derby) for testing purposes, it is not recommended to use it in production.
This document describes best practices and configuration guidelines for caching in Denodo.
You can find documentation about how to configure the Denodo Virtual DataPort Cache in the section ‘Configuring the Cache’ of the Virtual DataPort Administration Guide.
In order to choose the database caching engine, we can consider two types of environments: analytical and operational.
Analytical environments are built to support decision-making. Analysis in this type of environment is based mainly on historical data, which makes the use of a cache system useful in many cases.
Taking into account the complexity of analytical environments, it is very likely that the cache system will manage large volumes of data; therefore, we recommend the use of a Massively Parallel Processing (MPP) system as a caching engine. Virtual DataPort supports MPP databases like Netezza, Teradata and HP Vertica, and SQL-on-Hadoop MPP query engines like Spark, Impala, Hive or Presto. If an MPP system is not available in your scenario, the recommendation is to use the most powerful system available.
Unlike analytical environments, operational environments focus on day-to-day operations, so queries do not usually involve massive data volumes. This means MPP systems are typically not needed, and you can use a conventional database instead.
If your scenario involves both types of environments, analytical and operational, you can create different virtual databases in Denodo to group the views of each one. This will allow you to configure a different caching engine for each scenario.
Finally, as a general rule, to obtain the best performance it is critical to ensure a fast data transfer speed between the VDP server and the cache system. To achieve that, they should be in the same network segment.
If one of your main data sources is properly scaled to support the additional workload, you can consider using it as the Denodo caching engine. This will allow Denodo to maximize query pushdown when combining cached data with other datasets in the same data source. Maximizing query pushdown can greatly enhance performance by minimizing network traffic and leveraging the power of the underlying data sources. This idea is especially attractive if the data source is an MPP system containing large volumes of data.
For example, suppose we have information about 300M sales in an Impala cluster and information about 2M customers in an Oracle database. Moreover, we also have extra information about these customers stored in a SaaS application.
Suppose we want to obtain the total sales by each customer, including information about the customers that is stored in both systems, the Oracle database and the SaaS application. The following figure shows the execution trace for this query when caching is not used:
As you can see in the execution trace above, Denodo Virtual DataPort has to access the three systems in order to resolve the query. However, since accessing the SaaS application is slow (data needs to go through a WAN), the query can take significant time to execute. To avoid this, we can configure caching in the extended_customer view. If we use the Impala cluster as Denodo caching engine, then the optimizer can automatically reorder the join operation so the join between the sales view and the extended_customer view can be pushed down to Impala, as shown in the execution trace below.
This way, Denodo Virtual DataPort gets the query results by accessing only two sources: Impala, configured as a data source and as the caching engine, and Oracle.
To configure Denodo Virtual DataPort to use an existing data source as caching engine follow the steps below:
Make sure the user configured to connect to the selected data source has enough privileges to create and drop tables and indexes in the selected catalog/schema. This user is the one specified in the connection tab of the data source configuration dialog.
It is important to notice that the query optimizer can apply its query pushdown techniques in the same data source even when the data is in different catalogs or schemas.
The co-location of data to maximize query pushdown can be further improved if you have a SQL-on-Hadoop system like Impala, Presto or Spark as one of your main data sources, and you allow the Denodo query optimizer to also use it as MPP acceleration system (see the ‘MPP acceleration’ section of the Best Practices to Maximize Performance II: Configuring the Query Optimizer document).
Note that if one of the data sources in your environment is one of the supported systems for MPP acceleration (Impala, Spark or Presto), you can unify this recommendation with the one from the previous section and use that data source both as caching engine and as MPP acceleration system. This requires two additional steps:
Note that the ‘Use of the cache data source’ option will not be the recommended choice if you want to configure a different data source as cache for some particular virtual databases. In this scenario you must use the ‘Use JDBC data source’ option and select the same data source that is configured as cache.
If now we execute again the example from the previous section, the optimizer can decide that the customer view should be also moved to a temporary table in the MPP acceleration engine, so the whole query can be pushed down to the Impala system, as shown in the figure below. Notice that the Denodo optimizer will decide if this data movement is worthwhile on a query-by-query basis using cost estimations.
There are several factors involved in estimating the space required for the cache, and they can be highly dependent on the characteristics of the selected database system. Most DBMS vendors (e.g., Oracle, SQLServer) provide specific tools and guidelines that you can use to estimate the required space. In this section we provide some information about the internals of the Denodo caching system that should be taken into account during this process:
Denodo Virtual DataPort has two main cache modes: partial and full. The former stores only some of the tuples of the view in the cache, while the latter stores all the data of the view.
The ‘Full’ cache mode allows delegating to the cache database complex operations (joins, unions, group by …) involving several views (even from different data sources) as long as those views have the cache enabled or they belong to the cache data source. On the other hand, complex operations between views configured with partial cache cannot be pushed to the cache database and therefore the data will be combined in Denodo. Since most queries in analytic environments involve performing join and group by operations across large data volumes, in general it is recommended to use the ‘Full’ mode when caching a view in them.
By contrast, in operational scenarios the number of rows processed and the complexity of the operations are usually lower. In addition, new data is incorporated into the data sources relatively frequently. Therefore, the ‘Partial’ cache mode can be a more attractive option.
Another difference between the two modes is that, when using the ‘full’ mode, the queries on the view will always use the cached data and will never hit the data source; the data source will be only accessed for refreshing the cache.
By contrast, when using the ‘partial’ mode, Denodo will check whether the cache contains the data required to answer the query and, otherwise, will access the data source directly. Therefore, if you want to avoid introducing additional workload in a data source (e.g. a critical operational system), the ‘full’ mode is recommended.
Finally, the ‘partial’ mode does not force you to have all the tuples of the view in the cache. Therefore, it is also well suited when the query capabilities of the data source prevent you from obtaining all the data. This happens frequently with web services and other API-based data sources which do not offer a SQL query interface.
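The cache mode of a view is normally configured from the administration tool, but it can also be set through VQL. The following is a hedged sketch (the view name is hypothetical, and the exact syntax may vary between Denodo versions, so check the VQL Guide for your installation):

```sql
-- Enable 'full' cache mode on a view (view name is hypothetical)
ALTER VIEW customer CACHE FULL;

-- Alternatively, switch to 'partial' mode, or disable caching entirely
ALTER VIEW customer CACHE PARTIAL;
ALTER VIEW customer CACHE OFF;
```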
Furthermore, there is a subtype of the ‘Full’ mode, called ‘Incremental’. The ‘Incremental’ mode allows you to get the data from the cache and merge it at query time with the most recent data from the source. Data retrieval from the source is based on a condition like ‘last_modified > @LASTCACHEREFRESH’, where last_modified refers to a column in the base view representing the last time the row was updated, and @LASTCACHEREFRESH is a variable that Denodo will fill in at runtime with the last time the cache was refreshed.
The main benefit of this mode is that queries will return more up-to-date results without needing to retrieve the full result set from the data source again, just the rows that were added or changed since the last cache refresh. This means that queries are still sent to the source in real time but network traffic is minimized.
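Conceptually, a query over a view in ‘incremental’ mode behaves like the following merge. This is an illustrative sketch, not the literal plan Denodo generates; the table and column names are hypothetical:

```sql
-- Rows added or changed in the source since the last cache refresh...
SELECT * FROM customer_source
WHERE last_modified > '@LASTCACHEREFRESH'
UNION ALL
-- ...merged with the cached rows that were not modified since then
SELECT * FROM customer_cache
WHERE customer_id NOT IN (
    SELECT customer_id FROM customer_source
    WHERE last_modified > '@LASTCACHEREFRESH'
);
```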
However, since the 'Incremental' mode implies joining the cached data and the new/updated rows from the original data source, this can prevent the optimizer from completely delegating complex operations such as joins or aggregations to the cache, losing one of the main benefits of the ‘full’ mode in analytic scenarios.
Therefore, the incremental mode is usually a good solution to cache base views that are created over high latency sources, such as SaaS applications, when we also want to obtain up-to-date data, and we do not need to delegate complex operations to the cache database. Check the ‘Incremental Queries’ guide for additional information and examples of the ‘incremental queries’ option.
Finally, notice that when a query needs data from the incremental cache, Denodo will access both systems to merge the data but it will not load the cache. This means that you still need to schedule periodic cache loads (see section ‘Refreshing the Cache’).
As explained previously, the “Partial” cache mode is typically used in operational scenarios so result sets are typically relatively small.
When using the 'explicit' option of this mode, you need to configure cache refresh processes. This option is a good fit when a relatively small subset of the data is queried much more frequently than the rest and, in addition, it is easy to predict which subset is the most frequently queried. Product databases constitute a typical example: in many cases, a few popular products are the subject of the majority of queries, reflecting the popular “80-20” rule. Explicitly preloading the data of these popular products can be very effective.
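For example, assuming a product view where the ‘popular’ subset is known in advance, an explicit preload could look like the query below. The view, column and value are hypothetical; ‘cache_preload’ is the CONTEXT option that stores the query results in the cache:

```sql
-- Explicitly preload the most frequently queried products into the partial cache
SELECT * FROM product
WHERE category = 'best_sellers'
CONTEXT('cache_preload' = 'true');
```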
If, instead, the ‘implicit’ option is selected, the cache will be automatically refreshed as users execute queries, so no additional configuration is needed beyond setting a suitable time to live (TTL) for the data. The 'implicit' option performs especially well when the applications show high temporal locality (recently queried information has a higher probability of being queried again). For instance, in many web applications it is frequent that the user visits the same page several times during the same session. Implicit loads also free the administrator from having to decide in advance what data should be cached when it is not easy to predict which data items will be ‘popular’.
Keep in mind also that 'implicit' loads insert into the cache the results of all the queries that miss the cache. This can produce significant workload in the cache database, especially when the expiration times are short. At peak load periods, this can also increase lock contention in the cache database, and therefore increase cache read times. That is why, when both options are adequate for the business case, explicit loads are preferred.
Cached data is liable to become out of date; therefore, you may need to establish a data refresh plan. Since you will have to execute the refresh plan periodically, we recommend using the Denodo Scheduler.
As explained in the previous section, cache refresh processes are not needed when you are using the ‘implicit’ option, so you only need to configure them when using the 'explicit' option. In that case, you can use the ‘VDP Cache’ tasks of Denodo Scheduler to periodically refresh the cache content.
In both cases, you can also selectively invalidate cached data by using the ‘VDP Cache’ Scheduler tasks or the ‘cache_invalidate’ option of the CONTEXT clause (see the section ‘CONTEXT Clause’ of the Virtual DataPort Administration Guide).
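For instance, a selective invalidation could be issued with a query like the one below (the view and condition are hypothetical):

```sql
-- Invalidate only the cached rows matching the WHERE condition
SELECT * FROM product
WHERE product_id = 1042
CONTEXT('cache_invalidate' = 'matching_rows');
```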
Check the section ‘Partial Cache’ of the Virtual DataPort Administration Guide for more information.
When using the ‘full’ mode, you always have to populate the cache with explicit loads. After that, you need to refresh the data periodically.
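A typical explicit load for a view in ‘full’ mode combines preloading with invalidation of the previous content. The sketch below uses the cache-related CONTEXT options for this purpose (the view name is hypothetical):

```sql
-- Refresh the full cache: invalidate the old content, load the new result set,
-- wait until the load finishes, and do not stream the rows back to the client
SELECT * FROM sales
CONTEXT('cache_preload' = 'true',
        'cache_invalidate' = 'all_rows',
        'cache_wait_for_load' = 'true',
        'cache_return_query_results' = 'false');
```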
Denodo provides several incremental cache refresh methods that you should use whenever possible. Nevertheless, these strategies are subject to certain restrictions described in the subsection below. For views that do not meet these restrictions, you can configure full refresh processes using Denodo Scheduler, invalidating all cached data and loading it again. Given that these operations can involve large datasets, you can check the ‘Optimizing Cache Load Processes’ section for tips on optimizing the performance of these processes.
Incremental cache refresh processes minimize the amount of data that needs to be moved from the data source to the cache. Nevertheless, they can only be applied when certain requirements are met. This section describes the different options available and the restrictions that apply to each one.
In order to refresh the cache incrementally it is necessary to identify the content to update. Depending on this criterion, we can classify the refresh strategies into two approaches:
Let’s see these two approaches in a bit more detail.
This incremental method consists of scheduling regular loads to refresh the data that has changed in that time window. To do that, you can create a new VDPCache job using Denodo Scheduler. In the extraction section of that job, you can specify either a column representing the last modification date, or a custom condition allowing Denodo to identify the rows that have been added or modified in the source. In addition, you need to select the invalidation criteria for the existing cache content.
The following decision tree can help you decide what are the best options for your scenario (see below for an explanation of the meaning of the main boxes of the diagram):
To use this option, you need to select the ‘Incremental update based on PK’ invalidation option.
d_year > getyear('@LAST_REFRESH_DATE') and d_moy > getmonth('@LAST_REFRESH_DATE')
As a general recommendation, if you need to invalidate rows, select ‘incremental update based on PK’ whenever possible, as it is the most efficient way to refresh the cache. Notice that this option is only shown if the view has an explicit primary key defined (see image below).
In addition, take into account that this option is only available if the cache data source supports UPDATE statements.
There are a few corner cases where this option is not supported even for data sources supporting UPDATE statements. In those cases you can still update based on the primary key following a different approach explained in the VDPCache extractor section of the Scheduler documentation. You can also visit section ‘Loading the Cache Incrementally’ of the Virtual DataPort Administration Guide for further details.
Views that are the result of modeling a slowly changing dimension deserve a special analysis:
This incremental method refreshes the cached data every time it changes in the original source. Therefore, it requires that you can receive notifications when content changes in the data source and that you can identify the altered data with a WHERE condition.
For instance, if there exists a process sending a JMS message every time the information about a customer changes, you can use a JMS listener in Denodo to obtain the customer identifier, and then update the cached data with a query like:
SELECT * FROM customer WHERE customer_id = '@JMSEXPRESSION'
Where @JMSEXPRESSION is a special variable containing the JMS message (see the section ‘Create a New JMS Listener’ of the Virtual DataPort Administration Guide).
This will work if the JMS message directly provides the identifier of the changed row. If the identifier of the message is encoded in some way, you may need to add an expression to the query to obtain the identifier. For instance, if the identifier is encoded inside an XML document, you can use the Denodo XPATH function to obtain it.
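For instance, assuming the JMS message is an XML document like <customerUpdate><id>42</id></customerUpdate>, the identifier could be extracted with the XPATH function. The XML structure and the path expression here are hypothetical, and the exact function signature should be checked in the VQL Guide:

```sql
-- Extract the customer identifier from the XML payload of the JMS message
SELECT * FROM customer
WHERE customer_id = xpath('@JMSEXPRESSION', '/customerUpdate/id/text()')
```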
In the common case where the JMS message includes the complete changed record in XML or JSON format, we can use the received data directly. This would work as follows:
SELECT * FROM customer WHERE xml_message = '@JMSEXPRESSION'
Notice that this strategy may result in multiple small cache loads, which can cause performance problems in the cache when changes are very frequent.
In this section we review how to optimize performance when caching large datasets. This is especially important for initial cache loads and also for refreshing the cache when the incremental strategies described above are not applicable.
To store data in the cache database, Denodo uses by default the ‘batch insert’ capabilities offered by the JDBC driver of the database. However, most databases also provide a proprietary interface, usually called a bulk load API, to optimize the loading of large amounts of data. Note that these interfaces are typically worth using only when the number of rows to insert is in the range of tens of thousands or higher, since there is usually no performance improvement with a smaller number of rows.
Although Virtual DataPort is capable of using bulk load APIs for the majority of databases, using this option may imply limitations on the atomicity of the cache loading operation (see section ‘Deciding whether to use Atomic Cache Loads’ for more information).
You can see more information regarding bulk data load and how to configure it in the section ‘Bulk Data Load’ of the Virtual DataPort Administration Guide.
By default, the following two tasks during cache loads are performed in a single database transaction: 1) invalidating existing data in the cache table and 2) changing the status of the new rows that have been inserted/updated to ‘valid’, so they become visible to client applications.
This ensures that client applications never see inconsistent data while cache loads are in progress: they see ‘old’ cache content until the new load process is completely finished and, at that point, they immediately see all the new cached content.
Nevertheless, this also comes with some disadvantages when caching large datasets:
Therefore, Denodo also offers the possibility of executing non-atomic cache load processes (see section ‘Caching Very Large Data Sets’ of the Virtual DataPort Administration Guide). This is the recommended option when you need to cache a large dataset and you can execute the cache load process at a moment where there will not be user queries.
Also notice that in caches configured over a Hadoop-based system it is necessary to use non-atomic loads because of limitations of the data source itself.
If the dataset to cache can be divided into disjoint subsets using query conditions, you can execute several cache loading processes in parallel to speed up the process.
For instance, suppose you want to cache a very large view which includes a column indicating the timestamp when the row was created. Instead of performing the complete cache load in a single process, you can execute different cache loading processes in parallel, each process loading the rows created in a certain time period (e.g. one year).
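For instance, assuming the view has a creation_date column, each parallel process could run a preload restricted to a disjoint time period. The view and column names are hypothetical; getyear is the same Denodo function used in the incremental condition example above:

```sql
-- Process 1: load the rows created in 2021
SELECT * FROM sales
WHERE getyear(creation_date) = 2021
CONTEXT('cache_preload' = 'true');

-- Process 2, running in parallel: load the rows created in 2022
SELECT * FROM sales
WHERE getyear(creation_date) = 2022
CONTEXT('cache_preload' = 'true');
```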
Keep in mind that this strategy introduces additional workload on both the data sources and the caching engine. Therefore, an excessive degree of parallelism can negatively affect other applications. This also means that parallelism will not always improve the performance of the cache loading process and may even degrade it.
Therefore, we recommend using parallelism only when strictly needed, and with a low number of parallel processes (e.g. 2-4). You can use Denodo Scheduler to schedule and execute the different load processes.
Apart from the traditional cache, Denodo 8 includes smart query acceleration techniques using a new type of view called Summary. Summaries store common intermediate results that the query optimizer can then use as a starting point to accelerate analytical queries (visit the section ‘Smart query acceleration using summaries’ at the end of the second document of this series to see some examples). Summaries have some advantages over the traditional cache:
On the other hand:
In order to decide what is best for each case it is useful to take a look at the different abstraction layers you find when you model a logical architecture (See image below):
You can see how the differences between caching and summaries make them more suitable for some of these modeling layers and not others. Therefore the general recommendation is to:
For further information about this smart query acceleration feature visit the ‘Summaries’ section in the Denodo Virtual DataPort Administration Guide.
To improve performance using the cache engine you need to assess when real-time access and federation is not a good fit. A bad caching policy will not necessarily improve performance. For instance, if you cache data from a local database, you are simply copying data from one local system to another. In that case, queries will not be faster unless the caching system is more powerful than the original data source, or caching datasets from several data sources increases the opportunities for query pushdown.
In general terms, it is recommended to cache:
Finally, take into account that since version 8.0 Denodo offers a new element for smart query acceleration called summaries. You can visit the previous section ‘Summaries vs cache’ to help you decide what is the best approach to use for each case.
In partitioned UNION views, where the data for the fact table is located in several different systems, Denodo suggests creating an intermediate selection view over each partition where you provide information about the partition criterion, and then creating the UNION of these intermediate selections. This way, Denodo will only access the necessary partitions based on the conditions of each query. This optimization is called “Branch Pruning”.
Therefore, if you decide to cache the partitioned UNION, we recommend caching the views under the intermediate selections instead of the union view, so you can still take advantage of the automatic pruning when you access the cached data.
For example, in a retailer company you can have information about sales in two systems:
In order to create a partitioned union with the unified sales information you must:
First create a selection view over the sales data from the current year, sales_new_selection over sales_new, and another selection view over the sales data from previous years, sales_old_selection over sales_old. In both cases you should add a WHERE condition with the partition criterion.
The next step is to create a canonical view, sales, as the union of sales_new_selection and sales_old_selection.
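As a sketch, the three views could be defined in VQL along these lines. The column names and the partition condition are hypothetical, and the exact syntax may vary slightly between Denodo versions:

```sql
-- Selection view over the current year's sales, with the partition condition
CREATE OR REPLACE VIEW sales_new_selection AS
SELECT * FROM sales_new WHERE getyear(sale_date) = getyear(now());

-- Selection view over past years' sales
CREATE OR REPLACE VIEW sales_old_selection AS
SELECT * FROM sales_old WHERE getyear(sale_date) < getyear(now());

-- Canonical partitioned union
CREATE OR REPLACE VIEW sales AS
SELECT * FROM sales_new_selection
UNION
SELECT * FROM sales_old_selection;
```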
The canonical view, sales, containing all the sales information may be very big, and if you cache it you will not benefit from the “Branch Pruning” optimization. If, on the other hand, you cache the individual views (sales_new and sales_old in our example), then depending on the query Denodo can: access just sales_new (for queries asking for data from this year), access just sales_old (for queries asking for data from past years), or push the UNION down to the cache system if it needs to get data from both partitions.
This approach also allows different refresh rates for the views at each branch of the union. This can be useful in scenarios where you have historical and current data, since the latter may need a higher refresh rate.
However, management may be more complicated. For instance, you may need a different cache load process for each branch, and when you want to create an index you may need to create it in all the branches.
For more information about removing redundant branches check the section ‘Removing Redundant Branches of Queries (Partitioned Unions)’ of the Virtual DataPort Administration Guide.
Indexes improve the speed of data retrieval operations on database tables. When enabling the cache for a view, the Cache Module creates a table in the cache database that will store the cached data of this view. You can define cache indexes for this view to take advantage of the benefits of indexes (see ‘Cache indexes’ in the section ‘Configuring the Cache of a View’ of the Virtual DataPort Administration Guide).
Take into account that Denodo executes a CREATE INDEX statement to create the indexes in the cache database. This statement will create an index with the type considered by default for the database specified as cache. In most databases, this is a B-Tree index. The guidelines to define what indexes to create in the cache are the same as with any other table in a relational database.
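For illustration, when you define a cache index on a customer_id column, Denodo issues a statement along these lines against the cache database. The schema, table and index names here are hypothetical; the real name of the cached table is generated by Denodo:

```sql
-- Statement executed in the cache database
-- (the index type is the database default, typically a B-Tree)
CREATE INDEX idx_customer_id ON cache_schema.cached_customer_table (customer_id);
```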
When the content of a cached view is invalidated or reaches its time to live, expired/invalidated rows will not appear in query results. Nevertheless, Denodo does not immediately delete them from the physical tables of the database. Instead, Denodo provides a cache maintenance process that periodically removes expired/invalidated rows. It is important to properly configure the execution of this process to avoid performance degradation and excessive storage consumption in the database.
The general settings of the cache configuration allow you to activate the ‘Maintenance’ option. This option will execute a maintenance task periodically but, in production environments, we strongly recommend setting this option to ‘Off’ and, instead, programming the execution of the CLEAN_CACHE_DATABASE predefined stored procedure using the Denodo Scheduler. This has the advantage of allowing you to decide the exact time when the maintenance process is run. This procedure deletes the invalid or expired rows from the tables that hold the cached data of each view, and also the tables created as temporary (see the section ‘Temporary Tables’ of the Virtual DataPort VQL Guide).
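The procedure can be invoked from a simple VQL statement, which you can then schedule as a Denodo Scheduler job at a convenient time:

```sql
-- Remove invalid and expired rows from the cache tables
-- (check the stored procedures section of the VQL Guide for optional parameters)
CALL CLEAN_CACHE_DATABASE();
```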
For performance reasons, maintenance tasks should be executed during periods when the workload on the Server and the cache database is expected to be low. One daily execution is usually enough.