Full Mode

When a view uses the Full cache mode, the data of the view is always retrieved from the cache database instead of from the source.

The main benefit of this mode over the partial cache is that complex operations (joins, unions, group by …) involving several views (even from different data sources) can be delegated to the cache database. Therefore, the performance of these operations is significantly improved.

With this cache mode, you have to execute a special query (the “Cache preload query”) to fill the cache of the view. The results of this query are inserted in the cache, and any query to this view will be answered with this data.

The “Cache preload query” is a regular query with the parameter 'cache_preload' = 'true' in the CONTEXT clause. For example:

SELECT *
FROM internet_inc
CONTEXT (
      'cache_preload' = 'true'
    , 'cache_invalidate' = 'all_rows'
    , 'cache_wait_for_load' = 'true'
    , 'cache_return_query_results' = 'false')

The “Cache preload query” has to meet the following conditions:

  1. It has to project all the fields of the view (i.e., SELECT * FROM ...).

  2. It cannot have aggregation clauses or aggregation functions; i.e., GROUP BY and HAVING are forbidden. It can have selection conditions (WHERE clause).

  3. It cannot have subqueries in the WHERE clause.

  4. It has to be executed over the view that has the full cache mode enabled, not over its underlying or derived views.
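As an illustration of these conditions, the following preload query is valid: it projects all the fields of the view and only filters with a plain WHERE clause (the view sales and the column region are hypothetical names used for this sketch). Adding a GROUP BY clause or a subquery to the WHERE clause would make it invalid:

SELECT *
FROM sales
WHERE region = 'EMEA'
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')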

You will have to execute these queries periodically. To do this, we recommend using the Denodo Scheduler module because it provides a wizard where you can select the views whose cache you want to preload, how often to do it, and the parameters used in these queries. Visit the VDPCache Extraction Section for information on how to configure the load process.

The following sections describe two ways to preload the cache of a view with “Full” cache:

  1. Loading the cache invalidating the existing data. This is the most common way of doing it. See below.

  2. Loading the cache incrementally. See Loading the Cache Incrementally.

Loading the Cache Invalidating the Existing Data

When the query has the parameters ('cache_preload' = 'true', 'cache_invalidate' = 'all_rows') in its CONTEXT clause, the Cache Engine marks all the cached data of the queried view as invalid and then stores the result of the query in the cache database.

For example, this query invalidates all the cached rows for the view customer and then caches the result of the query:

SELECT *
FROM customer
CONTEXT (
  'cache_preload' = 'true'
, 'cache_invalidate' = 'all_rows'
, 'cache_wait_for_load' = 'true'
, 'cache_return_query_results' = 'false')

Loading the Cache Incrementally

Incremental cache refresh processes minimize the amount of data that needs to be moved from the data source to the cache. Nevertheless, they can only be applied when certain requirements are met. This section describes the different options available and the restrictions that apply to each one.

In order to refresh the cache incrementally, it is necessary to identify the content to update. Depending on this criterion, we can classify the refresh into two approaches:

  1. Incremental refresh based on time: You can use Denodo Scheduler to configure a job that periodically updates the cache with the data added or modified in the data source since the last time the cache was refreshed. This is the most common scenario. To identify the new or changed rows, the cached view must contain some datetime information, like a column with the date of creation or last modification of each row. Notice that this option can only be used with views whose rows are never deleted. Otherwise, the refresh process will add or update the changed rows, but the cache will still contain the rows removed in the data source.

  2. Incremental refresh based on content: Used when it is not possible to identify the changes based on time, but it is possible to identify what elements to update based on other content. For example, let us say a department has been relocated to a different address. In this case, it is possible to invalidate the rows matching a WHERE condition that specifies the department id to update, and then add the result of the query to the cache database. This kind of update could be triggered automatically using JMS messages.
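For instance, the relocation scenario above could be handled with a query like the following sketch (the view department and the column dept_id are hypothetical names). The option 'cache_invalidate' = 'matching_rows', described later in this section, invalidates only the cached rows that match the WHERE condition before caching the result:

SELECT *
FROM department
WHERE dept_id = 42
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_invalidate' = 'matching_rows'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')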

Incremental Cache Loads Based on Time

The timestamp of the last cache load is accessible through the variable '@LAST_REFRESH_DATE', so you can use it in a WHERE condition to identify the new data you want to insert. For example, let us say that you have a table user with cache mode “full” and you need to update the data cached for this view to add the records inserted since the last cache preload. The following query reads from the original data source only the users added since the last cache refresh and inserts only those in the cache:

SELECT *
FROM user
WHERE start_date >= '@LAST_REFRESH_DATE'
CONTEXT (
  'cache_preload' = 'true'
, 'cache_wait_for_load' = 'true'
, 'cache_return_query_results' = 'false')

This option only inserts new data but it does not update or delete existing cached data.

Incremental Cache Loads Based on Content

There are scenarios where an incremental load based on time is not possible, but you still need to load the cache by executing more than one query. For example, when loading the cache for the first time, doing it with a single query could put too much load on the source. In these cases, add the parameter 'cache_preload' = 'true' to the CONTEXT clause of the query, but not 'cache_invalidate': the result of each query is stored in the cache database without invalidating the currently cached data. By executing several queries with this parameter, the cache can be loaded incrementally.

Adding this parameter is equivalent to executing the view from the “Execute” dialog of the view, selecting the “Store results in cache” check box and clearing the “Invalidate existing results” one. When doing this, make sure the results of the queries you use to load the cache incrementally do not overlap. For example, if you execute the following queries:

SELECT *
FROM phone_inc
WHERE pinc_id > 0
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')

And,

SELECT *
FROM phone_inc
WHERE pinc_id > -1
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')

The cache will store the data returned by both queries. Then, if you execute the query SELECT * FROM phone_inc, the result will be the union of these queries, which will contain the rows with pinc_id > 0 twice.
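To avoid this duplication, use conditions that are mutually exclusive. For example, a sketch of a non-overlapping pair of load queries for the same view could be:

SELECT *
FROM phone_inc
WHERE pinc_id <= 0
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')

And,

SELECT *
FROM phone_inc
WHERE pinc_id > 0
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')

Because no row satisfies both conditions, each row is cached exactly once.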

Incremental Cache Loads Updating the Existing Cached Data

If the original data can suffer updates apart from inserts and the view has a primary key defined, you can use the parameters ('cache_preload' = 'true', 'cache_invalidate' = 'matching_pk') in the CONTEXT clause. In this case, the Cache Engine updates the already cached rows and caches the new rows that do not yet exist in the cache. The Cache Engine uses the primary key of the view to determine if a row is already cached: a row from the cache and a row from the source are considered the same if the values of their primary key are the same. Adding these parameters to the query is equivalent to executing the view from the “Execute” dialog of the view and selecting the “Store results in cache” and the “Replace rows with the same PK value” check boxes.

For example, let us say the table user from the previous example can suffer updates as well as insertions, and it includes a column modification_date that identifies the moment of the last update. In that case, we could update the cached data with the following query:

SELECT *
FROM user
WHERE modification_date >= '@LAST_REFRESH_DATE'
CONTEXT (
  'cache_preload' = 'true'
, 'cache_invalidate' = 'matching_pk'
, 'cache_wait_for_load' = 'true'
, 'cache_return_query_results' = 'false')

This query only reads those users whose modification date is later than the last cache preload, and as it has the parameter 'cache_invalidate' = 'matching_pk', the Cache Engine updates the rows whose primary key already exists in the cache table and caches the new users obtained from the source.

A view has to meet the following requirements to support this cache load method:

  1. A Primary Key defined on the cached view.

  2. A cache data source that supports this feature. See the section Data Sources That Support Merge Data to get the list of data sources that support the parameter 'cache_invalidate' = 'matching_pk'.

  3. The source cannot remove rows from the data; only add or update them. If rows are deleted in the source, the view will return the deleted rows until the cache of this view is invalidated and cached again. Many scenarios meet this requirement because rows are not actually deleted but instead, are marked as deleted or invalid.

  4. If you include a condition like the one in the example above, Virtual DataPort must be able to push down to the source the conditions over the fields that indicate when each row was added or updated.

If the view does not have a primary key, or the cache data source does not support the previous feature, you can use the option 'cache_invalidate' = 'matching_rows'. Adding the parameters ('cache_preload' = 'true', 'cache_invalidate' = 'matching_rows') causes the Cache Engine to invalidate the cached rows that match the WHERE clause of the query and then cache the result of the query. Adding these parameters to the query is equivalent to executing the view from the “Execute” dialog of the view and selecting the “Store results in cache” and the “Invalidate existing results” check boxes.

For example, let us say that you have a table order whose cache mode is full and you need to update the data cached for this view in order to have the most recent records for the current year. However, you do not want to invalidate the cached data for the previous years because it has not changed, and retrieving it all again would take too much time. With the query below, you achieve this: it only returns the orders whose date is in the current year, and as it has the parameter 'cache_invalidate' = 'matching_rows', the Cache Engine only invalidates the rows that match the condition of the query and caches the data obtained from the source.

SELECT *
FROM order
WHERE order_date >= TRUNC( CURRENT_DATE(), 'Y')
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_invalidate' = 'matching_rows'
  , 'cache_wait_for_load' = 'true'
  , 'cache_return_query_results' = 'false')

Take into account that this option is only possible if the cache data source supports UPDATE statements. In general, cache databases using storage systems based on Parquet files, like Hive, Impala, Spark, Athena or Presto, do not support UPDATE statements, although there are a few exceptions, like Impala using Kudu and Spark on Databricks. In those cases where the original data includes updates but the data source does not support UPDATE statements, you will need to perform a full refresh using the option 'cache_invalidate' = 'all_rows'.

The Full Cache Mode at Runtime

When querying a view with Full cache enabled, the data is always retrieved from the cache, even if no cache preload queries have been executed. If the cache of the view has not been preloaded, the queries to this view will return 0 rows.

In terms of performance, the main benefit of using the Full cache mode is that if you combine two views with cache Full enabled, the query is delegated to the cache. This may result in a faster execution time. This optimization is not available when using the Partial cache mode.

For example, if you enable the full cache mode on two views whose data comes from different data sources, and you execute a JOIN between them, the join will be delegated to the cache database instead of executing it in the Server.

Important

When this cache mode is enabled, the cache engine does not check the cache preload queries. For example, if the view was loaded with this query:

SELECT *
FROM employee WHERE id = 1
CONTEXT ('cache_preload' = 'true', 'cache_invalidate' = 'all_rows')

And you later execute this query:

SELECT *
FROM employee

The data is retrieved from the cache, which means that you will only obtain the rows that match the condition id = 1.


The cache engine does the following when a query has the parameter 'cache_preload' = 'true':

  1. The cache engine checks that the table that is supposed to store the cache data of this view exists. It also checks that the table has the appropriate schema. This table is created when you enable the full or partial cache mode.

  2. If the query includes the variable @LAST_REFRESH_DATE, Denodo replaces that variable with the timestamp of the last cache load.

  3. The cache engine checks if it is possible to let the cache data source perform the operation without transferring any rows to Denodo. The “cache preload query” must meet these conditions:

    1. The data is obtained from a single JDBC data source.

    2. The Cache Engine is configured to use said data source.

    3. The entire query is delegated to the same database.

    4. The database supports the statement INSERT INTO SELECT.

    If all these conditions are met, the Cache Engine executes the statement INSERT INTO ... SELECT ... to store the result of the query directly into the appropriate table of the schema that the cache engine uses in this database. In this case, the data is not transferred from the database to Denodo and back to the same database. In addition, Denodo does not have to process the data.

    If the load query includes the parameter 'cache_invalidate' = 'matching_pk', the Cache Engine sends a MERGE command to the data source instead.

    Take into account that in this situation, the client application (Administration Tool, Design Studio, etc.) will not receive the rows inserted into the cache table. That is because Virtual DataPort does not receive these rows either; it executes the INSERT INTO / MERGE INTO <cache table> SELECT ... statement to store the data directly in the cache table, without the data passing through Denodo. If you want the client application to receive the inserted rows, add the parameter 'cache_return_query_results' = 'true' to the CONTEXT clause of the query. Note that this parameter disables this optimization and will cause the query to run slower.

  4. If the conditions of step #3 are not met and the load query includes the 'cache_invalidate' = 'matching_pk' option, then the Cache Engine executes the following steps:

    1. Queries the source using the WHERE condition specified in the query.

    2. Creates a temporary table in the cache data source and inserts the result of the previous query.

    3. Executes a MERGE command on the cache data source to merge the data of this temporary table into the cache table (an operation also known as UPSERT).

    4. Drops the temporary table.

If something goes wrong during the insertion, the operation is aborted before the MERGE command; because the rows are inserted in a temporary table and not directly in the cache, the cache is never left with inconsistent data. VDP updates the value of @LAST_REFRESH_DATE only after a successful operation, so its value is always the date of the last successful cache load. If the cache load fails while reading or writing the data, VDP does not update the value of @LAST_REFRESH_DATE.
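Conceptually, these steps correspond to statements like the following sketch; the table names, columns and exact MERGE syntax are illustrative, as the real ones are generated by the Cache Engine and depend on the cache database:

-- 2. Create the temporary table and insert the rows read from the source
CREATE TABLE tmp_load (id INT, name VARCHAR(200), modification_date TIMESTAMP);
INSERT INTO tmp_load VALUES (1, 'alice', CURRENT_TIMESTAMP);

-- 3. Merge (upsert) the temporary table into the cache table,
--    matching on the primary key of the view
MERGE INTO cache_user c
USING tmp_load t ON (c.id = t.id)
WHEN MATCHED THEN
  UPDATE SET c.name = t.name, c.modification_date = t.modification_date
WHEN NOT MATCHED THEN
  INSERT (id, name, modification_date) VALUES (t.id, t.name, t.modification_date);

-- 4. Drop the temporary table
DROP TABLE tmp_load;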

  5. If the conditions of step #3 are not met and the load query does not include the 'cache_invalidate' = 'matching_pk' parameter:

    • As soon as the query begins returning rows, the cache engine begins inserting them into the cache table of the view, with the state “processing”.

    • Once all the rows are inserted in the cache table, the cache engine starts a transaction on the cache database. In this transaction, the cache engine executes one or two queries, in this order:

      1. If the query has the parameter 'cache_invalidate', it will execute an UPDATE statement to change the rows with status “valid” to “invalid”.

        If the query has 'cache_invalidate' = 'all_rows', the status of all the “valid” rows switches to “invalid”.

        If the query has 'cache_invalidate' = 'matching_rows', the status of the “valid” rows that match the WHERE condition of the query switches to “invalid”. If the query does not have a WHERE clause, matching_rows is equivalent to all_rows.

      2. An UPDATE statement that changes the status of the “processing” rows to “valid”.

    • The cache engine commits the transaction, and from that point on, the queries that hit this view will see the new data.
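Conceptually, this transaction corresponds to statements like the following; the table name and the status column are illustrative, as the real schema is managed by the Cache Engine:

-- Inside a transaction on the cache database:

-- 1. Invalidate the previously valid rows (all of them for 'all_rows';
--    only those matching the WHERE condition for 'matching_rows')
UPDATE cache_view SET row_status = 'invalid' WHERE row_status = 'valid';

-- 2. Promote the newly inserted rows to valid
UPDATE cache_view SET row_status = 'valid' WHERE row_status = 'processing';

COMMIT;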

This process ensures that the queries that hit a view with full cache mode always return valid data, even if another query hits the view while its cache is being loaded.

If a query hits the view while data is being inserted on the cache table, the data is valid because the cache engine always returns the rows with status “valid”. The rows that have been inserted so far have the status “processing”, not “valid.”

If a query hits the view while the status of the rows is being changed from “processing” to “valid”, the data returned is valid because this UPDATE is performed inside a transaction. Therefore, this query will still return the “old” rows, but will not return partially modified rows.

If 'cache_wait_for_load' is 'false', the client is notified that the query finished as soon as all the rows of the result have been returned, even if the data has not been completely inserted in the cache database. If 'cache_wait_for_load' is 'true' or not present, the client is notified that the query finished only after all the steps listed above finish.
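For example, to launch a preload without making the client wait for the cache load to finish, set the parameter to 'false':

SELECT *
FROM customer
CONTEXT (
    'cache_preload' = 'true'
  , 'cache_invalidate' = 'all_rows'
  , 'cache_wait_for_load' = 'false'
  , 'cache_return_query_results' = 'false')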
