This document is aimed at Data Architects and Developers interested in understanding and using the ‘Incremental Queries’ functionality in Denodo. This functionality is designed to improve data retrieval performance from sources accessed through a Wide Area Network (WAN), such as Software as a Service (SaaS) applications (e.g. Salesforce, Google Apps,...).
‘Incremental Queries’ work by merging in real-time the data cached for a view, with the data from the source that has been either added or updated since the last time the cache was refreshed. This way, Denodo can return fully up to date results, without needing to retrieve the full result set from the data source (only the rows added/changed from the last cache refresh).
In order to apply this functionality on a certain view, there must exist a way to obtain the rows that have been added/changed in the source. SaaS applications usually include a timestamp column which can be used for this purpose.
The document includes:
- A general explanation of the Denodo cache and the new ‘Incremental Queries’ option.
- Two examples showing how to use it with two popular SaaS applications: Salesforce and Google Drive
Denodo incorporates a Cache Engine that can store a local copy of the data retrieved from the data sources in a JDBC database. This may reduce the impact of repeated queries hitting the data source and speed up data retrieval, especially when the data sources are accessed through a WAN.
Previously to this new feature, Denodo had two main cache modes:
- Full: When this mode is used for a view, it is assumed that the cache contains all the tuples of the view. Therefore, queries on this view will always use the cached data and will never hit the data source. The data source will be only accessed for preloading the cached data. Caching data can significantly improve performance, especially when the data source is accessible through a WAN. Nevertheless, if the data source contains new fresh data after preloading the cache, it will not be reflected in the results of the query until you preload the cache again.
- Partial: the cache only stores some of the tuples of the view. At runtime, when a user queries a view with this cache mode, the Server checks if the cache contains the data required to answer the query. If not, it retrieves the data from the data source and stores it in the cache. This way, if the cache for a view contains data with some_field = some_value and a query only filters results with that exact condition (or a more restrictive condition), they are obtained from the cache; otherwise, Denodo queries the source and stores the result in the cache. Like in the full mode, if there are new tuples in the data source satisfying the filtering condition, Denodo will respond with the data that is currently in the cache, so new data will not be included in the results of the query.
You can see the Virtual DataPort Administration Guide, section Cache Modes, for more details about the different cache modes in Denodo.
None of the modes listed above allow to have fresh data from the data source, because they go to the cache or to the data source, but not to both in the same query. There is a need for a new cache feature that combines these two modes: the incremental option.
The incremental queries functionality is an option of the full cache mode that allows merging the results obtained from cache with the most recent data retrieved from the source. As it is a variant of the full mode, the cache loads still need to be explicit, i.e, new data is not automatically stored in the cache.
This mode allows enhancing response times by minimizing network traffic at query time: Denodo uses the cache to retrieve most of the data, and only needs to retrieve the added/changed data from the data source.
There are several requirements that need to be met in order to use this option:
- This option is only available for base views.
- The view must have a primary key.
- The source must allow to delegate an expression to retrieve the rows that have been added/updated since a certain time.
- The source cannot remove rows, only add/update them.
- It is important to notice that the rows deleted in the data source will continue to appear in the results until the new cache preload. In the common case where rows in source relations are not deleted (e.g. orders data change state to 'closed' but are not deleted), the obtained data is 100% up to date.
With this mode, rows are returned as following:
- If a row from the cache does not match with any row from the source, the row from the cache is returned.
- If a row from the source does not match with any row from the cache, the row from the source is returned.
- If a row from the source matches with a row from the cache (because it has been updated since last cache refresh), the row from the source is returned. This way, returned data is always fresh data.
- Rows match because they have the same primary key.
In this section we are going to see two examples with SaaS sources involved:
- Google Drive.
To configure the ‘Incremental Queries’ functionality on a view, follow these steps:
- Go to the configuration Options for the view.
- In the Cache tab set this new mode and configure the expression to retrieve new data (see Figure 1).
- The condition to retrieve the new/updated rows from the source has to include an interpolation variable ‘@LASTCACHEREFRESH’ that will be replaced at runtime with the timestamp representing the last time the cache was preloaded.
- Denodo automatically adds this condition to the query against the base view.
- For instance, for a Salesforce source it could be lastmodifieddate > '@LASTCACHEREFRESH'.
- When executing the query, Denodo will combine all tuples from the cache with the tuples from the source having a lastmodifieddate after the @LASTCACHEREFRESH.
Figure 1 Cache configuration for Incremental Queries.
The recommended way to work with this mode is to have a Denodo Scheduler job of type VDPCache (see Scheduler Administration Guide for more information) to preload the cache of the cached views every certain time (every day, for instance). The job could be configured to preload the full data in the cache by invalidating the old data or to only get the data for the last period of time (only if the source does not allow updates or deletions).
This way, we can configure the job to be executed daily in the morning (08:00 am, for instance), as shown in Figure 2 and Figure 3. So, when the user queries the Denodo views, the cache will contain the complete data that was in the source until that time. If, later, the source contains new data or some has been updated, Denodo will take it into account in subsequent queries: most of the data will come from the cache and the newest one from the original source.
Figure 2 Scheduler job to preload the cache.
Figure 3 Scheduler job to preload the cache: scheduling.
In order to access the data from a Salesforce account in Denodo 7.0, you can create the salesforce data source by referring to the Salesforce Sources section of the Virtual DataPort Administration Guide. For previous versions of the Denodo Platform, you can download the Denodo Salesforce Custom Wrapper from the Denodo Support website. Then, follow the instructions to install it and configure it.
Once it is installed, you can create a new data source (Figure 4) and then activate the incremental option for each base view (as shown in Figure 1).
In this example we are going to create a view for the support cases that Acme has stored in Salesforce. General information about support cases is stored in the entity ‘Case’ and its comments in ‘casecomment’, so we need to create two base views in Denodo (sforce_supportcase and sforce_casecomment) and then join them to create the derived view ‘supportcases’, which contains every support case with its comments (Figure 5).
Figure 4 Salesforce data source.
Figure 5 supportcases view: joining support cases with their comments.
In this example, we have the following volume of data:
- Case: 7238 rows that take about 20 seconds to be cached.
- CaseComment: 76612 rows that take about 1:30 minutes to be cached.
So, in our scenario, we will spend less than 2 minutes every day to have the cache filled with all the support cases and their comments that were in Salesforce at the moment that the preload job from Scheduler is executed (at 08:00 am). If later users execute queries to retrieve all this data, we get an execution tree as the one shown in Figure 6. Here we can see that for every branch of the main join, data is combined from the cache (blue boxes) and from the source (green boxes), so at the end, we have all the data. This query takes only 2 seconds. If we have not enabled the cache for the two base views, this query would take 1 minute to go to the sources and retrieve the data, every time. Or, if we had configured the full mode for the cache, users will lose last updated data.
Figure 6 Querying supportcases view: getting real time data.
In this example, we are going to retrieve the documents from a Google Drive account. To do so, we will create a JSON data source and specify the resource (https://www.googleapis.com/drive/v2/files) we want to access authenticating ourselves via OAuth 2.0. To know how to integrate the Google Drive API into Denodo, read this document first.
Once the authentication is set and the API is enabled, a JSON data source can be created in Denodo to access the REST API of Google Drive.
In this source, the files contain a field called ‘modifiedDate’ that we will use to parameterize the query to only retrieve the documents modified after a date (we use the interpolation variable ’@lastmodified’, that will be replaced at execution time with its value). Note that we use the interpolation function ExecuteIfIsNotNull to only add the parameter ‘q’ to the URL if the interpolation variable ‘lastmodified’ is not null (see Figure 7). This way, we are making this variable optional, so we can execute the view to retrieve all the files from the source.
The steps to configure the data source and the view are as follows:
- Configure the data source:
- Configure the URL: https://www.googleapis.com/drive/v2/files?maxResults=1000^ExecuteIfIsNotNull("&q=modifiedDate>='",@lastmodified,"'")
- The use of the interpolation function ExecuteIfIsNotNull was already explained.
Figure 7 Google Drive URL configuration: interpolation variable.
- Configure the Authentication:
- It is necessary to follow the steps explained in the former document to get the OAuth tokens.
Figure 8 Google Drive Authentication configuration: OAuth 2.0.
- Create and configure the base view ’gdrive_api_lastmodified’:
- Change Tuple Root:
Figure 9 Google Drive JSON Wrapper configuration: change Tuple Root.
- As the data source is a JSON-type one, by default, every data type is text. We need to manually set the data types to date where appropriate. In addition to that, we need to set a primary key which is a requirement for the incremental cache mode.
Figure 10 Google Drive view: default data types as text.
Figure 11 Google Drive view: setting timestamptz data types.
- Then, we can configure the incremental option to query the source with the @LASTCACHEREFRESH date as the value for the interpolation variable of the query (the cache condition would be lastmodified = '@LASTCACHEREFRESH', as shown in Figure 12).
Figure 12 Google Drive cache configuration & mapping with the query.
- In order to be able to work with the dates from Google Drive API it is necessary to create a new i18n map, like in Figure 13, and assign it to the view (Figure 14).
CREATE OR REPLACE MAP i18n rfc_3339_utc (
'country' = 'US'
'datepattern' = 'yyyy-MM-dd''T''HH:mm:ss.SSS''Z'''
'doubledecimalposition' = '2'
'doubledecimalseparator' = ''
'doublegroupseparator' = ''
'language' = 'en'
'timezone' = 'GMT'
Figure 13 Creating i18n map to deal with Google Drive dates.
Figure 14 Assigning the new i18n map rfc_3339_utc to the view.
- For preloading the cache initially and having all files from the Google Drive covered, the cache needs to be preloaded by the condition lastmodified = ''. This condition sets the field lastmodified as empty in order to avoid the interpolation function ExecuteIfIsNotNull to be executed and therefore retrieving all results (Figure 15).
Figure 15 Storing the results in cache with the where condition lastmodified = ''
- As we have initially loaded the cache, now when executing the query, we need to remove the condition in order to let the mapping between @LASTCACHEREFRESH and modifieddate from the url work (see Figure 12).
Like in the Salesforce example, when querying the view this way for retrieving all the results, the execution plan shows that Denodo combines results coming from the cache and from the real source, as shown in Figure 16.
Figure 16 Querying gdrive_api_lastmodified view: getting real time data.
- Incremental caching requires that there is some way to query the original data source to determine what has changed.
- It is very common that sources in the cloud provide a timestamp column for the last update time.
- It does not detect when rows have been deleted from the source.
- This cache option does not automatically insert or update the tuples in the cache.