This document is aimed at Data Architects and Developers interested in understanding and using the ‘Incremental Queries’ functionality in Denodo. This functionality is designed to improve data retrieval performance from sources accessed through a Wide Area Network (WAN), such as Software as a Service (SaaS) applications (e.g. Salesforce, Google Apps,...).
‘Incremental Queries’ work by merging in real-time the data cached for a view, with the data from the source that has been either added or updated since the last time the cache was refreshed. This way, Denodo can return fully up to date results, without needing to retrieve the full result set from the data source (only the rows added/changed from the last cache refresh).
In order to apply this functionality on a certain view, there must exist a way to obtain the rows that have been added/changed in the source. SaaS applications usually include a timestamp column which can be used for this purpose.
The document includes:
Denodo incorporates a Cache Engine that can store a local copy of the data retrieved from the data sources in a JDBC database. This may reduce the impact of repeated queries hitting the data source and speed up data retrieval, especially when the data sources are accessed through a WAN.
Previously to this new feature, Denodo had two main cache modes:
You can see the Virtual DataPort Administration Guide, section Cache Modes, for more details about the different cache modes in Denodo.
None of the modes listed above allow to have fresh data from the data source, because they go to the cache or to the data source, but not to both in the same query. There is a need for a new cache feature that combines these two modes: the incremental option.
The incremental queries functionality is an option of the full cache mode that allows merging the results obtained from cache with the most recent data retrieved from the source. As it is a variant of the full mode, the cache loads still need to be explicit, i.e, new data is not automatically stored in the cache.
This mode allows enhancing response times by minimizing network traffic at query time: Denodo uses the cache to retrieve most of the data, and only needs to retrieve the added/changed data from the data source.
There are several requirements that need to be met in order to use this option:
With this mode, rows are returned as following:
In this section we are going to see two examples with SaaS sources involved:
To configure the ‘Incremental Queries’ functionality on a view, follow these steps:
Figure 1 Cache configuration for Incremental Queries.
The recommended way to work with this mode is to have a Denodo Scheduler job of type VDPCache (see Scheduler Administration Guide for more information) to preload the cache of the cached views every certain time (every day, for instance). The job could be configured to preload the full data in the cache by invalidating the old data or to only get the data for the last period of time (only if the source does not allow updates or deletions).
This way, we can configure the job to be executed daily in the morning (08:00 am, for instance), as shown in Figure 2 and Figure 3. So, when the user queries the Denodo views, the cache will contain the complete data that was in the source until that time. If, later, the source contains new data or some has been updated, Denodo will take it into account in subsequent queries: most of the data will come from the cache and the newest one from the original source.
Figure 2 Scheduler job to preload the cache.
Figure 3 Scheduler job to preload the cache: scheduling.
In order to access the data from a Salesforce account in Denodo 7.0, you can create the salesforce data source by referring to the Salesforce Sources section of the Virtual DataPort Administration Guide. For previous versions of the Denodo Platform, you can download the Denodo Salesforce Custom Wrapper from the Denodo Support website. Then, follow the instructions to install it and configure it.
Once it is installed, you can create a new data source (Figure 4) and then activate the incremental option for each base view (as shown in Figure 1).
In this example we are going to create a view for the support cases that Acme has stored in Salesforce. General information about support cases is stored in the entity ‘Case’ and its comments in ‘casecomment’, so we need to create two base views in Denodo (sforce_supportcase and sforce_casecomment) and then join them to create the derived view ‘supportcases’, which contains every support case with its comments (Figure 5).
Figure 4 Salesforce data source.
Figure 5 supportcases view: joining support cases with their comments.
In this example, we have the following volume of data:
So, in our scenario, we will spend less than 2 minutes every day to have the cache filled with all the support cases and their comments that were in Salesforce at the moment that the preload job from Scheduler is executed (at 08:00 am). If later users execute queries to retrieve all this data, we get an execution tree as the one shown in Figure 6. Here we can see that for every branch of the main join, data is combined from the cache (blue boxes) and from the source (green boxes), so at the end, we have all the data. This query takes only 2 seconds. If we have not enabled the cache for the two base views, this query would take 1 minute to go to the sources and retrieve the data, every time. Or, if we had configured the full mode for the cache, users will lose last updated data.
Figure 6 Querying supportcases view: getting real time data.
In this example, we are going to retrieve the documents from a Google Drive account. To do so, we will create a JSON data source and specify the resource (https://www.googleapis.com/drive/v2/files) we want to access authenticating ourselves via OAuth 2.0. To know how to integrate the Google Drive API into Denodo, read this document first.
Once the authentication is set and the API is enabled, a JSON data source can be created in Denodo to access the REST API of Google Drive.
In this source, the files contain a field called ‘modifiedDate’ that we will use to parameterize the query to only retrieve the documents modified after a date (we use the interpolation variable ’@lastmodified’, that will be replaced at execution time with its value). Note that we use the interpolation function ExecuteIfIsNotNull to only add the parameter ‘q’ to the URL if the interpolation variable ‘lastmodified’ is not null (see Figure 7). This way, we are making this variable optional, so we can execute the view to retrieve all the files from the source.
The steps to configure the data source and the view are as follows:
Figure 7 Google Drive URL configuration: interpolation variable.
Figure 8 Google Drive Authentication configuration: OAuth 2.0.
Figure 9 Google Drive JSON Wrapper configuration: change Tuple Root.
Figure 10 Google Drive view: default data types as text.
Figure 11 Google Drive view: setting timestamptz data types.
Figure 12 Google Drive cache configuration & mapping with the query.
CREATE OR REPLACE MAP i18n rfc_3339_utc (
'country' = 'US'
'datepattern' = 'yyyy-MM-dd''T''HH:mm:ss.SSS''Z'''
'doubledecimalposition' = '2'
'doubledecimalseparator' = ''
'doublegroupseparator' = ''
'language' = 'en'
'timezone' = 'GMT'
Figure 13 Creating i18n map to deal with Google Drive dates.
Figure 14 Assigning the new i18n map rfc_3339_utc to the view.
Figure 15 Storing the results in cache with the where condition lastmodified = ''
Like in the Salesforce example, when querying the view this way for retrieving all the results, the execution plan shows that Denodo combines results coming from the cache and from the real source, as shown in Figure 16.
Figure 16 Querying gdrive_api_lastmodified view: getting real time data.