USER MANUALS

DAG Cache Management Configuration Section

To define the parameters for the configuration section for DAG Cache Management-type jobs a VDP-type data source must be selected. Once a data source is selected, you can select elements such as a view, a set of views, a folder, a virtual database, or the full server from VDP server tree, by clicking the magnifying glass icon from the Views / Summaries input field.

The job will identify the cache-enabled views and summaries within the selected elements and dynamically create the load processes to execute cache loads or summary refreshes as follows:

  • The job will respect implicit dependencies defined by the graph.

  • Subject to dependency constraints, the job will execute as many tasks as possible in parallel (up to the maximum concurrency level).

  • If the view configuration specifies a condition to retrieve new rows since the last cache load, the load process will be incremental, using the primary key (PK) if available.

For example, if view “cacheViewB” depends on data from view “cacheViewA”, the job will first refresh the cache for “cacheViewA” and then update “cacheViewB”.

  • The magnifying glass icon next to the Excluded Views input field allows you to specify views that should not be refreshed. This is useful when selecting a database or folder but not all views within it need to be refreshed.

  • You can configure the number of views refreshed concurrently during job execution using the Concurrency Level setting.

  • Enabling the Refresh views even though any dependencies finish with errors option allows views to be refreshed even if a dependency fails (for example, if “cacheViewA” fails to refresh, “cacheViewB” will still be refreshed). By default, dependent views will not be refreshed if a dependency fails.

  • When the Use summaries to populate cache when available option is enabled, the job will refresh data using summaries as implicit dependencies when possible. When VDP refresh a summary or cache, it can use the views defined by user (explicit dependencies) or it can determine that the data can be obtained from a summary (implicit dependency). Enabling this option prevents the 'summary_rewrite' = 'off' option from being added to the CONTEXT clause of the query sent to VDP. Disabling this option adds 'summary_rewrite' = 'off' to the CONTEXT clause. This option is disabled by default.

Cache Configuration Parameters

Configuration options applied to queries to refresh cache.

  • The Populate the cache even when a query finishes with errors option controls whether the cache is populated even if a query returns an error. Enabling this option adds 'cache_load_on_error' = 'true' to the CONTEXT clause of the query sent to VDP. Disabling it adds 'cache_load_on_error' = 'false' to the CONTEXT clause. This option is disabled by default.

  • The query Context field allows to specify properties for the CONTEXT clause of the query sent to VDP. For instance, you can specify 'simplify' = 'off' in order to not apply simplifications to the query when executing it. Note that the options 1 applicable to the CONTEXT clause that are configured through the rest of parameters cannot be specified here.

Summary Configuration Parameters

Configuration options applied to queries to refresh summaries.

  • The query Options field allows to specify properties for the OPTIONS clause of the query sent to VDP, which configures the data insertion process. Not all data sources support all parameters. The available insert options are:

    • batch_insert_size: the number of tuples in each batch insert.

    • parquet_row_group_size: the row group size in megabytes. This property is only valid for data sources where Virtual DataPort uses parquet files for data insertion: Athena, Hive, Impala, Presto, and Spark.

Cache Queries

To preload the cache of each view, Scheduler checks if a “Condition to retrieve data modified since the latest cache refresh” is defined in Virtual DataPort. If so, an incremental load is performed; otherwise, a full cache load occurs.

Regarding the cache invalidation before being loaded:

  1. If the load is incremental and the view has a primary key defined, then the option ‘cache_invalidate’ = ‘matching_pk’ is added to the CONTEXT clause of the query that Scheduler sends to VDP. In this case, no rows are invalidated from the cache. VDP updates the tuples that match the primary key of the rows returned by the query and inserts the tuples that do not match.

  2. If the load is incremental but the view does not have primary key, the cache is not invalidated.

  3. If the load is not incremental, the option ‘cache_invalidate’ = ‘all_rows’ is added to the CONTEXT clause of the query which is sent to VDP. In this case, all tuples from the cache are invalidated, regardless of the query.

SELECT *
FROM view
[ WHERE <Condition to retrieve data modified since the latest cache refresh> ]
CONTEXT(
'cache_preload' = 'true',
'cache_wait_for_load' = 'true',
'cache_return_query_results' = 'false',
[ 'cache_invalidate' = { 'all_rows' | 'matching_pk' },
'cache_atomic_operation' = 'true'
'cache_load_on_error' = { 'true' | 'false' } )

Summary Queries

For Summaries, if the VDP field “Custom Load Query” has value in VDP, the query sent to VDP will be incremental.

Non-Incremental Query

REFRESH summary_view
[ OPTIONS ( <option information> [ , <option information> ]* ) ]
TRACE

where

<option information> ::=
      'batch_insert_size' = <literal>
    | 'parquet_row_group_size' = <literal>
    | 'summary_rewrite'='off'

Incremental Query

INSERT INTO summary_view
  [ OPTIONS ( <option information> [ , <option information> ]* ) ]
select * from (<Custom Load Query>)
[ TRACE ]

Note

When the connection to VDP is lost, the user can save the job as draft in order to not lose the work.

1

These properties are cache_preload, cache_wait_for_load, cache_return_query_results, cache_invalidate, cache_atomic_operation and cache_load_on_error.

Add feedback