
Configuring the Materialization of a View (previously known as Cache)

Before configuring the materialization of a specific view, you have to enable the cache module in the Server (see section Configuring the Cache).

To configure the materialization of a view, open the view, click Options and select one of the following types:

  • Virtual. Data will be retrieved from the underlying models in real-time. This is the default.

  • Full. The Server assumes that the materialization contains all the tuples of the view. Therefore, queries on this view will always use the materialized data and will never “hit” the underlying model. The materialized data is always loaded explicitly, meaning that you need to execute a query to load data into the materialization.

  • Query results. When a query is executed against the view, the Server checks if the materialization contains the data required to answer the query. If not, data will be retrieved from the underlying models.
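These modes can also be set with a VQL statement. The following is a minimal sketch, assuming Denodo’s ALTER TABLE … CACHE syntax, where Query results corresponds to the PARTIAL mode, and a hypothetical view named customer360; check the VQL Guide of your version for the exact syntax:

```sql
-- Illustrative VQL; "customer360" is a hypothetical view name.
ALTER TABLE customer360 CACHE OFF;      -- Virtual (default)
ALTER TABLE customer360 CACHE PARTIAL;  -- Query results
ALTER TABLE customer360 CACHE FULL;     -- Full
```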

If you execute an IDU operation (i.e. INSERT, DELETE or UPDATE) over a materialized view, take the following into account:

  • These operations are executed directly against the source tables, not the materialization tables.

  • These operations invalidate the current materialized data.

The section Materialization Types explains in detail how these cache types work.

Materialization Schema

Custom table name lets you specify the name of the materialization table instead of using an autogenerated name. This option is only available for the Full materialization type.

If you are activating the cache for the first time, VDP will try to create a table with the provided name. If there is a name conflict and the table name is not available, the table name will contain a numeric suffix.

If you change the custom table name on a view that already has a full materialization, VDP will try to rename the underlying table. If table renaming is not available in the database, VDP will issue a warning and the table’s name will remain unchanged.

Note

Because of potential name conflicts, and because some databases do not support table renaming, VDP cannot guarantee that the materialized table will have the configured name. To address this, use the “Fetch table name” option to determine the table’s actual name after creation.

Limitations on Table Renaming

The table renaming feature is not available in the following cases:

  • On Azure Data Platforms:
    • Azure SQL

    • Azure Synapse SQL

  • On Hadoop-based Databases:
    • Presto, Impala, Databricks, etc.

  • Amazon Athena

  • Under Specific Configurations:

    It is not supported when using a custom catalog or schema for Sybase-based, SQL Server-based and PostgreSQL-based databases, including:

    • Netezza

    • Microsoft Fabric DW

    • SQL Server-based databases

    • Sybase-based databases

    • PostgreSQL-based databases


Schema evolution

You can configure the cache’s behavior for when the view’s schema is modified. The available options are as follows:

  • Recreate schema: This is the default behavior. If a view’s schema is updated and it no longer matches the materialized table, VDP recreates the table, resulting in the loss of all previously materialized data.

  • Adapt schema: When this option is selected, VDP will attempt to update the materialized table to match the view’s new schema.

    Columns will not be automatically dropped unless the Column cleanup option is selected. If that option is not selected, you can drop them manually using the Force column cleanup button.

    Schema evolution may perform the following actions on the materialized table:

    • Add columns: The basic operation of schema adaptation. A new column is added to the materialized table when a new field is created in VDP.

    • Rename columns: This is used to reflect column name changes in VDP. It is also used to substitute a previously existing column with a new one of a different type.

    • Retype columns: When a column’s type changes, VDP may try to adapt the corresponding column of the materialized table, if the database supports that operation, in the following scenarios:

      • Numeric type updated to a larger type (like int to float).

      • Increasing the size of a text column (e.g. a VARCHAR column).

      If VDP cannot change the column type, a new column will be added instead:

      • If column renaming is available, the new column will use the name of the original one.

      • Already materialized rows will contain null values for the new column.

      • The column with the old type will not be removed unless column cleanup is enabled.
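The adapt-schema actions above translate into ordinary DDL executed against the cache database. The statements below are an illustration only, with hypothetical table and column names; the exact dialect depends on the cache database, and these are not the literal statements VDP issues:

```sql
-- Add columns: a field was added to the view in VDP
ALTER TABLE cache_customer360 ADD COLUMN loyalty_tier VARCHAR(50);

-- Rename columns: a field was renamed in VDP
ALTER TABLE cache_customer360 RENAME COLUMN cust_name TO customer_name;

-- Retype columns: widen a column when the database supports it
ALTER TABLE cache_customer360 ALTER COLUMN balance TYPE FLOAT;
```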

Note

Not all databases allow adding, dropping, renaming or retyping columns in their tables, so the available options and the adaptation strategy may differ. For detailed information, see Vendor-specific capabilities for schema evolution.


Time to Live (seconds) is the time to live, in seconds, of the entries stored in the view’s materialized table when using Query results. Once this limit is reached, the entries expire and the next query will retrieve the data from the data source instead of from the materialization. If you select Never expire, the Server will never invalidate the entries stored in the materialization. However, you can still invalidate them manually by clicking Delete stored results.

The Default time to live is the Time to live defined for the database in the “Cache configuration” dialog of the database (Administration > Database Management dialog). If the database does not set the “Time to live”, the cache module uses the global value of this property, which is defined in the “Cache” tab of the Server Configuration dialog.

The rows obtained from the source are inserted in cache in batches. The Batch insert size (rows) determines the number of rows per batch. The Default value is the Batch size defined for the database in the “Cache configuration” dialog of the database (Administration > Database Management dialog). If the database does not set the “Batch size”, the cache module uses the global value of this property, which is defined in the “Cache” tab of the Server Configuration dialog.

Data Load

The data load configuration options depend on the type of materialization selected.

For Query results, you can configure whether the rows retrieved from the underlying model are automatically materialized or whether they need to be stored in the materialization explicitly. See Query Results Type for more details.

For Full materialization you can configure the incremental load method. With this option, queries to this view will be “incremental queries”. That is, queries to this view merge the results obtained from the materialization with the most recent data retrieved from the source. The main benefit of this mode is that queries will always return fully up-to-date results without retrieving the full result set from the data source, only the rows added or changed since the last cache refresh.

To use this mode, the view has to have a field that indicates when a row was last inserted or updated. For example, if the view has a field last_modified_date, you have to enter a condition like last_modified_date > '@LAST_REFRESH_DATE'.

When executing the query, the variable @LAST_REFRESH_DATE is replaced with the timestamp of the last time a query to load the cache was executed and finished successfully.

Because incremental mode is a variant of full materialization, and unlike the query results type, the materialization of the view still needs to be loaded explicitly (i.e. you have to execute a query with the appropriate parameters to load the data into the materialization).
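Putting the pieces together, a load query for an incremental full materialization might look like the sketch below. The view name sales is hypothetical, and the cache_preload / cache_wait_for_load context parameters are assumed from Denodo’s cache loading mechanism; verify them against the documentation of your version:

```sql
-- Incremental load condition configured on the view:
--   last_modified_date > '@LAST_REFRESH_DATE'

-- Explicit load of the materialization. On subsequent runs, only the
-- rows changed since the last successful refresh are fetched and merged.
SELECT *
FROM sales
CONTEXT('cache_preload' = 'true', 'cache_wait_for_load' = 'true');
```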

Warning

The incremental mode has several requirements and limitations that you need to be aware of. The section Incremental Mode lists them.

When using DAG Cache Management jobs for incremental data load you can configure how to execute load queries after the schema of the materialization table has changed.

  • Keep incremental. The default behavior. DAG jobs will always execute an incremental load.

  • Full reload. This option forces DAG jobs to perform a full reload whenever new columns are added to the materialized table. This is useful for preventing null values from appearing in the new columns for rows that were already materialized.

For both Query results and Full you can invalidate materialized data manually. To do so, click Delete stored results and select one of the following options:

  • Invalidate all. Remove all the materialized content for the view.

  • Invalidate partially. Remove only the entries that match the condition in the Edit condition dialog.

  • Invalidate on cascade. If selected, VDP also invalidates the materialization of the subviews of this view. This option is disabled for views with “Full” materialization mode. In addition, if you invalidate a view on cascade, the materialization of the subviews whose mode is “Full” will not be invalidated.
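For reference, the same invalidations can be expressed in a query, assuming Denodo’s cache_invalidate context parameter and a hypothetical view sales; verify the parameter values against the documentation of your version:

```sql
-- Invalidate all: remove every materialized entry of the view
SELECT * FROM sales CONTEXT('cache_invalidate' = 'all_rows');

-- Invalidate partially: remove only the entries matching the condition
SELECT * FROM sales
WHERE region = 'EMEA'
CONTEXT('cache_invalidate' = 'matching_rows');
```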

See the section Invalidate Cache for information about invalidating the materialization content of more than one view at a time.

Cache settings: invalidating the materialized data of a view

Query Execution

In this section you can configure some aspects of the queries sent to cached views. See Query Results Type and Incremental Mode for more details.

Materialization Table Indexes

When enabling the cache for a view, the Cache Module creates a table in the cache’s database that will store the cached data of this view. You can define “Materialization Table Indexes” for this view to speed up the retrieval of data from the view’s cache.

The difference between the materialization table indexes and the indexes defined in the “Indexes” tab is that the latter ones represent the indexes defined in the source. The cache module actually creates the “Materialization Table Indexes” in the cache’s database, when they are defined in this tab.

When the cache is enabled for a view and its cache mode is “Full”, the indexes propagated to its derived views and to the ODBC and JDBC clients are the “Materialization Table Indexes”, not the indexes defined in the “Indexes” tab. That is because queries on the view will always retrieve the data from the cache and not from the source.


To create a cache index, follow these steps:

  1. If the view has indexes (see the “Indexes” tab), you can replicate them in the table that will store the view’s cached data, by clicking on Import indexes from view.

    Note that the indexes will not be created until you click Ok to close the dialog.

  2. To define more indexes in the cache table, click Create index and follow these steps:

    1. Enter the name of the index.

      Note

      Depending on the database you are using to store the cache, the name of the index may need to be unique across the entire Virtual DataPort database and not just the view you are editing. That is because some databases do not allow defining two indexes with the same name, even if they are defined over different tables.

    2. Select Is unique if the index represents a unicity index.

    3. In the left side of the dialog, there is a list of the fields of the view. Select the ones that belong to the index and click Add >>. To select several fields at once, hold Ctrl or Shift and click on the fields you want to select.

      To change the order of the fields in the index, select the check box of the field you want to move and click on image1 and image2.

      To remove a field from the index, select its check box in the list of the right side and click << Remove.

    4. Click Ok to create the index.

  3. To edit a cache index, select its check box and click Edit selected index.

  4. To remove a cache index, select its check box and click Remove selected index.

Cache settings: defining a new “cache index”

Materialization Refresh Jobs

When a view has a cache configured to store data in the cache database’s table, you can load the data manually or create a cache job in Scheduler to do so. This section shows the cache refresh jobs configured for the view and a summary of the most relevant information. You will also have links to the Scheduler Web Administration Tool to get more detailed information about the jobs, manage them, and create new ones.

Note

The Cache refresh jobs feature is only available in Design Studio of the Solution Manager.

The table with the Scheduler cache jobs will load the cache jobs when the collapsible section is expanded the first time. Afterward, to update the table, use the button Reload jobs. The table contains the following information on the cache refresh jobs of the view:

  • Job. The name of the cache job. This is a link that will open a new tab for the job details in the Scheduler Web Administration Tool.

  • Type. The type of the cache job. There are two types of cache jobs in Scheduler: Simple Cache Management and DAG Cache Management jobs.

  • State. The current state of the job. In particular, a job can be running, not running, waiting, or disabled.

  • Last execution. This shows the last time the job was executed. It will be blank if it has never been executed.

  • Next execution. This shows the next time the job will be executed. It will appear blank if it has been disabled or, according to its time-based scheduling, will not run again.

  • Result of the last execution. Shows the completion state of the latest job execution. A job can be completed correctly, end with an error condition, end with a warning, be stopped by the user, or misfire (meaning the scheduled execution did not run). If the job ran and did not finish correctly, this field will be a link to the execution details section of the Scheduler Web Administration Tool to get more detailed information on the last execution result.

  • Cached tuples. The number of tuples/documents that were extracted in the last execution of the job.

To create a new cache job, click the Add refresh job in Scheduler dropdown and choose the type of cache job you want to create: Simple Cache Management or DAG Cache Management jobs. This will open the Scheduler Web Administration Tool in a new tab to create a new cache job of the selected type with some of the fields already filled.

How This Works

  • When a Virtual DataPort server requests a license from Solution Manager, Solution Manager returns, among other things, the identifiers of the environment and the cluster to which this Virtual DataPort server belongs.

  • When a user logs into a Virtual DataPort server with Design Studio, Virtual DataPort returns the environment identifier and the cluster identifier.

  • When a user opens a view that has cache enabled and expands the section Cache refresh jobs, Design Studio does this:

    1. It sends a request to the API of the Solution Manager to obtain the Scheduler server’s hostname registered in the same environment and cluster as this Virtual DataPort.

      If there are multiple Scheduler servers in the same cluster, the enabled one will be selected first. If none of them are enabled, then a disabled one will be chosen.

    2. It sends a request to the Scheduler API to obtain the information about the jobs that refresh the view’s cache.

Design Studio will connect to the Scheduler Web Administration Tool deployed in the same Tomcat of the Solution Manager installation.

Prerequisites to Use this Feature

  • Follow the steps of the appendix Setting-up Features of Design Studio that Rely on Solution Manager

  • In Solution Manager:

    • The Scheduler server that loads the cache of the views of the Virtual DataPort you connect to has to be registered on the same environment and cluster.

    • If the cluster has multiple Virtual DataPort servers, all must have the same metadata (either having a shared metadata database or just having the same elements).

    • If the cluster has multiple Scheduler servers, all must share the same metadata database.

Configuration Steps

  1. Follow the steps of the section Configuration Steps

  2. Edit the <SOLUTION_MANAGER_HOME>/conf/design-studio/DesignStudioBackend.properties and change the value of these properties:

    • scheduler.api.host: the hostname of the Scheduler Web Administration Tool of the Solution Manager. If the Scheduler Web Administration Tool is configured to use SSL, ensure the hostname of the machine matches the SSL certificate’s Common Name (CN) or Subject Alternative Name (SAN). 127.0.0.1 by default.

    • scheduler.api.port: the port of the Scheduler Web Administration Tool. 19090 by default.

    • scheduler.api.ssl: whether the Scheduler Web Administration Tool is configured to use SSL.

    • scheduler.api.context-path: (optional) the context path of the Scheduler Web Administration Tool. /webadmin/denodo-scheduler-admin/ by default.

  3. Restart Design Studio.
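For example, after step 2 the relevant fragment of DesignStudioBackend.properties could look like this; the hostname is a placeholder for your environment and the other values are the defaults listed above:

```properties
# Scheduler Web Administration Tool used by Design Studio
scheduler.api.host=scheduler.example.com
scheduler.api.port=19090
scheduler.api.ssl=true
scheduler.api.context-path=/webadmin/denodo-scheduler-admin/
```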
