Lookout
Expert trails guide Denodo users through all the relevant materials related to a specific topic, including official docs, KB articles, training, Professional Services offerings, and more. The main goal is to give users a single place with references to all the information that they need to become a Denodo expert on any specific topic.
“Why do I need materialization? I thought data virtualization was real-time integration, but materialization (p.k.a. caching) replicates data.” These are some of the questions that may come up with respect to materialization.
In this Expert Trail, you will gain knowledge of materialization, learning how to model it with best practices, its various purposes, and advanced configurations.
The Hike
Stage 1: Purpose of Materialization
The Denodo Platform utilizes a virtual-first, zero-copy approach built on Data Virtualization. The advanced execution engine is optimized for querying diverse external data sources in single or federated environments. The flexible logical model allows seamless integration of multiple access strategies, including real-time virtualization, caching, and selective materialization to data warehouses or data lakes. This ensures flexibility while preserving full logical definitions and lineage throughout the data lifecycle for complete auditing and transparency.
Denodo's materialization module is designed to store local copies of source data when necessary. It is important to note that materialization is intended to complement a virtual-first approach, not replace it, distinguishing it from traditional replication methods like ETL. Materialization serves several key purposes, including:
- For Performance: When you integrate various data sources, there might be a situation where some of the data sources are slower than others (e.g. REST API). In these situations, you can materialize data from the slow data sources and use the materialized data in response to any queries against that data source.
- Transparency and Lineage: Denodo’s materialization feature delivers performance with complete transparency. While Denodo is data-agnostic by default, it allows materialized data to be stored on external databases, making its location fully visible. Every materialization is traceable through execution lineage, so you always know what data was used and where it came from—ensuring trust, governance, and auditability without sacrificing speed.
- To optimize frequently used queries: When there is a pattern of queries with a high frequency of users calling for the same data, these queries can be materialized. The real-time needs of such queries must be analyzed to determine the time to live in the materialization database.
- To minimize source system impact: What happens if everyone starts querying my operational systems in real-time? What will be the performance impact on my operational users who depend on these systems? The Denodo Platform can serve a real-time view of data to certain priority users and partially materialize data to others to minimize source impact while meeting differentiated user needs.
- To protect against intermittent system availability: Consider a scenario where a data source in a regional sales office might only be available during local office hours. Materializing data from these sources can help mitigate against the actual source data not being available.
Let's continue this path by checking the best practices and recommendations for different aspects of the Materialization Module, such as how to choose the materialization database, how to decide which views to materialize, and which materialization mode and refresh strategy best fit each particular use case:
Best Practices to Maximize Performance III: Materialization
For information on other performance optimization techniques, review the Expert Trail: Query Performance Optimization
| | SUMMARY | FULL MATERIALIZATION | QUERY RESULTS (p.k.a. partial) | REMOTE TABLES (link) | TEMPORARY TABLES (link) |
|---|---|---|---|---|---|
| Use cases | | | | | |
| Usage | Dynamic: the optimizer can decide to leverage a summary on a query execution | Always. Data is always retrieved from materialization. | From materialization if possible; otherwise from the underlying source. | Always from the remote table | Always from the temporary table. Only available to the user session |
| Linked to the logical model | FULL lineage, invalidation on changes | FULL lineage, invalidation on changes | FULL lineage, invalidation on changes | Only the SQL query used | None |
| Data Load mechanisms | | | | | |
| IDU | N/A | N/A | N/A | Insert, Delete, Update | Insert |
| Target Materialization Source | Any | Cache data source | Cache data source | Any | Cache data source |
| Table life duration | Manual drop | | | Permanent (can be manually dropped) | Session |
Stage 2: Materialization Configuration
Initial setup:
To make use of materialization, the first step is to configure it. The document Configuring the Materialization (p.k.a. Cache) explains how to enable materialization in the Server.
After this, materialization can be configured for any specific view whose data you want materialized. The document Configuring the Materialization of a View explains how to do this.
Loading/refreshing the materialization views:
Materialized data is liable to become out of date. Therefore, you may need to establish a periodic data refresh plan. Our recommendation is to use Denodo Scheduler to automate the materialization load process, whether it is an incremental or complete load of the materialized view.
- For Query results, you only need to configure materialization refresh processes when using the 'explicit' option.
- For Full mode, you always have to fill in the materialization table with explicit loads. After that, you need to refresh the data periodically.
- If the schema of your materialized views changes at the data source(s), Denodo 9.3 introduces Schema Evolution, a feature that lets you easily adjust your views to these changes. It offers two options:
- For Recreate schema: This is the default behavior. If a view’s schema is updated and it no longer matches the materialized table, VDP recreates the table, resulting in the loss of all previously materialized data.
- For Adapt schema: When this option is selected, VDP will attempt to update the materialized table to match the view’s new schema. Under this option, columns will not be automatically dropped unless the Column cleanup option is selected. If that option is not selected, you can drop them manually using the Force column cleanup button.
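The explicit loads mentioned above are issued as plain VQL queries carrying cache CONTEXT parameters. As a sketch, assuming a hypothetical view named customer (the CONTEXT parameter names are the standard Denodo cache switches), a complete explicit load looks like this:

```sql
-- Complete explicit load of the materialization table of a view.
-- 'customer' is a hypothetical view name used for illustration.
SELECT *
FROM customer
CONTEXT(
    'cache_preload' = 'true',               -- store the retrieved rows in the materialization table
    'cache_invalidate' = 'all_rows',        -- invalidate all previously materialized rows first
    'cache_wait_for_load' = 'true',         -- return only once the load has finished
    'cache_return_query_results' = 'false'  -- do not stream the rows back to the client
);
```

Scheduling this statement as a Scheduler VDP job is the usual way to automate the periodic refresh described above.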
Refreshing the materialized view using a DAG cache job (Recommended):
Denodo Scheduler provides a simpler and easier way to load materialized tables in parallel, subject to dependency constraints, using DAG cache jobs. By graphically choosing elements such as a view, a set of views, a folder, a virtual database, or the full Virtual DataPort server, a DAG cache job intelligently identifies the materialization-enabled views and summaries within the selected elements. Once these have been selected, the job dynamically creates the load processes to load materialized tables or refresh summaries in the right order.
For prior versions or for customized and more flexible controls, use the Cache Load Process.
Refreshing the materialized view incrementally (delta):
Materialization in Denodo can be loaded incrementally instead of reloading the entire view. Using this feature, only deltas are added to the materialized table.
To load a full materialization incrementally, you can configure the Materialization (p.k.a. Cache) Load Process (introduced in Denodo 8.0 Update 20210715) to reference the time of the last execution when doing an incremental materialization load, and use 'matching_pk' for invalidation. With this setting, the materialization engine uses the values of the primary key to identify which rows of the materialization table have to be inserted and which ones have to be updated. For example, the following syntax is supported:
- SELECT * FROM view WHERE <fieldName> > @LAST_REFRESH_DATE CONTEXT( 'cache_preload'='true','cache_wait_for_load'='true', 'cache_invalidate' = 'matching_pk')
Materialization Maintenance:
When the content of a materialized view is invalidated or reaches its time to live, expired/invalidated rows will not appear in query results. Denodo provides a materialization maintenance task that periodically removes these rows. However, in production environments, the recommended setting is to use the CLEAN_CACHE_DATABASE predefined stored procedure instead.
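For example, assuming the default no-argument signature of the predefined procedure, a Scheduler VDP job could run the cleanup during a low-load window:

```sql
-- Removes expired and invalidated rows from the materialization (cache)
-- tables of the current virtual database. Schedule this outside peak hours
-- and avoid overlapping it with materialization load jobs.
CALL CLEAN_CACHE_DATABASE();
```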
Stage 3: Modeling
Depending on your scenario, selecting the right materialization mode plays a vital role.
- When to use Full Materialization:
- For example, if you want to delegate complex operations such as joins, unions, and group by involving several views from different data sources to the materialization database. In this way, the performance of these operations is significantly improved.
- When to use Adapt Schema:
- If you expect the source schema to change frequently and want your materialized table to adapt intelligently, preserving the existing data while implementing the changes (introduced in Denodo 9.3). Note that when the source schema changes, be sure to reload the materialization table, as newly added columns will contain null values until the next reload.
- When to use Incremental Load:
- If you want to load the cache incrementally to improve data retrieval efficiency and minimize the load on source systems. Incremental loads can be configured within Full Materialization (and executed via Denodo Scheduler) to update the cache with new or changed data.
- When to use Incremental mode (designed for APIs):
- If you want to merge the materialized data with the delta records (i.e. records inserted into the source after the materialization is preloaded), then incremental data load mode is a good fit.
- When to use Query results (pka partial cache) mode:
- This mode does not force you to have all the tuples of the view in the materialization table. For example, a REST API with input parameters is a perfect fit. The document Query Results Type provides the details for configuring this mode.
- When to select the ‘Explicit Loads’ Option:
- This option is a good fit when a relatively small subset of the data is queried much more frequently than the rest and, in addition, it is easy to predict which subset is the most frequently queried. The document Explicit Loads provides the details for configuring this mode.
- When to select the ‘Auto’ Option:
- This option should be used when the data source does not return all the results for a certain query. For instance, many websites and web services return only the top ‘n’ results for queries that have too many results. In such cases, unchecking this option could return incomplete results for certain queries. The document Match Exact Queries Only provides the details for configuring this mode.
An additional consideration is described in the Fine-grained privileges and materialization (p.k.a. cache) best practices KB article.
Stage 4: Advanced Configuration
With a better understanding of what materialization can do, now it is time to take a look at some additional configurations or considerations.
Bulk Data Load: Some databases provide bulk load APIs to optimize loading large amounts of data, beyond the default ‘batch insert’ capabilities. This may also imply limitations on the atomicity of the materialization load operation; the Materializing (p.k.a. Cache) Very Large Data Sets documentation provides several recommendations on this.
Materialization Table Indexes: Indexes improve the speed of data retrieval operations on database tables. Thus, defining materialization table indexes for a view allows queries over the materialized data to benefit from them.
Specific conditions: There are some specific conditions you have to take into account depending on the database you use to store materialized data. The document Specific Information about Materialization Databases explains this in detail.
Estimate size of materialization database:
There are several factors involved in estimating the space required for materialization, and they can be highly dependent on the characteristics of the selected database system. The document Materialization database size estimate explains this in more detail.
Materialization table name manipulation: When working with materialized views in an external database, Custom table name and Fetch table name are useful features new in Denodo 9.3. The “Custom table name” option can be used to provide your desired table names in the materialization database. The “Fetch table name” option can be used to retrieve the table name currently in use for the materialization of a particular view.
Stage 5: Summaries
Apart from traditional materialization, Denodo 8 includes smart query acceleration techniques using a new type of view called Summaries or Smart Materialization. These store common intermediate results that the query optimizer can then use as a starting point to accelerate analytical queries.
Summaries do have some advantages over the traditional materialization:
- With traditional materialization, users need to explicitly reference the materialized view to use the precomputed data, whereas summaries are transparently used by the query optimizer, which analyzes and rewrites incoming queries to take advantage of the data in the summary whenever possible.
- Summaries are not restricted to the materialization data source. This means that when you create a summary you can choose the best data source for that particular data, and even store the same data in multiple data sources so the optimizer can choose the best option for each case.
However, traditional materialization does not require a preprocessing phase to decide whether to use it or not. So in cases where you know you will always access certain data in the same way, it is better to continue using traditional materialization to avoid unnecessary overhead. Also, materialization offers more refresh and management options, such as implicit materialization mode, incremental queries, atomic or non-atomic loads, and selective invalidation.
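For reference, a summary is defined in VQL with a CREATE SUMMARY statement over an analytical query. A minimal sketch, in which the view and column names (sales, region, store_id, amount) are hypothetical:

```sql
-- Precomputes sales aggregates per region and store. The query optimizer
-- can transparently rewrite matching analytical queries to read from this
-- summary instead of hitting the underlying sources.
CREATE SUMMARY sales_by_region AS
SELECT region,
       store_id,
       SUM(amount) AS total_amount,
       COUNT(*)    AS num_sales
FROM sales
GROUP BY region, store_id;
```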
Stage 6: Troubleshooting
Finally, the last step is to learn to debug and identify the possible causes when a problem arises.
For example, what if the materialization server goes down? In this case, the behavior of the VQL queries executed on materialized views will depend on the materialization mode: What happens if the materialization database goes down?
Monitoring the materialization process can also help debug any issues that might occur during the load process on the Virtual Data Port server. The document Monitoring - Materialization shows the different fields that are available in the “Materialization Load Processes” table and are useful for debugging. More information about monitoring can be found in the Expert Trail: Monitoring.
Exploration
Fill up your backpack with additional gear:
Materialization Best Practices
- Official Documentation
- KB Articles
- Additional Resources
Materialization related Predefined Stored Procedures
- Official Documentation
Smart Query Acceleration (Summaries)
- Official Documentation
- KB Articles
- Webinars
- Additional Resources
Data Movement (using Materialization)
- Official Documentation
Guided Routes
Denodo Training Courses
Denodo training courses provide expert data virtualization training for data professionals, including administrators, architects, and developers.
If you are interested in Materialization, you should enroll in the following courses:
- Mechanisms for Persisting Data with Denodo: This course covers when and how Denodo can store data itself (Materialization / Materialized tables / Temporary tables / Summaries) and how to manage that data (see the data, remove/invalidate data,...).
- Advanced Features of Denodo Views: This course covers the configuration options available in any Denodo view.
- Beyond Query Optimizer: A Journey into Query Optimization Excellence: This course will talk about the internal details of the Denodo Optimizer to learn how to maximize the performance of the queries executed in the Denodo Platform.
Success Services
Denodo Customer Success Services can help you at the start or at any part of your Materialization trail. You can find information about the Denodo Success Services offering in:
Advisory Sessions
Denodo Customers with active subscriptions have access to request Meet a Technical Advisory sessions.
These are the sessions available related to Materialization.
Infrastructure implementation
- Materialization: Configuration & Administration Best Practices
  - Configure the materialization data source(s) using a supported database.
  - Configure Bulk Load for materialization.
  - Set up the Materialization Maintenance Task.
  - Monitoring the materialization database.
  - Smart Materialization: Summaries.
Materialization: Standards & Best Practices
- Materialization Modes Overview: Review and showcase how Denodo Materialization works:
  - The different materialization modes (Partial, Full, Incremental).
  - Loading and refreshing the materialization.
  - Invalidating the materialization table contents.
  - Indexes.
- Materialization Best Practices
  - Assist you in defining a policy to determine when to use materialization and what type to use, or review your current policy.
  - Advise on selecting the best materialization strategy for a specific scenario.
  - Incremental materialization strategies.
- Summaries Overview and Best Practices: Review and showcase how Denodo Summaries work.
Denodo Scheduler: Standards
- Schedule Materialization Load: Assist you in your first steps creating materialization jobs in Scheduler.
Success Accelerators
In addition to Advisory sessions, Success Services includes Success Accelerators that can help you.
- Operations Quick Start
- Development Quick Start
If you are a Denodo customer, you can reach out to your Customer Success Manager for details about any Guided Route that you need.
Big Hike Prep Check
Let’s see if you are ready to start your big trail. Take this 3-question questionnaire to check your readiness for an enjoyable hike.
Read the questions below, think about the solution and check if you got them right by looking at the solution. Have you become an expert?
- Consider you have a derived (join) view over three base views created over three data sources (1 - JDBC, 2 - SAP, 3 - JSON (REST API)). Looking at the execution trace, you have identified that query performance is affected by the JSON data source. Where would you apply materialization: at the JSON base view level or at the derived view level?
Click here to check if you got it right
The recommendation in this scenario would be at the JSON base view level. From the execution trace you have identified that the post-processing in Denodo is not a problem, but the network latency of the JSON source affects query performance.
- Consider you have a Full materialization on a base view and have preloaded it with 1 million rows. The data in the source does not change much; probably a few hundred rows are added every week, so you decided to schedule a materialization preload once a month. Now you have a requirement that the end users also need to see the changes from the data source (i.e. those few hundred records). What would be the best approach here?
Click here to check if you got it right
Incremental Full materialization mode is a good option for this scenario. For more details, refer to Requirements a View Has to Meet to Support Incremental Queries.
- In Production, you noticed there is a deadlock error in Denodo Scheduler between a materialization preload job and a materialization maintenance job. What could have caused this deadlock?
Click here to check if you got it right
This may happen when the materialization maintenance job is configured to run at the same time as the materialization preload job.
In production environments, the recommendation is to set the “Maintenance” option to “Off” to avoid executing the Materialization Maintenance task periodically, keeping control over when the maintenance task runs. Then, use Denodo Scheduler to program the execution of the CLEAN_CACHE_DATABASE procedure at a time when the Server and the materialization database are not expected to be under heavy load, thus avoiding having the materialization load job and CLEAN_CACHE_DATABASE run in parallel.
