The Denodo Platform Optimizer utilizes a variety of advanced techniques that allow integrating disparate data sources in real time for data-intensive scenarios. See References section below for further information regarding the Denodo Optimizer.
With that said, sometimes fully optimized queries (even queries completely pushed down to a single data source) can not fulfill demanding SLA’s in particular use cases. For example when the query is very complex or the data source is not powerful enough.
In those scenarios, pre-calculated results stored in the cache can be the solution.
In this context, with cache being a form of replication, the question that arises is: What is the difference between using cache in Denodo and other replication approaches, like ETL?
Let’s take a look in detail at the differences and benefits of using Denodo cache in those scenarios:
- Replicate only what is needed: Let's say we are creating a new data mart. With caching, only the necessary data is replicated. This may be the data of certain sources or only certain data subsets of a given source (e.g. the most frequently used). Data that does not require caching is accessed in real-time.
- A physical data mart using ETL, requires to replicate all the required data.
- Additionally, Denodo Smart Query Acceleration allows creating Summaries which are pre-aggregated and pre-combined small data subsets which the optimizer can use to accelerate many queries and reports.
- Decoupling/Abstraction: Client applications are not tied to the data sources. Denodo’s abstraction capabilities give flexibility and control to store information wherever is best. The cache can be in the same system where the data originally resides, or in some other 3rd party system that is better suited to your scenario. Cache is transparently managed through Denodo. No direct interaction to the cache data source needed.
- Switch to real-time with one click and/or use real-time access for certain users/queries. Denodo allows having certain users to access real-time data while forcing others to use cached data to protect slow data sources.
- With an ETL, if the target for the table changes (e.g. to a new more powerful database), the client application needs to be modified. If some clients need to access the original data in real time, they will have to use yet another connection, etc.
- Easy extraction process definition/Performance impact: Denodo views are declarative. Combinations and transformations are defined in VQL language but the query optimizer is in charge to automatically choose the best plan to combine and transform the source data to obtain the desired result. Enabling caching for a view in Denodo takes one click. Scheduling refreshes can be easily set up in Denodo’s Scheduler.
- ETL flows are procedural, therefore the flow creator needs to specify exactly the operations required to extract and transform the source data, and also the order to execute them.
- The alternative approach to ETL called ELT (Extraction, Load and Transform) can be declarative in some tools but forces to perform all the transformations and combinations in the target system, which is often impractical.
- Denodo’s Remote Tables functionality allows to ETL efficiently persisting the result of a Denodo view on a table of an external database.
Let's consider the following scenario:
We want to obtain the total number of sales in summer of each product by day:
Denodo can automatically decide for instance to:
- Push down the join between sales and date_dim. Also push down the filter for the summer days.
- Apply the Partial aggregation push down optimization, so the group by operation is divided in two parts: one part that can be pushed down below the join operation, so in this case Netezza executes a group by, and a second part which remains on top of the join to obtain the final results.
- Join data coming from Netezza and Oracle using the best join execution method for this case e.g. MERGE join
- Execute the second group by on top of the join required by the partial aggregation pushdown optimization.
This plan will probably perform much better than a naive ETL flow which simply retrieves all the data, and then performs the join and the aggregation in the ETL tool. While it is possible to create a ETL flow which implements the same execution plan as Denodo, the user needs to define each step manually and also needs to manually write the SQL queries pushed down on each source, since ETL tools do not automatically detect that a part of a flow can be pushed down to a data source. This may create several problems:
- Definition complexity: If the queries that need to be pushed down to each data source involve many operations, writing them can be hard. These queries will also be black boxes from the point of view of the ETL flow
- Performance impact: Performance is more dependent on the developer. The flow creator may fail to identify the best execution plan. For instance, they may fail to identify that partial aggregation pushdown or on the fly data movement are good strategies for a particular flow.
- On Change impact: Minor changes in the flow can totally change the best execution plan. For instance adding a new filter may imply that a different JOIN Strategy should be used to maximize performance. Removing or adding a group by field or a filter may mean that an aggregation pushdown is no longer the best option. In Denodo the execution plan adjusts automatically when there is a change in the definition.
- Easy to maintain: If any change is required, it can be rapidly implemented, just involves metadata modifications. Denodo includes source change detection, and impact analysis functionalities to address changes in the sources or in the data models easily
- ETL jobs are harder to edit and modify when changes happen.
- Reusability: Intermediate views can be used, as if they were layers, by other more refined views, tailored to end users. If intermediate views are modified, those changes will automatically reflect on the final models, simplifying even more the maintenance cost
- ETL jobs will have to be created individually, and changes in common parts will have to be repeated for each one of them, increasing even more the maintenance cost.
- Governance: Data lineage and traceability of the original sources is maintained.
- When ETL the content to a new table, the lineage is lost. New consumers will just use the new table, without any reference to where that data is coming from
In order to help out better determining the caching strategy in Denodo we recommend reading the article Best Practices to Maximize Performance III: Caching
In analytical scenarios, the new query acceleration capabilities available in Denodo 8 can also be very helpful. Although they share many of the benefits of caching, query acceleration techniques are more flexible and adapt better to self-service situations.
To learn more about query acceleration, please check the following webinar: Accelerate your Queries with Data Virtualization
There are also some scenarios where an ETL job will be more adequate:
- It is necessary to normalize data written into multiple tables, and therefore requires managing the generation of surrogate keys.
- Need to deal with historical changes like slowly changing dimensions.
- Flows that have multiple outputs depending on data (e.g. some data quality flows where "good" lands in one place and "bad" follows a different path)
- Operations are not easily done in relational queries (e.g. multipass, pivots, etc.)
The above mentioned are operations usually performed to define a EDW, or in some data quality processes.
Denodo also provides guidance on DV and ETL and when they can be used together.