Applies to:
Denodo 8.0
,
Denodo 7.0
Last modified on: 18 Jun 2020
Tags:
Combination
Optimization
Introduction
Slowly changing dimensions (SCD) are data warehouse dimensions that store and manage both current and historical data over time. Examples of such dimensions could be: customer, geography, employee. The dimension tables are structured so that they retain a history of changes to their data. This record of data changes provides a basis for analysis. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule.
The three most common types of slowly changing dimensions are defined as follows:
or using effective date columns:
One of the most common scenarios in data warehouse modelling is to keep historical data using Type 2 SCD. In this article we will show how to model this approach in Denodo.
Modelling in Denodo
We will use an example based on the business model defined in the TPC-DS benchmark. TPC-DS models the decision support functions of a retail product supplier. The supporting schema contains business information, such as customer, items, promotions…
The simplified version that we will use consists of multiple dimension tables and a fact table (store_sales). The dimension tables can be classified into one of the following types:
To simplify the queries done from reporting tools, one possible strategy is to expose a single view resulting from joining the fact table with all the dimensions, defining all the associations and primary keys of the model in VDP.
From the reporting tool, the user selects the fields and aggregations needed for the report, and the VDP optimizer will automatically remove the unused dimensions from the query to retrieve only the required data from the underlying sources.
To select the required data and avoid duplicates, when retrieving data from the historical dimensions, we will provide a mandatory input date, using a view parameter. To do that, after importing the base views in Denodo, we will create derived views over the historical dimensions.
For example, we can create the derived view for the dimension ‘store’ following these steps:
The next step will be to create the join between the facts table and the dimensions:
At this point, if we try to execute the resulting join view, VDP will require the user to provide three input parameters, once for each historical dimension:
When dealing with large view hierarchies, this can become difficult to manage, and the user in most cases will want to provide the same value to all the parameters, since that is what guarantees temporal consistency.
To avoid this issue, the recommended practice will be to create a new view parameter in the join view, and create Where Conditions associating the new parameter with the dimension parameter. This allows to remove the mandatoriness from the dimension fields and the resulting view will have only one mandatory field:
Note that having defined the primary keys and associations allows the Denodo optimizer to access only the tables needed to resolve the query. For example, the following query, which obtains the total sales by customer and item brand:
SELECT sum(ss_sales_price), c_customer_id, i_brand FROM sales_join_parameter WHERE date_at = '01-01-2005' GROUP BY c_customer_id, i_brand |
When executed in VDP, it will not access other dimension tables such as date, store and promotion. This is the query pushed down to the database:
SELECT sum( s0.ss_sales_price) AS sum_1, s0.c_customer_id AS c_customer_id, s0.i_brand AS i_brand FROM ( SELECT t2.c_customer_id AS c_customer_id, s1.i_brand AS i_brand, t0.ss_sales_price AS ss_sales_price, ? AS date_at FROM ( tpcds.store_sales t0 LEFT JOIN ( SELECT t1.i_item_sk, t1.i_brand, ? AS input_date FROM tpcds.item t1 WHERE t1.i_rec_start_date <= str_to_date( ? , ? ) AND t1.i_rec_end_date >= str_to_date( ? , ? ) ) s1 ON ((t0.ss_item_sk = s1.i_item_sk) ) ) LEFT JOIN tpcds.customer t2 ON ((t0.ss_customer_sk = t2.c_customer_sk) ) ) s0 GROUP BY s0.c_customer_id, s0.i_brand |
The Execution Trace shows the optimizations applied:
Another possibility to implement Type 2 SCD for historical dimensions is to use a column that indicates if the row contains the current value. For example, to keep changes in the dimension customer:
If we want to obtain always the current value, the steps to model the dimension in VDP would be: