Slowly changing dimensions (SCD) are data warehouse dimensions that store and manage both current and historical data over time. Examples of such dimensions could be: customer, geography, employee. The dimension tables are structured so that they retain a history of changes to their data. This record of data changes provides a basis for analysis. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule.
The three most common types of slowly changing dimensions are defined as follows:
- Type 1: No history of data changes. An update on the dimension table overwrites the specified columns without retaining a history of the changes.
- Type 2: Full history of data changes. Tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made. There are different ways to implement Type 2 SCD, for example using a column to indicate the most updated row:
or using effective date columns:
- Type 3: Limited history of data changes. Maintains a limited history of changes using multiple columns for selected variables. For example, a Type 3 dimension table containing customer information has columns named New Postal Code, Old Postal Code, and Oldest Postal Code. Data is moved from column to column during the loading process. Type 3 SCD has less analytical value than Type 2 SCD.
One of the most common scenarios in data warehouse modelling is to keep historical data using Type 2 SCD. In this article we will show how to model this approach in Denodo.
Modelling in Denodo
We will use an example based on the business model defined in the TPC-DS benchmark. TPC-DS models the decision support functions of a retail product supplier. The supporting schema contains business information, such as customer, items, promotions…
The simplified version that we will use consists of multiple dimension tables and a fact table (store_sales). The dimension tables can be classified into one of the following types:
- Static: The contents of the dimension are loaded once during database load and do not change over time. The date dimension is an example of a static dimension.
- Historical: The history of the changes made to the dimension data is maintained by creating multiple rows for a single business key value. Each row includes columns indicating the time period for which the row is valid. Item is an example of a historical dimension.
- Non Historical: The history of the changes made to the dimension data is not maintained. As dimension rows are updated, the previous values are overwritten and this information is lost. Customer is an example of a Non-Historical dimension.
To simplify the queries done from reporting tools, one possible strategy is to expose a single view resulting from joining the fact table with all the dimensions, defining all the associations and primary keys of the model in VDP.
From the reporting tool, the user selects the fields and aggregations needed for the report, and the VDP optimizer will automatically remove the unused dimensions from the query to retrieve only the required data from the underlying sources.
To select the required data and avoid duplicates, when retrieving data from the historical dimensions, we will provide a mandatory input date, using a view parameter. To do that, after importing the base views in Denodo, we will create derived views over the historical dimensions.
For example, we can create the derived view for the dimension ‘store’ following these steps:
- Create a derived view of type ‘selection’ and create a view parameter in the model tab:
- Add a selection condition using the new view parameter in the ‘Where Conditions’ tab:
The next step will be to create the join between the facts table and the dimensions:
At this point, if we try to execute the resulting join view, VDP will require the user to provide three input parameters, once for each historical dimension:
When dealing with large view hierarchies, this can become difficult to manage, and the user in most cases will want to provide the same value to all the parameters, since that is what guarantees temporal consistency.
To avoid this issue, the recommended practice will be to create a new view parameter in the join view, and create Where Conditions associating the new parameter with the dimension parameter. This allows to remove the mandatoriness from the dimension fields and the resulting view will have only one mandatory field:
Note that having defined the primary keys and associations allows the Denodo optimizer to access only the tables needed to resolve the query. For example, the following query, which obtains the total sales by customer and item brand:
SELECT sum(ss_sales_price), c_customer_id, i_brand FROM sales_join_parameter WHERE date_at = '01-01-2005'
GROUP BY c_customer_id, i_brand
When executed in VDP, it will not access other dimension tables such as date, store and promotion. This is the query pushed down to the database:
sum( s0.ss_sales_price) AS sum_1,
s0.c_customer_id AS c_customer_id,
s0.i_brand AS i_brand
t2.c_customer_id AS c_customer_id,
s1.i_brand AS i_brand,
t0.ss_sales_price AS ss_sales_price,
? AS date_at
? AS input_date
t1.i_rec_start_date <= str_to_date( ? , ? )
AND t1.i_rec_end_date >= str_to_date( ? , ? )
ON ((t0.ss_item_sk = s1.i_item_sk) )
ON ((t0.ss_customer_sk = t2.c_customer_sk) )
The Execution Trace shows the optimizations applied:
Another possibility to implement Type 2 SCD for historical dimensions is to use a column that indicates if the row contains the current value. For example, to keep changes in the dimension customer:
If we want to obtain always the current value, the steps to model the dimension in VDP would be:
- Create a selection view over the dimension base view with a condition to select the current value:
- Set the correct primary key in this derived view. As we are selecting only one row per customer, the customer id can be the PK.
- Use this derived view to create the join with the facts table.