How to model slowly changing dimensions in Denodo

Applies to: Denodo 8.0 , Denodo 7.0
Last modified on: 18 Jun 2020
Tags: Combination Optimization

Download document

You can translate the document:

Introduction

Slowly changing dimensions (SCD) are data warehouse dimensions that store and manage both current and historical data over time. Examples of such dimensions could be: customer, geography, employee. The dimension tables are structured so that they retain a history of changes to their data. This record of data changes provides a basis for analysis. Data captured by Slowly Changing Dimensions (SCDs) change slowly but unpredictably, rather than according to a regular schedule.

The three most common types of slowly changing dimensions are defined as follows:

  • Type 1: No history of data changes. An update on the dimension table overwrites the specified columns without retaining a history of the changes.

  • Type 2: Full history of data changes. Tracks historical data by creating multiple records in the dimensional tables with separate keys. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made. There are different ways to implement Type 2 SCD, for example using a column to indicate the most updated row:

or using effective date columns:

  • Type 3: Limited history of data changes. Maintains a limited history of changes using multiple columns for selected variables. For example, a Type 3 dimension table containing customer information has columns named New Postal Code, Old Postal Code, and Oldest Postal Code. Data is moved from column to column during the loading process. Type 3 SCD has less analytical value than Type 2 SCD.

One of the most common scenarios in data warehouse modelling is to keep historical data using Type 2 SCD. In this article we will show how to model this approach in Denodo.

Modelling in Denodo

We will use an example based on the business model defined in the TPC-DS benchmark. TPC-DS models the decision support functions of a retail product supplier. The supporting schema contains business information, such as customer, items, promotions…

The simplified version that we will use consists of multiple dimension tables and a fact table (store_sales). The dimension tables can be classified into one of the following types:

  • Static: The contents of the dimension are loaded once during database load and do not change over time. The date dimension is an example of a static dimension.
  • Historical: The history of the changes made to the dimension data is maintained by creating multiple rows for a single business key value.  Each row includes columns indicating the time period for which the row is valid. Item is an example of a historical dimension.
  • Non Historical: The history of the changes made to the dimension data is not maintained. As dimension rows are updated, the previous values are overwritten and this information is lost. Customer is an example of a Non-Historical dimension.

To simplify the queries done from reporting tools, one possible strategy is to expose a single view resulting from joining the fact table with all the dimensions, defining all the associations and primary keys of the model in VDP.

From the reporting tool, the user selects the fields and aggregations needed for the report, and the VDP optimizer will automatically remove the unused dimensions from the query to retrieve only the required data from the underlying sources.

To select the required data and avoid duplicates, when retrieving data from the historical dimensions, we will provide a mandatory input date, using a view parameter. To do that, after importing the base views in Denodo, we will create derived views over the historical dimensions.

For example, we can create the derived view for the dimension ‘store’ following these steps:

  • Create a derived view of type ‘selection’ and create a view parameter in the model tab:

  • Add a selection condition using the new view parameter in the ‘Where Conditions’ tab:

The next step will be to create the join between the facts table and the dimensions:

At this point, if we try to execute the resulting join view, VDP will require the user to provide three input parameters, once for each historical dimension:

When dealing with large view hierarchies, this can become difficult to manage, and the user in most cases will want to provide the same value to all the parameters, since that is what guarantees temporal consistency.

To avoid this issue, the recommended practice will be to create a new view parameter in the join view, and create Where Conditions associating the new parameter with the dimension parameter. This allows to remove the mandatoriness from the dimension fields and the resulting view will have only one mandatory field:

Note that having defined the primary keys and associations allows the Denodo optimizer to access only the tables needed to resolve the query. For example, the following query, which obtains the total sales by customer and item brand:

SELECT sum(ss_sales_price), c_customer_id, i_brand FROM sales_join_parameter WHERE date_at = '01-01-2005'

GROUP BY c_customer_id, i_brand

When executed in VDP, it will not access other dimension tables such as date, store and promotion. This is the query pushed down to the database:

SELECT

   sum( s0.ss_sales_price) AS sum_1,

   s0.c_customer_id AS c_customer_id,

   s0.i_brand AS i_brand

FROM

   (

      SELECT

         t2.c_customer_id AS c_customer_id,

         s1.i_brand AS i_brand,

         t0.ss_sales_price AS ss_sales_price,

         ? AS date_at

      FROM

         (

            tpcds.store_sales t0

            LEFT JOIN

               (

                  SELECT

                     t1.i_item_sk,

                     t1.i_brand,

                     ? AS input_date

                  FROM

                     tpcds.item t1

                  WHERE

                     t1.i_rec_start_date <= str_to_date( ? , ? )

                     AND t1.i_rec_end_date >= str_to_date( ? , ? )

               )

               s1

               ON ((t0.ss_item_sk = s1.i_item_sk) )

         )

         LEFT JOIN

            tpcds.customer t2

            ON ((t0.ss_customer_sk = t2.c_customer_sk) )

   )

   s0

GROUP BY

   s0.c_customer_id,

   s0.i_brand

The Execution Trace shows the optimizations applied:

Another possibility to implement Type 2 SCD for historical dimensions is to use a column that indicates if the row contains the current value. For example, to keep changes in the dimension customer:

If we want to obtain always the current value, the steps to model the dimension in VDP would be:

  • Create a selection view over the dimension base view with a condition to select the current value:

  • Set the correct primary key in this derived view. As we are selecting only one row per customer, the customer id can be the PK.
  • Use this derived view to create the join with the facts table.

Questions

Ask a question
You must sign in to ask a question. If you do not have an account, you can register here

Featured content

DENODO TRAINING

Ready for more? Great! We offer a comprehensive set of training courses, taught by our technical instructors in small, private groups for getting a full, in-depth guided training in the usage of the Denodo Platform. Check out our training courses.

Training