Business problem

Scenario

We work for a company that sells outdoor furniture and equipment, PPY International. The company sells its products in several countries in North America and Europe, both to end-customers and distributors.

The situation we face today is that we have received an urgent request from the VP of Marketing. She needs a report on the performance of the different marketing campaigns that have been running in different countries and regions, to plan and budget for next year's campaigns. The campaign data is attached to her email as an Excel file.

The first problem that we have is that the information needed to create that report is not readily available in a single place and in an unified format. The Marketing data is in an Excel file, and, for this typical example, we have the product and sales information in a relational database, stored in a star schema. This is a fairly simple situation, but it is common to find data spread across five or more different data sources.

Once we get the data together, we find another problem: information that comes from different sources usually needs to be normalized so it can be combined together in a meaningful way. In our case, the most obvious case for normalization will be with the country names used in the Excel spreadsheet (see below).

Finally, traditional approaches to business intelligence tend to result in building the same logic over and over again across different BI and Reporting tools for generating similar, or even the same reports, in different places or by different departments. Having the information centralized by the Data Virtualization layer means that the business logic needed to create all the reports will be in a single spot, so it can be easily maintained and modified, and it can be easily shared with a set of external BI tools without having to replicate code across them.

Let's take a look at the data sources of our example.

Data Sources

Marketing promotions

The first data source is the Excel file that we received from Marketing. It has two worksheets:

  1. The Marketing promotions sheet contains information about the promotions that the marketing department carried on in the past. Each marketing promotion has a start and an end date, a country where it ran and a product that applied to, plus several other fields such as the discount per item, the type of promotion, etc.

  2. The second sheet contains a lookup table. If we take a detailed look at the first sheet, we will see that the country names have a variety of spellings and formats. It is very common for some of the data to be typed manually - and, in this case, data tends to apply the format rules in a rather flexible way.
    To solve this issue and normalize the data, we have chosen to quickly write a lookup table within the Excel file itself, in a separate worksheet, to be able to go from the incorrect name to the standardized one.

Products and sales

The other main data source we have is our central products and sales database. Let's take a look at the schema:

The database is structured as a star schema, with the main sales fact table and the areas, dates, ordermethods and products dimension tables. Each table contains the following information:

  • Sales: this is a fact table that contains one entry per individual item transaction. It has an integer ID for the sale and the rest of fields are foreign keys to entries in the other dimension tables (dates, products, ordermethods and areas), and an ID for the marketing promotion id that this item was sold under, if any (null otherwise).

  • Areas: the areas dimension table contains information about the countries where our company operates in.

  • Dates: a dimension table with IDs for all the dates that are referred to by events in our database.

  • Ordermethods: a dimension data for which methods were used to order the product in a sale.

  • Products: the main product reference in our database, contains some generic product data.

You can find all the data in this zip file, that contains the Excel spreadsheet and a .sql file that you must load in a MySQL server (see the Data Virtualization Basics tutorial for more information about setting up your MySQL instance).