AGILE BI

Welcome to the Denodo Agile BI Tutorial. In this guide, we will review the first steps of using Data Virtualization tools, the Denodo Platform specifically, to analyze enterprise data for gaining operational insight. This tutorial will showcase the main benefits that we get from using Data Virtualization versus connecting the BI tools directly to the data sources.

These are the lessons of this tutorial:

  1. Presentation of the Use Case.
  2. Use Denodo to Implement the Solutions:
  • Connect to Data Sources
  • Combine the Data
  • Create Final Reports to Publish Data for Clients
  3. Consume Data from a BI Tool.

Let's begin by reviewing the business problem at hand.

Scenario

We work for a company that sells outdoor furniture and equipment, PPY International. The company sells its products in several countries in North America and Europe, both to end-customers and distributors.

The situation we face today is that we have received an urgent request from the VP of Marketing. She needs a report on the performance of the different marketing campaigns that have been running in different countries and regions, to plan and budget for next year's campaigns. The campaign data is attached to her email as an Excel file.

The first problem we have is that the information needed to create that report is not readily available in a single place or in a unified format. The marketing data is in an Excel file, while in this example the product and sales information lives in a relational database, stored in a star schema. This is a fairly simple situation, but it is common to find data spread across five or more different data sources.

Once we get the data together, we find another problem: information that comes from different sources usually needs to be normalized before it can be combined in a meaningful way. In our case, the most obvious candidate for normalization is the country names used in the Excel spreadsheet (see below).

Finally, traditional approaches to business intelligence tend to result in building the same logic over and over again across different BI and reporting tools to generate similar, or even identical, reports in different places or by different departments. Having the information centralized in the Data Virtualization layer means that the business logic needed to create all the reports lives in a single spot, where it can be easily maintained and modified, and easily shared with a set of external BI tools without having to replicate code across them.

Let's take a look at the data sources of our example.

Data Sources

Marketing promotions

The first data source is the Excel file that we received from Marketing. It has two worksheets:

  1. The Marketing promotions sheet contains information about the promotions that the marketing department carried out in the past. Each marketing promotion has a start and an end date, the country where it ran and the product it applied to, plus several other fields such as the discount per item, the type of promotion, etc.

  2. The second sheet contains a lookup table. If we take a detailed look at the first sheet, we will see that the country names have a variety of spellings and formats. It is very common for data like this to be typed in manually, and manually entered data tends to follow format rules rather loosely.
    To solve this issue and normalize the data, we have chosen to quickly write a lookup table within the Excel file itself, in a separate worksheet, so we can map each incorrect name to the standardized one.
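For instance, the lookup worksheet might contain pairs like the following (these values are made up for illustration; the real spreadsheet will have its own variants):

  USA          ->  United States
  U.S.A.       ->  United States
  Etats-Unis   ->  United States
  Deutschland  ->  Germany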

Products and sales

The other main data source we have is our central products and sales database. Let's take a look at the schema:

The database is structured as a star schema, with the main sales fact table and the areas, dates, ordermethods and products dimension tables. Each table contains the following information:

  • Sales: This is a fact table that contains one entry per individual item transaction. It has an integer ID for the sale; the rest of the fields are foreign keys to entries in the other dimension tables (dates, products, ordermethods and areas), plus the ID of the marketing promotion this item was sold under, if any (null otherwise); see the sketch after this list.

  • Areas: the areas dimension table contains information about the countries where our company operates.

  • Dates: a dimension table with IDs for all the dates that are referred to by events in our database.

  • Ordermethods: a dimension table describing the methods used to order the product in a sale.

  • Products: the main product reference in our database; it contains some generic product data.
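To make the structure concrete, here is a rough SQL sketch of the fact table as described above (the column names are illustrative; the actual names in the sample database may differ):

-- Sketch of the sales fact table; column names are illustrative.
CREATE TABLE sales (
    saleid        INT PRIMARY KEY,  -- integer ID for the sale
    dateid        INT NOT NULL,     -- foreign key to dates
    productid     INT NOT NULL,     -- foreign key to products
    ordermethodid INT NOT NULL,     -- foreign key to ordermethods
    areaid        INT NOT NULL,     -- foreign key to areas
    promotionid   INT NULL          -- marketing promotion; null for organic sales
);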

You can find all the data in this zip file, which contains the Excel spreadsheet and a .sql file that you must load into a MySQL server (see the Data Virtualization Basics tutorial for more information about setting up your MySQL instance).
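If you are loading the .sql file with the mysql command-line client, the steps look something like this (the database and file names below are placeholders; adjust them to match the contents of the zip):

-- Run from within the mysql command-line client; names are placeholders.
CREATE DATABASE salesdb;
USE salesdb;
SOURCE /path/to/salesdb.sql;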

Overview of the Process

Let's organize all this data in a coherent view of the marketing promotion performance. The process will look like this:

  1. Connect Denodo to the data sources, to retrieve the data in real time.
  2. Combine the data performing the needed transformations and normalizations.
  3. Create our final reports ready for consumption.
  4. Publish the data for the client application.

To organize our virtual objects, we are going to create a database, named "bitutorial", and some virtual folders. Go ahead and create four folders, named "01-sources", "02-base views", "03-data model" and "04-reports".

Connecting to data sources

The first step is connecting to the data sources. As we saw in the Basics tutorial, Denodo stores just metadata: no data is imported from the source (unless we explicitly tell Denodo to do so, through caching); instead, the data is retrieved in real time. We have two different sources, the Excel file and the database.

Database

The first connection will be to our database of sales and products. Since it runs in a MySQL instance, we need to right-click over the "01-sources" folder and select New > Data source > JDBC. That opens the JDBC data source dialog:

Here we do the following:

  • Type the name of the datasource, in our case "salesdb".
  • Select MySQL 5 as the database adapter.
  • Set the database URI. In our example the database server is running on localhost and the default port, so our URI is jdbc:mysql://127.0.0.1:3306/salesdb. Modify these settings as needed for your environment.
  • Set the login and password for your db.
  • Click on "Connection Pool configuration", and in the new dialog make sure "Test connections" is selected. Click "Ok".

  • Click Save.
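For reference, the same data source can also be created with a VQL statement; the following is only a sketch, as the exact syntax and driver class name can vary between Denodo versions:

-- Approximate VQL; adjust the URI, credentials and driver class to your setup.
CREATE DATASOURCE JDBC salesdb
    DRIVERCLASSNAME = 'com.mysql.jdbc.Driver'
    DATABASEURI = 'jdbc:mysql://127.0.0.1:3306/salesdb'
    USERNAME = 'user'
    USERPASSWORD = 'password';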

After these steps we have created our data source for the sales database. When we click on Create base view, Denodo will show us the introspection screen where we will select which tables we are going to import.

Go ahead and select all the tables under the "sales" database, and click Create selected. Below you can see all the schemas of the views that we are creating:

Excel marketing file

The second connection will be to the Excel file. This file contains two worksheets as discussed earlier; we have to consider them separately so we will create two different connections to the source, one for each worksheet.

To create the first one, we right-click over the "01-sources" folder and select New > Data source > Excel. In the following dialog:

  • Type the name of the datasource (for example, marketing).
  • Select the type of file, in this case Excel 2007 or later.
  • Choose the file location: select "Local" for a file in your local filesystem, and then click on "Configure", "Browse" and use the browse dialog to choose the file you downloaded.
  • Work sheet: the name of the worksheet is "Marketing promotions", without the quotes.
  • The start cell is A1, and end cell is I43.
  • Finally, make sure you have "Has headers" and "Stream tuples" selected.

Click Save to create the data source. Let's create a base view from it, by clicking on Create base view.

You will see the schema of the new view; change the data types as shown in the following screenshot:

and click Save to create the final base view.

Now it's time to create the base view for the second worksheet. The process is almost identical to the first worksheet. Start by right-clicking over the "01-sources" folder and selecting New > Data source > Excel, and configure the next dialog in the same way as before, except:

  • The worksheet is named "Country lookup".
  • The start and end cells are A1 and B36.

After clicking Save we have our data source ready:

And by clicking on Create base view we access the base view editing dialog, where we will edit the final view as follows:

Click Save. Now we have our two Excel views in addition to the views coming from the database.

We are almost ready, but let's first move the base views to the "02-base views" folder so they are well organized. Just select all of them and drag them to the destination folder.

With the data sources and base views in place, we can start combining all the data. The first step is to build a denormalized, standardized view of each data source, so we get a set of base business entities that we will use later to build more complex reports.

Cleaning the Marketing names

First let's take a look at the marketing data. The main problem we have, as discussed earlier, is that the country names are not standardized due to a manual input process. To solve this we have the marketing lookup table ready, so let's go ahead and combine these sources:

  • Right-click over the "03-data model" folder and select New > Join.
  • Drag the "marketing" view onto the workspace.

  • Drag the "marketinglookup" view onto the workspace.

  • Connect the "area" column from the marketing table to the "marketingarea_0" column of the marketing lookup table.

  • In the Output tab, remove the "area" and "marketingarea_0" columns.

  • Rename the view to "marketing_full" and click Save.

With these steps we have created a new view that contains the standardized name for each country, instead of the incorrect ones that we originally had in the Excel file. To check that everything is working properly, click on Execute:

and then on Execute again:

We see that the data is being returned correctly and that the country names now follow our internal naming conventions.
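Conceptually, the view we just built is equivalent to a query along these lines, where standardarea is an assumed name for the lookup sheet's standardized-name column (use the actual column name from your base view):

-- Conceptual VQL behind marketing_full; "standardarea" is an assumed name.
SELECT l.standardarea AS country, m.*  -- minus the raw "area" column
FROM marketing m
  INNER JOIN marketinglookup l
  ON m.area = l.marketingarea_0;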

Denormalizing the star schema

The next step to prepare our data for meaningful reporting is to denormalize the star schema we have in our database, so it's easier for users of this data to create derived views aggregating or summarizing it. Remember, we have one fact table (sales) and four dimension tables (dates, products, ordermethods and areas).

We will do this transformation by using a five-way join:

  • Right-click over "03-data model" and select New > Join
  • Drag onto the workspace the views sales, dates, products, ordermethods and areas.
  • Connect the following columns by using drag & drop:

After connecting the four fields we should have the following state:

  • Now move to the output tab. First rename the view to "sales_full", then remove the duplicate fields and rename the fields that we are keeping so the schema looks like this:

Click Save to create the view. Now we have a view that contains the full sales data in an easy-to-process, denormalized format. We are ready to connect the data coming from both sources.
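For reference, this is the classic star-schema flattening query; the sketch below assumes key and attribute names based on the schema described earlier:

-- Conceptual VQL behind sales_full; column names are illustrative.
SELECT s.saleid, d.saledate, a.country, o.ordermethod, p.productname,
       s.promotionid AS promotion
FROM sales s
  INNER JOIN dates d ON s.dateid = d.dateid
  INNER JOIN products p ON s.productid = p.productid
  INNER JOIN ordermethods o ON s.ordermethodid = o.ordermethodid
  INNER JOIN areas a ON s.areaid = a.areaid;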

Creating final reports

With the base business entities in place we can begin to answer business questions with the data.

Combining sales and marketing data

The first step is to combine the two main business entities that we have created so far, marketing and sales data, so we can analyze the correlation between the sales information and specific marketing promotions.

This is again a join, so follow these steps:

  • Right-click over the "04-reports" folder and select New > Join
  • Drag the views "sales_full" and "marketing_full" onto the workspace.
  • Connect the "sales_full.promotion" column to the "marketing_full.promotionid_0" column, using drag & drop.

  • Now comes a very important part: since the promotion field in the sales table is non-null for sales made under a promotion but null for organic sales, we need to change the join type so it does not filter out the organic sales. Assuming that you dragged the "sales_full" table first, we will do this by using a left outer join.

Just move to the Join conditions tab and then click on "Inner join", then "Change join type", then "Left outer". This will ensure we receive all of the sales rows.

  • Last, in the output tab rename the view to "full_sales_report" and remove the duplicate fields so we are left with the following schema:

Clicking Save will create the view. This view contains the full sales and marketing data; when we execute it, Denodo retrieves the information from all the sources in real time and combines it to provide the full view of the data.
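In query form, the important detail is the LEFT OUTER JOIN, which keeps the organic sales (those with a null promotion) in the result; a rough sketch:

-- Conceptual VQL behind full_sales_report.
SELECT s.*, m.*
FROM sales_full s
  LEFT OUTER JOIN marketing_full m
  ON s.promotion = m.promotionid_0;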

Creating the final business reports

Once the full sales and marketing data is combined and we have a comprehensive, fully denormalized view of all our data, we can create the final report that we will use for answering the questions asked by our business requirements. What we want to get is a view of the total monthly sales under each promotion, split by country, for both organic sales and sales under marketing promotions.

We do this by using an aggregation:

  • Right-click on the "04-reports" folder and select New > Selection.
  • Drag the "full_sales_report" view onto the workspace.

  • Now, move to the Group by tab and select "Use group by". With the group by enabled, we will add the following fields as group by fields:
  • full_sales_report.country (selected from the list on the left side).
  • A new group by expression with the following specification:

case when (description is null) then false else true end

  • Another two group by expressions:

getyear(saledate)
getmonth(saledate)

With these group by fields we are aggregating all the monthly sales per country, divided into two groups: one for the promotion sales and one for the organic sales.

  • Let's move to the Output tab. Here, rename the view to "monthly_sales_country" and add the following aggregation expressions:

  • And finally click Save to create the view.

If we execute it we will get these results:

You can see that for each country and each month we have two rows, one with value true and one with value false in the "in_promotion" column, along with the total number of sales for each.
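Putting the group by fields and the aggregation together, monthly_sales_country corresponds roughly to the following query (here totalsales is assumed to be a simple row count, and the output column names are illustrative):

-- Conceptual VQL behind monthly_sales_country.
SELECT country,
       CASE WHEN (description IS NULL) THEN false ELSE true END AS in_promotion,
       GETYEAR(saledate) AS saleyear,
       GETMONTH(saledate) AS salemonth,
       COUNT(*) AS totalsales
FROM full_sales_report
GROUP BY country,
         CASE WHEN (description IS NULL) THEN false ELSE true END,
         GETYEAR(saledate), GETMONTH(saledate);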

Now we will create a summarized version of this report that aggregates the information per geographical area (North America and Europe). We could easily do this in any of the usual BI tools, but doing it in the virtualization layer means that it will be available to all client applications, so we don't have to replicate this logic across different tools.

  • Start by creating another selection view in the folder "04-reports".
  • Drag the "monthly_sales_country" onto the workspace.

  • Move to the Group by tab, check "Use group by" and then select all the fields except country and totalsales as the group by fields:

  • In the output tab, rename the view to "monthly_sales_area" and add one aggregation field:

  • Finally, click Save.

Executing this view shows a result similar to the country sales, but now we have far fewer rows, as we are aggregating all the countries in each area together.
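In query form, this rollup looks roughly like the following, assuming monthly_sales_country also carries the geographical area as a column (called area here; the year and month column names are likewise illustrative):

-- Conceptual VQL behind monthly_sales_area; "area" is an assumed column name.
SELECT area, in_promotion, saleyear, salemonth,
       SUM(totalsales) AS totalsales
FROM monthly_sales_country
GROUP BY area, in_promotion, saleyear, salemonth;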

Publishing the data for clients

In this example we are going to access our report from Tableau. This BI tool uses ODBC to access Denodo and sends it standard SQL queries. By default, Denodo offers all the views (base or derived) through JDBC and ODBC automatically; manual publishing steps are only needed when creating web services.

So in this case, this step is trivial: we do not need to do anything to publish the report; we only need to consume it from Tableau.

All that is left is to access the views we created in the Denodo Platform from a client tool. In this specific example we will use Tableau, but any standard tool will be able to access Denodo through the JDBC, ODBC or Web Services interfaces.
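Because the published views behave like ordinary relational tables, any JDBC or ODBC client can query them with plain SQL, for example:

SELECT * FROM monthly_sales_country;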

Creating a report in Tableau

Connecting to Denodo

Tableau provides a Denodo Connector with which views from Virtual DataPort can be accessed and reports generated. To connect to Denodo, just follow these steps:

  1. Create a new data source using the Denodo Connector.
  2. Provide the server details and user credentials to connect to the database.
  3. Click Sign in.
  4. Select the views and drag and drop them onto the workspace. We have two: monthly_sales_country and monthly_sales_area.

Creating the report

Once the data connection is live, we can use the standard features from Tableau to create a dashboard that highlights the performance of the marketing promotions over time, split by country, split by area and total across all regions:

With this report at hand, which highlights the promotion-driven sales in orange versus the organic sales in blue, we can measure the performance of each marketing campaign per country and per month. At a glance we can see that the campaigns in Germany underperform compared to the rest of the countries, that the end of the year was a great performer in Canada, and that the first marketing promotion of the year in the UK performed outstandingly. All of this thanks to the straightforward way of bringing data together that Denodo provides.

In this tutorial we have seen how a business question that has traditionally been difficult to answer can be solved with Denodo quickly and easily. Not only that, but implementing the solution with Data Virtualization gives us a live report: since we have not imported any data but retrieve it in real time, the report will always be up to date as our sales database grows and new marketing data becomes available.

On top of that, because we have created semantically relevant business entities within our virtualization layer, deriving new reports or entities is trivial, so we can continue building views and adding value on top of disparate data. The fact that the data sources are physically distributed and come in a variety of formats does not affect how easily we can build new analysis infrastructure on top of them. The result is a centralized, intelligent framework that prevents duplication of business logic and brings out the real value of the data within your organization.

This concludes the Agile BI tutorial - if you want to continue exploring what the Denodo Platform can do for you, take a look at the official documentation and play with your own use cases.