DATA SCIENCE & MACHINE LEARNING

Welcome to the Data Science and Machine Learning Tutorial!

In this tutorial you will build and deploy a machine learning algorithm aimed at predicting the energy consumption of buildings, with the help of various components of the Denodo Platform: the Data Catalog, the Web Design Studio, the Virtual DataPort Server and the notebook environment Apache Zeppelin for Denodo. Most of the data and the scenario are taken from the ASHRAE - Great Energy Predictor III competition, which was hosted on the well-known data science contest site Kaggle in late 2019.

Denodo data virtualization can play a very important role in the development, management and deployment of data science and machine learning initiatives.

Starting from data exploration and analysis of all the internal and external data assets (Apache Zeppelin for Denodo and Data Catalog), through a powerful data modeling interface (Web Design Studio) and an advanced optimized query execution engine (Virtual DataPort Server), and ending with a flexible data publication layer, Denodo can enter the equation in several phases of your machine learning projects.

Throughout this real-world use case, we hope that you will grasp the full potential of the Denodo platform in:

  • Graphically exploring the data, discovering associations and labeling data in a business-friendly way in the Data Catalog.
  • Performing advanced data analysis with the Denodo query language or your preferred one (e.g. Python or R) in Apache Zeppelin for Denodo.
  • Logically cleaning, combining and preparing data from heterogeneous data sources in the Web Design Studio.
  • Fetching data from Denodo views into your machine learning runtime for model validation, training and persistence.
  • Building a prediction publication system that gets the predictor values consistently and in real time.

In total we have 6 tasks and a conclusion; here is a synopsis of each task:

Task 1 - Data Modeling in Web Design Studio (part I)

  • First steps with Web Design Studio.
  • Build a data transformation flow in Denodo: data source abstraction, data modeling tools and techniques.
  • Some considerations on Denodo optimizer: branch pruning.

Task 2 - Data Tagging And Exploration With Data Catalog

  • Create a category structure that allows organizing the views in a business-friendly manner.
  • Create a tag to group together all the views related to the project.
  • Associate a category and a tag to one or several views.
  • Build a query that combines different views.

Task 3 - Exploratory Data Analysis in Zeppelin for Denodo

  • First steps with Apache Zeppelin for Denodo.
  • Developing and running VQL queries against the Virtual DataPort Server.
  • Display the results in a tabular format or in charts.
  • Understand data model and sketch an action plan for integration of data for training table preparation.

Task 4 - Data Modeling with Web Design Studio (part II)

  • In Web Design Studio, create the join of the tables needed to get the data prepared for algorithm ingestion.

Task 5 - Machine Learning Modeling: Validation, Training and Persistence

  • Import data from Denodo into Python.
  • Use mainstream Python analytical libraries to do feature engineering, model validation and training, as well as grid search.
  • Persist the model to disk for predictions publication.

Task 6 - Making a Prediction Data Service

  • Understand and test the Python-based machine learning prediction service.
  • Build a data transformation flow in the Web Design Studio that joins the view on the Python-based machine learning prediction service back into the main data preparation flow built in Task 4.
  • Create a REST data service that serves the predictions to consuming applications in JSON format, with some mandatory parameters.

Conclusion

Introduction

This document explains how to create an execution environment for the Data Science and Machine Learning Tutorial that you are about to start.

It uses Vagrant as the provisioning platform and VirtualBox as the hypervisor.

Once the environment is provisioned, you'll have a running virtual machine (VM) with the Denodo Platform installed, as well as the data sources needed to do the tutorial exercises.

The virtual machine can be managed independently from Vagrant, like any other VM. This means that you can stop and reboot it as you like, with your modifications persisted in it.

Requirements

Hardware

The VM should be assigned 12 GB of memory and about 25 GB of storage.

Software

You can use the latest versions of Vagrant and VirtualBox. The VM has been tested in the following environment:

  • Runtime provider: VirtualBox 6.1.18
  • Provisioning platform: Vagrant 2.2.4
  • Host system: Windows 10 Pro

Installing via Vagrant provisioning

Process Overview

Once you start the provisioning, Vagrant downloads to your machine a canonical virtual machine image to base the provisioning on. It then boots that virtual machine as a VirtualBox guest and runs the script specified in vagrant/install-files/setup.sh, which performs all the installation and configuration operations needed to get the environment ready to use.

Steps

  1. Download and install Vagrant. Instructions here.
  2. Download and install VirtualBox. Instructions here.
  3. Download the dstutorial release archive denodo_tutorial_data_science.zip and extract it. This archive contains two files, described in the table below:
  • dstutorial-release-20210428.zip
  • denodo-systemd-services-release-20210408.tar
  4. Extract dstutorial-release-20210428.zip and navigate to the folder dstutorial-release-20210428.
  5. Copy the following files into vagrant/install-files/artifacts:

Name of file | Description
Apache Zeppelin for Denodo - Standalone.zip | Apache Zeppelin for Denodo installer; the version must be 20210113
denodo.lic | Denodo Standalone license
denodo-install-8.0-ga-linux64.zip | Denodo Platform installer for Linux
denodo-v80-update-20210209.zip | Denodo Platform update 8.0-20210209
denodo-systemd-services-release-20210408.tar | Tar archive of the setup for the Denodo systemd services (already downloaded in step 3)
dstutorial-release-20210428.zip | Zip archive of the dstutorial repository (already downloaded in step 3)
mysql-connector-java-8.0.20.zip | The MySQL Connector for Java, to be downloaded from here

  6. Create a VirtualBox Host-only network adapter with IPv4 Address/Mask 192.168.140.1/24, if one is not already defined with this IPv4 Address/Mask.
  • In VirtualBox, go to File -> Host Network Manager, then define a new Host-only network adapter with the following settings:

    Adapter
    • Name: VirtualBox Host-Only Ethernet Adapter
    • IPv4 Address: 192.168.140.1
    • IPv4 Network Mask: 255.255.255.0

    DHCP Server
    • Enable Server: no

If you want to use an existing Host-only network adapter, you will need to add the property name and change the property ip of config.vm.network in the file vagrant/Vagrantfile accordingly. Networking configuration documentation is available here.

  7. Navigate to the folder vagrant (under dstutorial-release-20210428). You'll see that there is a file called Vagrantfile, which is the configuration file for the provisioning.
  8. Open a command prompt and run:
    vagrant up > provisioning.log

    This command stores the provisioning log in a file, provisioning.log, so you can monitor that everything is going as expected. It is valid if you are using the Git Bash terminal or the Windows Command Prompt (CMD). The provisioning process lasts between 20 and 35 minutes, depending on your hardware and Internet bandwidth. If you open the VirtualBox Manager during the provisioning, you'll find a machine called denodo.dstutorial.com being created and booted.
  9. When the vagrant up command ends, it returns the cursor. Check that the tail of provisioning.log contains the following lines:

    default: check Meter Readings db (postgresql): Product Name : PostgreSQL
    default: Product Version : 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1)
    default: check Weather db (mysql): Product Name : MySQL
    default: Product Version : 8.0.23-0ubuntu0.20.04.1
    default: check Building location (xlsx over sftp): OK
    default: check Holidays data sources (web services): OK
    default: check file GHCN file (csv over sftp) site0_daily: OK
    default: check file GHCN file (csv over sftp) site2_daily: OK
    default: check file GHCN file (csv over sftp) site4_daily: OK
    default: check file GHCN file (csv over sftp) site13_daily: OK
    default: check file GHCN file (csv over sftp) site15_daily: OK
    default: check file GHCN file (csv over sftp) site0_monthly: OK
    default: check file GHCN file (csv over sftp) site2_monthly: OK
    default: check file GHCN file (csv over sftp) site4_monthly: OK
    default: check file GHCN file (csv over sftp) site13_monthly: OK
    default: check file GHCN file (csv over sftp) site15_monthly: OK
    default: check final prediction web service on dstutorial_sample_completed (v1): 200
    default: check final prediction web service on dstutorial_sample_completed (v2): 200
    default: This machine has IP: 192.168.140.100
    default: You may want to add it, to your local hosts file with name denodo.dstutorial.com
    default: provisioning started: Thu Apr 8 11:09:54 UTC 2021
    default: provisioning ended: Thu Apr 8 11:43:14 UTC 2021

  10. Add the entry for the virtual machine IP to the hosts file. In Windows the file is
    C:\Windows\System32\drivers\etc\hosts:
    ## Machine hosting the Data Science Tutorial Environment
    192.168.140.100 denodo.dstutorial.com


    The IP must match the one specified in config.vm.network in the Vagrantfile.

At this point you can already connect to the Denodo applications deployed in the VM.

If you want to start again from a clean virtual machine, you should:

  1. Turn off your VirtualBox guest.
  2. Navigate to folder vagrant.
  3. Run vagrant destroy -f. This command erases your VirtualBox guest.
  4. Run vagrant up > provisioning.log.

Appendix

Useful Technical Information

  • The guest OS is Ubuntu 20.04.2 LTS.
  • If you need to access the virtual machine via ssh/sftp, you can use the user denusr with password denusr. This is an administrator (sudo) user.
  • Denodo installation is under /opt/denodo/8.0.
  • All the services are deployed as systemd services, including the Denodo ones. They are all set up to autostart at boot time. The user that runs the services is denusr.

Service Name | Systemd unit file
Virtual DataPort Server | denodo_vdp
Web Design Studio | denodo_design_studio
Data Catalog | denodo_data_catalog
Apache Zeppelin for Denodo | zeppelin
ML Web Service | flask_mlpred_rest
Holiday Web Service | flask_holiday_rest

Alongside these services run those for the data sources (PostgreSQL, MySQL and MongoDB), installed from the standard distribution software repositories.

Tutorial Data Overview

The data to be used in the tutorial is distributed in several data sources.

In the table below you can see an overview of what the data sources are, their approximate size and their format.

The time-labeled target variable, meter_reading, is in the PostgreSQL training dataset.

All remaining data assets will be used to extract and engineer features to be leveraged as predictors for the learning algorithm:

  • building metadata (size, number of floors, use ...)
  • weather data at the hourly, daily and monthly level, each one with different indicators.
  • holidays

Description | Origin | Size | Source DB | Host
Training dataset | contest | 1 GB, ~20M rows | PostgreSQL | postgres.dstutorial.com
Test dataset | contest | 3 GB, ~42M rows | PostgreSQL | postgres.dstutorial.com
Building metadata | contest | 45 KB | MongoDB (JSON) | mongo.dstutorial.com
Weather for test data | contest | 19 MB, ~280K rows | MySQL | mysql.dstutorial.com
Weather for training data | contest | 10 MB, ~140K rows | MySQL | mysql.dstutorial.com
Site-to-geolocation mapping [1] | contest forum | 11 KB | XLSX over SFTP | sftp.dstutorial.com
Daily and monthly weather data from stations close to the sites [2] | National Centers for Environmental Information | 20 MB | CSV over SFTP | sftp.dstutorial.com
Time-related features (month, quarter, year, ...) | TPC-DS Kit | 10 MB | PostgreSQL | postgres.dstutorial.com
Whether a day is a holiday in a given location | Generated from the PyPI Holidays project | ~10 KB | Web service | ws.dstutorial.com

[1] The site locations have been obtained from here. They were not provided by the competition organizers but were inferred by the competitors. We include only those sites that were labeled as confirmed.

[2] Weather data has been obtained by taking the data from all the stations in the site's county, at both the monthly and the daily level. The data has been downloaded from here (Climate Data Online - CDO) and, for site 2, from here. The two links seem to provide the same GHCN (Global Historical Climatology Network) data.

Throughout this tutorial, you will use the Web Design Studio for all the data preparation tasks.

The Web Design Studio is a web application that allows the developer to connect to data sources, define logical views to their data (base views in the Denodo nomenclature), transform them with relational operations (projections, unions, joins ...) and organize them in a meaningful way in databases and folder-based hierarchies. It is very similar in functionality to other popular SQL clients such as DBeaver, SQLDeveloper or MySQL Workbench. To access the Web Design Studio, open your browser and go to the following link:

http://denodo.dstutorial.com:9090/denodo-design-studio/?uri=//localhost:9999/#/

Login: dstutorial_usr

Password: dstutorial_usr

Once logged in, on the left-hand side you will see the Server Explorer with two databases, dstutorial and dstutorial_sample_completed. You will work exclusively in the former and use the latter only if you want to go directly to the solution for a given task. Clicking on the expand (collapse) icon expands (collapses) a database. Do that for dstutorial and you'll see that the folder structure is already created for you.

If you drill-down to 01_connectivity/01_data_sources you will see the following folders:

  • building: this folder stores the data source definition for the building locations (ds_building_location) and other building metadata (ds_building)
  • date: this folder stores the data source definition pointing to the holidays web service (ds_holidays)
  • meter_reading: this folder stores the data source definition for the meter readings, the target variable that we want to predict (ds_meter_reading)
  • weather: this folder stores:
  • the data source definition for weather data provided in the contest (ds_weather)
  • an additional folder, ghcn, that stores the data sources pointing to the files downloaded from the National Oceanic and Atmospheric Administration: ds_site{00,02,04,13,15}_{monthly,daily}.

The same folder structure is found also under 01_connectivity/02_base_views, but instead of containing the data sources, the sub-folders contain the base views, that is metadata-only views on the data we are interested in.

If you open for example the view bv_meter_reading_test in folder /01_connectivity/02_base_views/meter_reading (you can open it with a double click), you'll see the view definition (column names, types, descriptions), and Denodo-related metadata (creation and last modification time, folder, data source, database ...).

One of the most basic actions you can perform on Denodo views is to see their data: you can do that by opening the Execution panel and clicking Execute:

The Denodo engine generates the SQL-like query (VQL, Virtual Query Language in Denodo) and sends the corresponding query to the data source, so that it can provide the result set to be exposed in the virtual layer. The data source then sends the results back to Denodo, which displays a sample to the user.

Once the data is abstracted in Denodo, that is, once a data source and a base view are defined, you can query it with a unified query language, the already introduced VQL, regardless of whether the underlying data source supports SQL fully, partially or not at all, or uses another type of query language. Using the bv_building_location view, we may be interested in knowing how many buildings there are per state. To do that, you can open a VQL Shell (Tools->VQL Shell) and write your VQL query:

SELECT COUNT(*) AS n_buildings, state FROM bv_building_location GROUP BY state ORDER BY n_buildings DESC;

It is worth highlighting what we have just done: analysing data stored in a remote Excel file using SQL-like syntax. That's powerful!

Let's now focus on integrating the GHCN (Global Historical Climatology Network) data files into the data preparation workflow. The rationale is that, even though we already have some weather information (to get an idea of what is available, you can open and run the views bv_weather_test and bv_weather_train), GHCN data may improve the accuracy of our prediction model by adding more advanced meteorological indicators (e.g. weather type, snowfall, snow depth), measured at the stations closest to the buildings and with two different aggregation periods (daily and monthly). We have downloaded the data from the NOAA website and stored it in CSV files on an SFTP server hosted at sftp.dstutorial.com.

Out of the five sites for which the location is known (you can get the list by running: SELECT site_id, county FROM iv_building_location WHERE location_confirmed = 'Y' AND state = 'United States'), only the monthly data for Cornell University (site 15) is missing.

To integrate this file in the data pipeline, we are going to perform the following steps:

  • Create a data source called ds_site15_monthly that points to a csv file.
  • Create a base view called bv_site15_monthly on top of ds_site15_monthly.
  • Create a derived view called iv_site15_monthly as a projection from bv_site15_monthly to add a field indicating the origin of the records, as we are going to append this data to all other sites data.
  • Create a derived view iv_site15_monthly_01_partitioned as a selection of iv_site15_monthly with filter site = '15'. This allows the Denodo optimizer to prune this branch when appropriate.
  • Add this view iv_site15_monthly_01_partitioned to the union view called iv_all_sites_weather_monthly that already contains the monthly data from the other sites.
  • Generate the statistics of iv_all_sites_weather_monthly. Discuss and assess data quality.
  • Create an aggregation view on iv_all_sites_weather_monthly called iv_all_sites_weather_monthly_grp_by_site. This view will be used in the subsequent steps of the data integration flow.

Let's get started with the creation of the data source. Go to /01_connectivity/01_data_sources/weather/ghcn/, open the folder's contextual menu and select New->Datasource->Delimited File.

Fill in the data source parameters:

  • Name: ds_site15_monthly
  • Data route: FTP / SFTP / FTPS Client
  • Login: sftpuser
  • Password: sftpuser
  • URL: sftp://sftp.dstutorial.com//data/dstutorial/ghcn_weather/site15_monthly.csv
  • Column delimiter: ,
  • Header: YES
  • Ignore matching errors: NO

After saving, test the connection. You should get a success message.

We are now able to create the base view bv_site15_monthly by clicking on Create Base View in the data source ds_site15_monthly we have just created. When doing so, Denodo creates a base view called ds_site15_monthly in the same folder as the data source. Let's rename it to bv_site15_monthly and move it to /01_connectivity/02_base_views/weather/ghcn: click on Edit and change the property View name to bv_site15_monthly. Also, starting from the field cdsd downwards, change the type of all fields without the suffix _attributes to decimal; this is done in the View Schema panel by changing the Field Type. When you are done, move to the Metadata tab and change the folder as specified above.

One last action is needed before saving: computing the statistics of the view. The statistics of a view consist of a few important indicators about its data: number of records, minimum and maximum values of numerical fields, number of distinct values and number of missing values. These numbers are essential for the cost-based optimizer to identify the most efficient execution plan. It is generally recommended to compute and maintain statistics for all base views, so let's do that for bv_site15_monthly. In the view configuration panel, click on OPTIONS->Statistics, then Enable Statistics. All the fields are automatically selected; now click on Gather statistics for selected fields. The statistics are calculated and displayed.

Now, back to the SUMMARY tab, you can inspect the metadata and execute the view, as we did before.

The next step is to define a projection view that adds a constant field site = '15'. Select the base view just created, then New->Selection. A new tab will be displayed for the creation of the projection view. Specify the following values for the required properties:

  • Name: iv_site15_monthly
  • Folder: /02_integration/weather/ghcn/monthly

To define a new field, go to the Output tab, click on New->New Field, then define the new field in the pop-up window.

Then save. The projection view is defined and executable.

In a very similar way, let's go on with the creation of the view iv_site15_monthly_01_partitioned.

Select the view iv_site15_monthly, then New->Selection. This time no fields will be added; instead, we will define a filter on the newly created field, site. This filter is defined in the Where Conditions tab of the view.

In the Output tab you can set the name iv_site15_monthly_01_partitioned, then save.
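For reference, the logic of these two derived views is roughly equivalent to the following VQL (a sketch: the Web Design Studio generates the actual definitions for you, typically with explicit field lists instead of *, and the FOLDER clauses are omitted here):

-- Projection that adds the constant field identifying the origin of the records
CREATE OR REPLACE VIEW iv_site15_monthly AS
SELECT *, '15' AS site
FROM bv_site15_monthly;

-- Selection filtered on the new field, so the optimizer can later prune this branch of the union
CREATE OR REPLACE VIEW iv_site15_monthly_01_partitioned AS
SELECT *
FROM iv_site15_monthly
WHERE site = '15';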

It is time now to append the GHCN data from site 15, contained in the views just created, to the union that contains the data from all other sites.

This union view is called iv_all_sites_weather_monthly and is defined as the union of the following views:

  • iv_site00_monthly_01_partitioned
  • iv_site02_monthly_01_partitioned
  • iv_site04_monthly_01_partitioned
  • iv_site13_monthly_01_partitioned

Now, we are going to add iv_site15_monthly_01_partitioned to the union.

Let us open iv_all_sites_weather_monthly and move to the Model (extended union) panel, where we can drag and drop the newly created view iv_site15_monthly_01_partitioned. You'll notice that Denodo automatically creates the associations for fields with matching names. If a field does not exist in one of the input views, it will nonetheless be included in the union view, with missing values where appropriate.

After dragging and dropping the view in the model area, you can now save the modified union view.

We recommend running a quick data check now: a count on the union view should return 6558 records, and the view should have 71 columns.

To run this check, go to the Execution panel of the view iv_all_sites_weather_monthly, check the box Retrieve all rows, then execute the view.
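Alternatively, you can run the record count from a VQL Shell (the expected figure comes from the union of the five sites' monthly files):

-- Quick sanity check on the union view; expected result: 6558 records
SELECT COUNT(*) AS n_records
FROM iv_all_sites_weather_monthly;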

As the final step of our GHCN files data transformation flow, we are going to aggregate the data in the view that we have just modified by site and date, with average as the aggregation measure. However, we are not blindly selecting all the numerical indicators as aggregation candidates. Instead, we will only take into account the indicators whose percentage of missing values is lower than a given threshold. The aggregation and the selection of numerical indicators are motivated mainly by two reasons:

  • Aggregation preserves the cardinality of the training table, which is to be enriched with GHCN data. As we are going to join the GHCN data with the training table by site and date, we must avoid introducing duplicates, and aggregating on these fields is the simplest way to achieve that.
  • If you had a look at the data, you would have seen that there are a lot of missing values in the GHCN weather indicators. By aggregating and selecting the indicators with a low percentage of missing values, we hope to increase the overall quality of the data while keeping the number of indicators as low as possible (low complexity). The intuitive justification is that it is sufficient to have a valid record from just one weather station, instead of requiring all or most of them to have a valid value.

To this end, let's calculate the statistics for iv_all_sites_weather_monthly. Open the view, then Options->Statistics. In the top-left corner, enable the switch Enable Statistics, then launch the statistics gathering process by clicking on Gather statistics for selected fields.

If you have a look at the last column, there are a lot of missing values in the numerical indicators, so we shouldn't set a very low cutoff on the missing percentage; let's use 70%. The indicators with at least 30% of non-null values will survive.

To get the list of these indicators, we can leverage a Denodo stored procedure called GET_VIEW_STATISTICS that gives access to view statistics via SQL syntax:

SELECT field_name
FROM GET_VIEW_STATISTICS()
WHERE input_database_name = 'dstutorial'
AND input_name = 'iv_all_sites_weather_monthly'
AND null_values/cast(rows_number as decimal) < 0.7
AND field_name NOT LIKE '%_attributes'
AND field_name NOT IN ('site', 'station', 'name', 'latitude', 'longitude', 'date', 'elevation')
ORDER BY field_name

Copy and paste it in the VQLShell, then run it. It will return the following list of indicators:

dp01, dp10, dsnw, emsn, emxp, prcp, snow

To create the aggregation view, select the view iv_all_sites_weather_monthly, then New->Selection. In the wizard, go to the Group By tab and select the fields site and date. Then switch to the Output tab, rename the view to iv_all_sites_weather_monthly_grp_by_site and create a new aggregate field (New->New Aggregate Field).

Repeat this action with all the fields in the list above. Your aggregated view will have a total of 9 columns.
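For reference, the aggregation view you have just built is conceptually equivalent to the following VQL (a sketch: the wizard generates the actual definition, and the list of AVG fields must match the indicators returned by the statistics query above):

-- Average the retained monthly indicators per site and date
SELECT site, date,
  AVG(dp01) AS dp01,
  AVG(dp10) AS dp10,
  AVG(dsnw) AS dsnw,
  AVG(emsn) AS emsn,
  AVG(emxp) AS emxp,
  AVG(prcp) AS prcp,
  AVG(snow) AS snow
FROM iv_all_sites_weather_monthly
GROUP BY site, date;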

Now, let us review what we have done so far. We have virtualized the data of a specific building site coming from a remote CSV file reachable via SFTP, merged it with the data coming from all the other sites, and applied an aggregation to put the data in a format that can later be joined with the rest of the data available for the analysis. The reasons for the aggregation are outlined above, and all these data preparation steps are motivated by the hope of finding powerful predictors for the modeling task in the GHCN data.

The data transformation flow that we have built can be seen graphically in the Tree View of the final aggregation view. To display it, open the view and click on Tree View: in the displayed tree, the nodes represent either views or transformations. The nodes can be expanded and collapsed, and the tree can be zoomed in and out.

The Tree View is available for all the views and it is a great way to visualize the data transformation logic in a single diagram.

One additional feature that we would like to highlight is the advanced query rewriting capability of the Denodo optimizer. More specifically, let us see how the Denodo engine rewrites a query when it includes a filter on specific sites, say site IN (0,15) (these correspond to the University of Central Florida, Orlando and Cornell University, Ithaca, respectively; recall that the site-to-location mapping is always available in the view iv_building_location). To do that, let's prefix the query with the DESC QUERYPLAN instruction, which generates the execution plan without actually executing the query, and then inspect the generated plan.

Open a VQLShell, write and run the following query:

DESC QUERYPLAN SELECT * FROM iv_all_sites_weather_monthly_grp_by_site WHERE site IN (0,15);

If you click on the top node, you'll see that the Denodo optimizer has applied Branch Pruning. This optimization ensures that the files whose records would all be filtered out by the condition site IN (0,15) are not read at all, leading to better execution plans and thus better performance.

As an example, compare the above execution plan with the execution plan of an equivalent, non-prunable query:

DESC QUERYPLAN SELECT * FROM dstutorial_sample_completed.iv_all_sites_weather_monthly_grp_by_site_noopt WHERE site IN (0,15);

We have now completed the data transformation flow that was the missing branch of the GHCN monthly data pipeline. We accomplished that using exclusively the Web Design Studio, the platform interface for data modeling and development, leveraging the built-in graphical wizards for the different operations (selection, filter, union and aggregation). The transformation logic, which has been applied in stacked layers of derived views, can be graphically displayed at each transformation node by opening the corresponding Tree View. We have also seen the branch pruning optimization that the Denodo engine applies when certain filtering conditions are met.

Let us now move on to the Data Catalog, a component of the Denodo Platform that exposes all the views and web services defined in the Denodo virtual databases in a web application. There, business users can graphically search, query in real time, combine, and organize the organization's data assets with tags and categories.

In the previous task we walked through the creation of a data transformation flow to read and integrate the GHCN monthly data from a specific site.

The focus of this task is to understand how we can leverage the Data Catalog to:

  • Define tag and category structures that match the functional meaning of the data assets.
  • Tag and categorize the data assets defined in the virtual layer with the freshly defined tags and categories.
  • Graphically build and run multi-view queries thanks to relational associations.
  • Deploy a query to the Virtual DataPort Server.

You start by signing in to the Data Catalog application. In your browser open the following url:

http://denodo.dstutorial.com:9090/denodo-data-catalog/Login

Then enter the username and password as specified below and click on Sign In:

Username: dstutorial_usr

Password: dstutorial_usr

Virtual DataPort Server: localhost - dstutorial

As a first step, define the categories ghcn - monthly and ghcn - daily with parent category weather. In the top menu bar, click on Administration->Catalog Management->Categories to open the Category Management window. There are already two categories defined; weather will be the parent of the ones to be created.

Click on the New Category icon and fill in the information as shown:

Proceed similarly with the category ghcn - monthly.

At the end of this step you should have the following category structure:

Now go on with the creation of a tag called predict-building-consumption. Click on Administration->Catalog Management->Tags to open the Tag Management menu. Once there, click on the New Tag icon, then fill in the information as shown below (as you can see, the description can be formatted in HTML).

As no tags were defined yet, the Tag Management menu should look like the following after the tag creation:

Now you can associate the views with the categories and the tag. The process is the same for tags and categories, so we show how to complete it for one category, and you can complete the remaining ones on your own.

Go back to the Category Management menu (Administration->Catalog Management->Categories), click on the edit icon of the category that you want to modify, then click on the button Add from Database/Folders tree. A new window will be displayed in which you can search and select the items that you want to associate with the category: for example, for the category ghcn - daily, searching for daily allows you to browse for the relevant views and other objects.

You can associate the remaining category and tag with the following search logic:

  • ghcn - monthly: search monthly in search dialog and select all the matching views
  • predict-building-consumption: select everything in database dstutorial

You can now use your categories and tags when searching your data assets. In the example below, we look for views with the tag predict-building-consumption, the category ghcn - daily and the token site00 in their name, field names or descriptions.

As already mentioned, the Data Catalog not only allows you to search the data assets defined in the Denodo virtual layer, but also lets you query them graphically, with no SQL skills required.

As an example, let's say that you want to extract the meter reading values (meter_reading), just for the first quarter of 2016 and for the electricity meter (meter=0), together with some building information such as the construction year (year_built) and the size (square_feet). Notice that although the information is not in a single view, you will be able to do this extract thanks to the view associations defined in the virtual layer, which permit multi-view queries.

So, to build your query, head to the Search panel, clear all the filtering criteria that may already be defined and then search for iv_meter_reading. By clicking on the iv_meter_reading view, you will be redirected to the view summary page, where you will find the view description, all the tags and categories it is associated with, and the schema of the view with field names, descriptions and types. If you jump to the Relationships tab, you will see that this view has two relationships, with the views iv_building and iv_weather. You can display the association details by hovering over the arrow.

Relationships, also called Associations, play an important role in several contexts of data modeling in Denodo; among the benefits they bring is enabling Data Catalog users to perform multi-view queries.

You can now move on and build your query. Go to the Query tab, then:

  • Drag and drop timestamp_in, meter_reading, year_built and square_feet into the Output Columns zone. Note that year_built and square_feet belong to iv_building, which you find in the Relationship Fields area.
  • In the Filters zone, click on the Add icon and enter the following expression:

getyear(timestamp_in) = 2016 and getquarter(timestamp_in) = 1 and meter = 0

Your query tab should now look like the following:

It is quite handy to be able to inspect the VQL code (just click on the icon), for example to share it with a Denodo developer or to understand query details, such as the join key in multi-view queries. You can now run the query to display the results in the same browser tab, or export them for further analysis in a third-party tool. You will need to choose the export format (CSV, Excel or Tableau) as well as the number of rows to be extracted.
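For reference, the generated VQL for this extract should look conceptually like the query below (a sketch: the code actually produced by the Data Catalog will differ in aliases and formatting, and the join key, presumably building_id, comes from the association between iv_meter_reading and iv_building):

SELECT iv_meter_reading.timestamp_in,
       iv_meter_reading.meter_reading,
       iv_building.year_built,
       iv_building.square_feet
FROM iv_meter_reading INNER JOIN iv_building
  ON iv_meter_reading.building_id = iv_building.building_id
WHERE getyear(iv_meter_reading.timestamp_in) = 2016
  AND getquarter(iv_meter_reading.timestamp_in) = 1
  AND iv_meter_reading.meter = 0;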

Once you have exported the data, click on Save and give the query a meaningful name and description.

There are several actions that you may want to do now on the saved query:

  • Modify and enhance it at a later moment.
  • Share it with peers, by generating a share url.
  • Deploy it back to the underlying Virtual DataPort Server. This is useful when you wish to trigger the industrialisation phase of the query or just make it available as-is to peers as a standard Denodo view.

You will now deploy the view to the Virtual DataPort Server. To do that, click on the My Query button on the top menu bar. In the Query Explorer, locate the target query and click on the deploy icon.

In the deploy menu fill in the required details (Virtual DataPort Server, credentials, view name and description and folder), then click Deploy.

  • Server URI: //localhost:9999/dstutorial
  • User: dstutorial_usr
  • Password: dstutorial_usr
  • Target Folder: (dstutorial)/08_deployed_queries
  • View name: readings_electricity_2016Q1
  • View description: Electricity meter readings for all sites in first quarter 2016 (deployed from Data Catalog)
  • Replace if exists: LEAVE BLANK

The query is now deployed; you can open, execute and reuse the view as needed in the Web Design Studio.

With the query deployment, you have completed the proposed tasks in the Data Catalog. We hope these exercises have shown its potential as a data exploration and discovery tool, featuring an advanced graphical query builder, keyword-based search capabilities, and categories and tags as flexible concepts for data asset governance.

The next task will focus on Apache Zeppelin for Denodo, which adds to the Apache Zeppelin community version a built-in Denodo interpreter and integrated authentication with the platform.

In Apache Zeppelin for Denodo, data scientists can leverage a notebook-based environment in which they alternate code paragraphs, to perform a computation or generate a plot, with free markdown-formatted text paragraphs, to write down notes or explanations of what they are doing. This makes it an ideal environment for exploratory data analysis, a core phase in preparing for machine learning algorithm training and validation.

The focus of this part is on exploratory data analysis, an essential preparatory activity for the training of the learning algorithm.

The Denodo Platform includes a notebook-like application, Apache Zeppelin for Denodo, that enables data scientists to build and run their data queries, perform advanced data analysis and easily plot the results.

In Apache Zeppelin for Denodo, you can alternate markdown-formatted text with executable code, using one of the available interpreters (among them the Denodo interpreter itself, Python, R and Spark): you can define your query in a code cell, run it, graphically build a plot with its results, and then write down the notes that explain what you did and your conclusions, to be shared with peers or revisited at a later time.

In this part you will:

  • Sign into Apache Zeppelin for Denodo.
  • Go through the Exploratory Data Analysis notebook that we have already created.
  • Create and run your own VQL queries, that leverage the predefined Denodo interpreter.
  • Build and customize visualizations using built-in plotting capabilities.

At the end of this part you will have a clear understanding of the data preparation steps that you need to complete to get the data ready for machine learning algorithm ingestion.

Let's get started by opening the Apache Zeppelin for Denodo url with your browser.

Username: dstutorial_usr

Password: dstutorial_usr

In the Welcome page, you will see that some notebooks have been defined.

The notebook that we are going to use for this part is 01-Exploratory-Data-Analysis (you can use 01-Exploratory-Data-Analysis-Completed if you have doubts and have a look at a working solution).

Click on 01-Exploratory-Data-Analysis to open it.

Now take the time to explore the main functionalities of the application:

  • The notebook is organized in paragraphs (or cells). An interpreter is associated with each paragraph, and the paragraph is executed with that interpreter.
  • The interpreter of a paragraph is defined by putting the interpreter name, prefixed with the percent (%) sign, in the paragraph's first line; for example, %denodo indicates that you can write and execute VQL code in that cell.
  • Several actions can be triggered from the top menu: among them, running all the cells of the notebook, exporting it and cloning it.
  • Cells can be executed from the menu at their top-right corner; additional options are available in the same menu.

Now that you are more familiar with the Zeppelin UI, you can start reading the notebook, executing the queries against the Denodo server and analysing the results.

The following short screen recording shows how to run a query, visualize the results in table format and transform the table into a simple line chart.
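For instance, a paragraph that counts the readings per meter type could look like the following (a sketch: it assumes the Denodo interpreter is pointed at the dstutorial database and uses the iv_meter_reading view referenced elsewhere in this tutorial):

%denodo
-- Count the meter readings per meter type
SELECT meter, COUNT(*) AS n_readings
FROM iv_meter_reading
GROUP BY meter
ORDER BY n_readings DESC;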

In this notebook we have explored the data available for the Machine Learning algorithm.

Thanks to Zeppelin interactive data analysis capabilities, we now understand the data much better both from a technical point of view (data types, volumes, join keys) as well as purely from a data science perspective (distribution of input variables, correlation with target variables ...). The analysis can be extended as desired with new queries and/or visualizations.

Let us stress that we performed our analysis against a single source, the Denodo server, which takes care of abstracting the underlying data sources, and with a single interpreter, the Denodo SQL-like language (VQL), whose engine chooses the most performant plan for query execution.

At this point, let's summarize the data preparation steps needed to get the data ready to be fed to the ML algorithm:

  • Join the meter readings data with the time-related data
  • join conditions: datepart(iv_meter_reading.timestamp) = iv_dim_date.d_date
  • variables to be kept:

column name | comment | transformations
d_day_name | Change name, remove prefix d_ | to be one-hot-encoded
d_weekend | Change name, remove prefix d_ | to be one-hot-encoded

  • Join the meter readings data with the building data
  • join conditions: iv_building_building_location_01.building_id = iv_meter_reading_01_date.building_id
  • join type: LEFT
  • variables to be kept:

column name | comment | transformations
site_id | to be used in join | N/A
building_id | informative variable | N/A
timestamp_in | to be used in join | N/A
meter | informative variable | N/A
row_id | informative variable | N/A
istrain | separate train and test data | N/A
square_feet | predictive variable | to be transformed to number
primary_use | predictive variable | to be one-hot-encoded
year_built | useful to create age, a potential predictive variable | to be transformed to number
floor_count | predictive variable | N/A
location_confirmed | used in filtering validated sites | N/A
state | used in filtering validated sites | N/A
meter_reading | target variable | N/A

  • Join the meter readings data with the hourly weather data
  • join conditions: iv_weather.site_id = iv_meter_reading_03.site_id AND previous hour(iv_weather.timestamp) = iv_meter_reading_03.timestamp
  • join type: LEFT
  • variables to be kept:

column name | comment | transformations
air_temperature | N/A | N/A
cloud_coverage | N/A | N/A
dew_temperature | N/A | N/A
precip_depth_1_hr | N/A | N/A
sea_level_pressure | N/A | N/A
wind_direction | N/A | cosine function
wind_speed | N/A | N/A

  • output view: iv_meter_reading_05_date_building_weather
  • Join the meter readings data with the GHCN weather data (this join is going to be merged with the previous one)
  • Join conditions with daily GHCN files: iv_meter_reading_03_date_building.site_id = iv_all_sites_weather_daily_grp_by_site.site_id AND datepart(iv_meter_reading_03_date_building.timestamp_in) = previous day(iv_all_sites_weather_daily_grp_by_site.date)
  • Join conditions with monthly GHCN files: iv_meter_reading_03_date_building.site_id = iv_all_sites_weather_monthly_grp_by_site.site_id AND monthpart(iv_meter_reading_03_date_building.timestamp_in) = previous month(iv_all_sites_weather_monthly_grp_by_site.date)
  • Join type: LEFT
  • Variables to be kept:

column name | comment | transformations
daily_grp_by_site.prcp | N/A | N/A
daily_grp_by_site.snow | N/A | N/A
daily_grp_by_site.tmin | N/A | N/A
daily_grp_by_site.tmax | N/A | N/A
daily_grp_by_site.snwd | N/A | N/A
daily_grp_by_site.tobs | N/A | N/A
daily_grp_by_site.awnd | N/A | N/A
daily_grp_by_site.wsf2 | N/A | N/A
daily_grp_by_site.wsf5 | N/A | N/A
daily_grp_by_site.tavg | N/A | N/A
monthly_grp_by_site.dp01 | N/A | N/A
monthly_grp_by_site.dp10 | N/A | N/A
monthly_grp_by_site.dsnw | N/A | N/A
monthly_grp_by_site.emsn | N/A | N/A
monthly_grp_by_site.emxp | N/A | N/A
monthly_grp_by_site.prcp | N/A | N/A
monthly_grp_by_site.snow | N/A | N/A

  • Join the training and test data with the holidays data
  • Join keys: last two characters of(iv_meter_reading_05_date_building_weather.location) = iv_holidays_state_cd.state
  • Join type: LEFT
  • Variables to be kept:

column name | transformations
is_holiday | CASE WHEN iv_holidays.date = FORMATDATE('yyyy-MM-dd', iv_meter_reading_05_date_building_weather.timestamp_in) THEN true ELSE false END

  • output view: iv_meter_reading_07_date_building_weather_holidays

To summarize, our action plan is to construct three joins:

  • The fact table iv_meter_reading_03_date_building with the hourly weather data, iv_weather.
  • The resulting view with GHCN daily and monthly data.
  • The resulting view with the holidays data.

Along the way, we will apply any field transformations and other operations that may be needed.

These tasks will be accomplished in the Web Design Studio, that we already know from a previous chapter, so let's head back to that application to get the work done.

In the previous parts we went through the abstraction and first integration steps of a new data source (Task 1), used the Data Catalog to define a business-friendly way to organize the available data assets through tags and categories (Task 2) and performed an interactive data analysis in Apache Zeppelin for Denodo (Task 3).

With a clear idea of the data preparation steps needed, let us now move back to the Web Design Studio, the main data modeling interface in the Platform, to build the data preparation flow for the machine learning algorithm training and validation.

The data preparation flow we are going to develop consists of the following steps:

  • Join the fact table iv_meter_reading_03_date_building with the hourly weather data (iv_weather) as well as with the GHCN daily and monthly data. We will do these three joins in a single transformation.
  • Join the resulting view with the holidays data.
  • Define a selection view with the final adjustments, formatting and field transformations.

Open the Web Design Studio url and login.

User: dstutorial_usr

Password: dstutorial_usr

Join with weather data

  1. Select the folder 10_prepared_data_for_ml/staging, open its contextual menu and select New->Join.
  2. Drag and drop the views iv_meter_reading_03_date_building and iv_weather into the work area.
  3. Click on the join icon of the view iv_meter_reading_03_date_building and drag the generated arrow on top of iv_weather.
  4. In the column lists generated at the bottom, link site_id and timestamp_in from the left entity to the right entity.
  5. Drag and drop the view iv_all_sites_weather_monthly_grp_by_site into the work area.
  6. Click on the join icon of the view iv_meter_reading_03_date_building and drag the generated arrow on top of iv_all_sites_weather_monthly_grp_by_site.
  7. In the column lists at the bottom, link site_id on the left with site on the right. Do the same with timestamp_in on the left and date on the right. The join condition is not valid yet; we will adjust it in a later step.
  8. Repeat steps 5 to 7 with iv_all_sites_weather_daily_grp_by_site.
  9. Go to the Join Conditions panel; you'll find three join conditions, corresponding to the three views being joined to the fact table. In the first one, change the join type to LEFT OUTER, then click on edit and substitute iv_weather.timestamp_in with ADDHOUR(iv_weather.timestamp_in,1).

  10. In the second join condition, change the join type to LEFT OUTER, then click on edit and replace the condition involving timestamp_in with:
    formatdate('yyyy-MM', iv_meter_reading_03_date_building.timestamp_in) = formatdate('yyyy-MM', addmonth(to_localdate('yyyy-MM', iv_all_sites_weather_monthly_grp_by_site.date), 1))

  11. In the third join condition, change the join type to LEFT OUTER, then click on edit and replace the condition involving timestamp_in with the following one (a VQL sketch of the overall join logic is shown after this list):
    formatdate('yyyy-MM-dd', iv_meter_reading_03_date_building.timestamp_in) = formatdate('yyyy-MM-dd', addday(to_localdate('yyyy-MM-dd', iv_all_sites_weather_daily_grp_by_site.date), 1))

  12. Go to the Output panel and do the following:
  • rename the view to iv_meter_reading_05_date_building_weather
  • keep and rename (if needed) the fields as per the instructions detailed in this table:

Source Table | Fields to keep
iv_meter_reading_03_date_building | Keep all the fields except stations_file and additional_weather_info
iv_weather | Keep all the fields except site_id, timestamp_in and istrain
iv_all_sites_weather_monthly_grp_by_site | Keep all the fields except site and date; prefix them with monthly_
iv_all_sites_weather_daily_grp_by_site | Keep all the fields except site and date; prefix them with daily_

  13. Set building_id, meter and timestamp_in as primary keys.
  14. Save the view.
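Putting the three conditions together, the join logic of iv_meter_reading_05_date_building_weather is conceptually equivalent to the following VQL (a sketch: the wizard generates the actual view definition, including the field selection, renaming and prefixing described in the table above; aliases are used here only for readability):

SELECT *
FROM iv_meter_reading_03_date_building mr
  LEFT OUTER JOIN iv_weather w
    ON mr.site_id = w.site_id
    AND mr.timestamp_in = ADDHOUR(w.timestamp_in, 1)
  LEFT OUTER JOIN iv_all_sites_weather_monthly_grp_by_site m
    ON mr.site_id = m.site
    AND formatdate('yyyy-MM', mr.timestamp_in) = formatdate('yyyy-MM', addmonth(to_localdate('yyyy-MM', m.date), 1))
  LEFT OUTER JOIN iv_all_sites_weather_daily_grp_by_site d
    ON mr.site_id = d.site
    AND formatdate('yyyy-MM-dd', mr.timestamp_in) = formatdate('yyyy-MM-dd', addday(to_localdate('yyyy-MM-dd', d.date), 1));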

To check that the view is correct open a VQL Shell (Tools->VQL Shell in the top menu) and run the following query (make sure that you are retrieving all rows):

SELECT *
FROM dstutorial.iv_meter_reading_05_date_building_weather
WHERE site_id = 0
AND EXTRACT(MONTH FROM timestamp_in) = 2;

It must return 245273 records and 41 columns in about 45 seconds. Moreover, visually check that fields coming from iv_weather are mostly non-missing. We will do a more in-depth missing values analysis later on.

Then run the following query:

SELECT COUNT(*)
FROM dstutorial.iv_meter_reading_05_date_building_weather
WHERE site_id = 0;

You should get 3336742 as the result in about 4 minutes.

Join with holiday data

To create the second join transformation, do the following:

  1. Select the folder 10_prepared_data_for_ml/staging, open its contextual menu and select New->Join.
  2. Drag and drop the views iv_meter_reading_05_date_building_weather and iv_holidays_state_cd into the work area.
  3. Click on the join icon of the view iv_meter_reading_05_date_building_weather and drag the generated arrow on top of iv_holidays_state_cd.
  4. In the column lists generated at the bottom, link location to state_cd and timestamp_in to date.
  5. Go to the Join Conditions panel, change the join type to LEFT OUTER, then click on edit and substitute the condition with the following one (a VQL sketch of the resulting join is shown after this list):
    TRIM(SUBSTR(TRIM(location),(LEN(TRIM(location))-2),4)) = state_cd AND FORMATDATE('yyyy-MM-dd',timestamp_in) = date

  6. Go to the Output panel, rename the view to iv_meter_reading_07_date_building_weather_holidays and remove the field iv_holidays_state_cd.state_cd.
  7. Rename the field date to is_holiday.
  8. Set building_id, meter and timestamp_in as primary keys.
  9. Save.
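Conceptually, the resulting view corresponds to the following join (a sketch: the wizard generates the actual definition, which also renames the date field to is_holiday as described above):

SELECT *
FROM iv_meter_reading_05_date_building_weather mr
  LEFT OUTER JOIN iv_holidays_state_cd h
    ON TRIM(SUBSTR(TRIM(mr.location),(LEN(TRIM(mr.location))-2),4)) = h.state_cd
    AND FORMATDATE('yyyy-MM-dd', mr.timestamp_in) = h.date;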

To check that the view is correct open a VQL Shell (Tools->VQL Shell in the top menu) and run the following queries (make sure that you are retrieving all rows):

SELECT COUNT(*)
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays
WHERE site_id = 0;

It must return 3336742 records in about 4 minutes.

Then:

SELECT DISTINCT FORMATDATE('yyyy-MM-dd', timestamp_in), is_holiday, holiday_description
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays
WHERE site_id = 0
AND FORMATDATE('yyyy-MM', timestamp_in) = '2016-01';

You should get 31 rows and 3 columns in about 60 seconds. Visually check that is_holiday is empty for all days except 2016-01-01 (New Year's Day) and 2016-01-18 (Martin Luther King Jr. Day). This field will be transformed to a boolean flag later.

Preparation of the final view

Our data is almost ready to be fed into the machine learning algorithm. However, there are still some fields to be transformed:

  • floor_count: it is a text column, but it represents an integer.
  • year_built: we can compute the building age from this column, as age is a more intuitive and meaningful predictor.
  • wind_direction: we can apply the cosine function to better represent the information it carries.
  • timestamp_in: the target variable has a strong dependence on the hour of the day. Let us extract it, so that the learning algorithm can leverage it.

Let us apply all this transformation logic in a new selection view. Whereas we have been defining views graphically so far, this time we are going to create the view in the VQL Shell, the interactive editor and command-line environment embedded in the Web Design Studio. To this end, we will build and then execute the appropriate CREATE VIEW statement.

To start with, let us select iv_meter_reading_07_date_building_weather_holidays, then click on New->VQL Shell->Select.

Let us apply the following modifications to the displayed SELECT clause:

  • Overwrite wind_direction with the following expression:
    COS(RADIANS(CAST('int',wind_direction))) AS wind_direction
  • Transform is_holiday to a boolean with the following expression:
    CASE WHEN (is_holiday IS null) THEN false ELSE true END is_holiday
  • Create the column is_weekend with the boolean form of column weekend, that can be dropped:
    CASE WHEN(iv_meter_reading_07_date_building_weather_holidays.weekend = 'Y') THEN true ELSE false END is_weekend
  • Calculate building_age with the following expression (year_built can be dropped afterwards):
    (GETYEAR(CURRENT_DATE())-year_built) AS building_age
  • Calculate reading_hour with the following expression:
    COS(RADIANS(GETHOUR(timestamp_in))) AS reading_hour
  • Cast all the columns of type decimal to float
  • Prefix the SELECT statement with:
    CREATE VIEW meter_reading_final FOLDER='/10_prepared_data_for_ml' PRIMARY KEY('building_id','meter', 'timestamp_in') AS

The final CREATE VIEW statement will look like the following:

CREATE OR REPLACE VIEW meter_reading_final FOLDER='/10_prepared_data_for_ml'
PRIMARY KEY ( 'building_id' , 'meter' , 'timestamp_in' )
AS SELECT
  timestamp_in,
  building_id,
  meter,
  site_id,
  istrain,
  CAST('float',meter_reading) AS meter_reading,
  day_name,
  CASE WHEN (weekend = 'Y') THEN true ELSE false END is_weekend,
  primary_use,
  square_feet,
  (GETYEAR(CURRENT_DATE())-year_built) AS building_age,
  CAST('int',floor_count) AS floor_count,
  location,
  county,
  state,
  climate_zone,
  location_confirmed,
  CAST('float',air_temperature) AS air_temperature,
  CAST('float',cloud_coverage) AS cloud_coverage,
  CAST('float',dew_temperature) AS dew_temperature,
  CAST('float',precip_depth_1_hr) AS precip_depth_1_hr,
  CAST('float',sea_level_pressure) AS sea_level_pressure,
  COS(RADIANS(CAST('int',wind_direction))) AS wind_direction,
  CAST('float',wind_speed) AS wind_speed,
  CAST('float',monthly_dp01) AS monthly_dp01,
  CAST('float',monthly_dp10) AS monthly_dp10,
  CAST('float',monthly_dsnw) AS monthly_dsnw,
  CAST('float',monthly_emsn) AS monthly_emsn,
  CAST('float',monthly_emxp) AS monthly_emxp,
  CAST('float',monthly_prcp) AS monthly_prcp,
  CAST('float',monthly_snow) AS monthly_snow,
  CAST('float',daily_prcp) AS daily_prcp,
  CAST('float',daily_snow) AS daily_snow,
  CAST('float',daily_tmin) AS daily_tmin,
  CAST('float',daily_tmax) AS daily_tmax,
  CAST('float',daily_snwd) AS daily_snwd,
  CAST('float',daily_tobs) AS daily_tobs,
  CAST('float',daily_awnd) AS daily_awnd,
  CAST('float',daily_wsf2) AS daily_wsf2,
  CAST('float',daily_wsf5) AS daily_wsf5,
  CAST('float',daily_tavg) AS daily_tavg,
  CASE WHEN (is_holiday IS null) THEN false ELSE true END is_holiday,
  holiday_description,
  COS(RADIANS(GETHOUR(timestamp_in))) AS reading_hour
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays;

Before running it, make sure you are connected to the database dstutorial. Run it, then refresh the Server Explorer (select the database dstutorial and click Refresh in its contextual menu): you'll find the view meter_reading_final in the folder 10_prepared_data_for_ml.

A final step is needed: as meter_reading_final is the view to be exposed to the machine learning environment, we would like to protect the backend data sources from excessive load and, at the same time, always ensure an acceptable data extraction time. For this reason we will make it compulsory to specify a site, by defining a view parameter.

To do that:

  1. Open the view meter_reading_final.
  2. Go to EDIT->Model Selection->Edit view parameters.
  3. Define a new view parameter called input_site_id of type int, click Add, then click Ok.
  4. Go to the Where Conditions menu and add the condition site_id = input_site_id.
  5. Save.

Test the view. To this end, run in a VQL Shell the following query (make sure you are retrieving all rows):

SELECT count(*)
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0;

You should get 3336742 as the result in about 4 minutes.

If the obtained result is ok, you can run a second query in the VQLShell (check the Retrieve all rows box)

SELECT *
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0
AND building_id = 10
AND EXTRACT(MONTH FROM timestamp_in) = 6
AND EXTRACT(YEAR FROM timestamp_in) = 2017 ;

You should obtain 720 rows and 45 columns in a little bit less than 2 minutes. Take the time to check if the transformations applied are working as expected. You can scroll right and down in the data preview panel to see all the fields.

Notice that in these queries we specified a value for parameter input_site_id. If a value for this parameter is not specified, the query will fail with a No search methods ready to be run error. This is the expected behaviour.
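The same rule applies when you later pull this view into a Python session, as we will do in the machine learning task: the query must always constrain input_site_id. The following is only a minimal sketch, assuming the Denodo ODBC driver is installed and an ODBC DSN named denodo_odbc pointing to the dstutorial database has been configured (the DSN name and driver setup are assumptions, not something configured earlier in this tutorial):

# Minimal sketch: fetch the parameterized view into pandas.
# Assumption: an ODBC DSN named "denodo_odbc" points to the dstutorial virtual database.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=denodo_odbc;UID=dstutorial_usr;PWD=dstutorial_usr")

query = """
SELECT *
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0
  AND meter = 1
"""

# Without the input_site_id condition this query would fail, exactly as in the VQL Shell.
df = pd.read_sql(query, conn)
print(df.shape)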

It is now time to wrap this part up: we have prepared the data for the machine learning algorithm training by enriching the meter readings table with hourly weather data, GHCN weather data and holidays-related data. All these transformations have been logically defined in the Web Design Studio, meaning that we have generated just the VQL code that implements our modeling logic, with no data replication. When executing the query, Denodo will find the best execution plan and get the results on the fly from the data sources.

It is interesting to open the Tree View of the final view, meter_reading_final, to visually grasp the complexity of the data transformations that the input data flows through. We recommend spending some time reviewing the Tree View, collapsing and expanding nodes as needed. Note that zoom in/zoom out and export functionality is also available.

Let us move on to learn how to build and run our Machine Learning validation, training and model persistence code in Apache Zeppelin for Denodo, using mainstream Python data science libraries, such as pandas and scikit-learn.

In the previous parts of the tutorial, we have learnt how different Denodo Platform components can help in typical phases of Machine Learning projects: advanced data exploration and tagging, in Data Catalog (Task 2) and Apache Zeppelin for Denodo (Task 3) as well as data integration in Web Design Studio (Task 1 and 4).

In particular, in Task 4 we have prepared our final training view, called meter_reading_final, through a pipeline that adds potential predictors from the building metadata table (size, number of floors ...), hourly, daily and monthly weather data at the building location, and information about holidays.

We are now moving back to Apache Zeppelin for Denodo for the core Machine Learning tasks (a minimal code sketch of these steps follows the list):

  • Data ingestion from the Denodo view.
  • Last steps of features engineering.
  • Model choice, cross validation and training.
  • Model persistence on disk.
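Before opening the notebook, here is a rough orientation of what these steps can look like with pandas and scikit-learn. It is a minimal sketch, not the exact content of the notebook: the feature subset, the estimator, the DSN name and the output file name are illustrative assumptions.

import pandas as pd
import pyodbc
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# 1. Data ingestion from the Denodo view (same assumed ODBC DSN as in the earlier sketch).
conn = pyodbc.connect("DSN=denodo_odbc;UID=dstutorial_usr;PWD=dstutorial_usr")
df = pd.read_sql(
    "SELECT * FROM dstutorial.meter_reading_final WHERE input_site_id = 0 AND meter = 1",
    conn,
)

# 2. Last steps of feature engineering: keep an illustrative subset of predictors
#    and only the rows where the target is known.
features = ["primary_use", "square_feet", "building_age", "dew_temperature",
            "monthly_dp01", "monthly_dp10", "monthly_emxp", "monthly_prcp", "reading_hour"]
train = df.dropna(subset=["meter_reading"])
X, y = train[features], train["meter_reading"]

# 3. Model choice and cross validation: one-hot encode the categorical predictor,
#    impute missing numeric values, then fit a simple regressor (any estimator could be swapped in).
numeric_features = [f for f in features if f != "primary_use"]
model = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["primary_use"]),
        ("num", SimpleImputer(strategy="median"), numeric_features),
    ])),
    ("reg", RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=42)),
])
scores = cross_val_score(model, X, y, cv=3, scoring="neg_root_mean_squared_error")
print("CV RMSE:", -scores.mean())

# 4. Training on the full dataset and model persistence on disk (hypothetical file name).
model.fit(X, y)
joblib.dump(model, "/tmp/meter_reading_model.pkl")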

Open the Apache Zeppelin url in your browser, then login with dstutorial_usr as user and password.

In the notebook list, click on 02-ML-Algorithm-Training to open it.

Now you can start reading the notebook and executing the code cells. If you need to go back to a known working version of the notebook you can do so by switching to the notebook 02-ML-Algorithm-Training-Completed.

We are using exclusively the Python interpreter in this notebook. If you need to reinitialize the runtime environment, open the interpreter menu at the top right of the notebook and restart the Python interpreter.

Please refer to Task 3 in this guide for a quick tour of the UI main functionalities.


If the notebook has been successfully run up to the final cell, the prediction model is now saved on disk, along with other complementary objects, in folder /tmp/.
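As a quick check that the persisted artifact is usable outside the notebook, you can load it back and ask for a single prediction. The file name below is the hypothetical one used in the sketch above, and the sample record simply reuses placeholder predictor values; adapt both to what the notebook actually wrote to /tmp/.

import pandas as pd
import joblib

# Load the persisted pipeline (hypothetical file name) and score one illustrative record.
model = joblib.load("/tmp/meter_reading_model.pkl")
sample = pd.DataFrame([{
    "primary_use": "Education", "square_feet": 1000, "building_age": 10,
    "dew_temperature": 10.0, "monthly_dp01": 10.0, "monthly_dp10": 10.0,
    "monthly_emxp": 10.0, "monthly_prcp": 10.0, "reading_hour": 0.0,
}])
print(model.predict(sample)[0])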

You are invited to change some model parameters, or modify/remove/add some features, and check whether this has an impact on the evaluation metrics. You can change the model itself, as there are many alternatives that may work better in this scenario. You can also make the cross validation and grid search more or less granular by modifying the parameter list and/or their values. Apache Zeppelin for Denodo is an ideal environment for this kind of task, which is part of the day-to-day activity of a machine learning engineer: you can run these experiments interactively, alternating code with comments and charts, so that a colleague can easily follow your workflow.
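For instance, making the grid search more or less granular is just a matter of editing a parameter grid. A minimal sketch, reusing the model, X and y objects from the earlier sketch (the grid values are illustrative, not the notebook's exact grid):

from sklearn.model_selection import GridSearchCV

# Illustrative grid over the "reg" step of the pipeline sketched earlier.
param_grid = {
    "reg__n_estimators": [50, 100, 200],
    "reg__max_depth": [None, 10, 20],
}
search = GridSearchCV(model, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, -search.best_score_)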

The final stage of the tutorial will show you how Denodo can distribute your model's predictions through a REST data service, allowing final applications to ignore the details of the algorithm used (i.e. its predictors) and retrieve the predictions they need via variables they are aware of, namely building_id and timestamp_in.

One of the goals of defining a virtual layer is to offer the data consumers a simple, uniform and efficient access to the data sources. The same general goal applies in Data Science workflows. Once our ML algorithm is fitted, we should design and implement the access system to its predictions.

Here, we are going to focus on the on-demand access pattern. Other considerations and techniques apply if batch access is under study.

The ML algorithm is published via a REST web service that takes the values of all the input variables the algorithm has been trained on and returns the prediction. The final applications and users do not know all these input variables (their number may be large and it can vary over time), but we can leverage the fact that all the data transformation steps have already been created in Denodo. We will therefore define a data source on the ML web service and build another web service in Denodo that receives the search parameters sent by the application, fetches the input variables, passes them to the ML web service and delivers the prediction back to the application.

We will go through the following steps:

  • Understand how the ML web service must be called.
  • Identify the validity conditions under which the ML web service gives consistent predictions.

Then, in Web Design Studio:

  • Create the appropriate folder structure.
  • Create a JSON data source abstracting the ML web service.
  • Create a base view on this data source.
  • Create a selection view on meter_reading_final to prepare all (and only) the predictor fields to be joined with the base view created in the previous step.
  • Create a join view between the views just created, with the predictor fields as join keys.
  • Create and deploy a REST data service that will take the input parameters timestamp_in and building_id and will give back the relevant predicted meter readings values with their respective timestamp.

Before starting with the development, let's see how to call the ML web service and what its validity boundaries are.

The ML web service expects all the predictors that the model has been trained on in order to emit the building energy consumption predictions. As a reminder, the model has been cross-validated and trained in the previous task, where you will find the list of predictors.

There are two endpoints available for the ML web service:

  • v1 (endpoint: http://mlservice.dstutorial.com:5010/meter_readings/api/v1.0/):
    Each predictor must be passed as a name/value pair, as shown here:
    param1=value1&param2=value2&[...]&paramN=valueN
  • v2 (endpoint: http://mlservice.dstutorial.com:5010/meter_readings/api/v2.0/):
    All the predictors must be embedded in a JSON document passed as the value of the input_data parameter, as shown here:
    [{"param1":"value1","param2":"value2", ..., "paramN":"valueN"}]

Both endpoints reply with a JSON document containing the field predicted_meter_reading and its value, e.g. {"predicted_meter_reading":180.186}.

For the rest of the task, we will be using endpoint v1; a quick sketch of both calling conventions is shown below. If you are interested in how to do the integration with endpoint v2, you will find the equivalent data preparation flow in the database dstutorial_sample_completed.
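As an illustration of the two calling conventions, the following sketch calls both endpoints from Python with the same example predictor values used in the browser test further below. It assumes both endpoints accept plain GET requests, which is how we will use v1 from Denodo.

import json
import requests

predictors = {
    "primary_use": "Office", "square_feet": 27000, "building_age": 11,
    "dew_temperature": 9.4, "monthly_dp01": 12.6, "monthly_dp10": 8.1,
    "monthly_emxp": 1.783, "monthly_prcp": 6.390909,
    "reading_hour": 0.9876883405951378,
}

# v1: each predictor is passed as a name/value pair in the query string.
r1 = requests.get("http://mlservice.dstutorial.com:5010/meter_readings/api/v1.0/",
                  params=predictors)
print(r1.json()["predicted_meter_reading"])

# v2: the predictors are embedded in a JSON document passed as the input_data parameter.
r2 = requests.get("http://mlservice.dstutorial.com:5010/meter_readings/api/v2.0/",
                  params={"input_data": json.dumps([predictors])})
print(r2.json()["predicted_meter_reading"])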

Regarding the validity boundary of the web service, let's recall the data scope that we used to train our machine learning algorithm:

  • The site is University of Central Florida, Orlando, FL (site_id=0).
  • The meter is chilledwater (meter=1).

It is very important to let end users know the validity scope of the prediction service and to design it in such a way that predictions are always consistent with the data scope used at training time.

Moreover, we should keep in mind that we can only retrieve predictor values that are already available; in our specific case these correspond to the records of the view bv_meter_reading_test. This imposes a condition on the timestamp specified by the data consumer: it cannot be later than 2018-12-31 23:00:00 (SELECT MAX(timestamp_in) FROM bv_meter_reading_test).
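If you wrap calls to the prediction service in client code, it is worth enforcing this validity scope before issuing the request. The following is a small illustrative guard using the boundaries listed above (site_id=0, meter=1, timestamps between 2017-01-01 and 2018-12-31 23:00:00); it is a sketch, not part of the service itself.

from datetime import datetime

# Validity scope of the prediction service, as described above.
FIRST_TS = datetime(2017, 1, 1, 0, 0, 0)
LAST_TS = datetime(2018, 12, 31, 23, 0, 0)  # SELECT MAX(timestamp_in) FROM bv_meter_reading_test

def is_in_scope(site_id: int, meter: int, timestamp_in: datetime) -> bool:
    """Return True only if a prediction request falls inside the trained data scope."""
    return site_id == 0 and meter == 1 and FIRST_TS <= timestamp_in <= LAST_TS

print(is_in_scope(0, 1, datetime(2017, 8, 11, 0, 0, 0)))   # True
print(is_in_scope(0, 1, datetime(2019, 1, 1, 0, 0, 0)))    # False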

With these important considerations in mind, we can now get started with the REST web service development.

The first step is to test the low-level Python ML web service directly. You can open the following URL in a browser (or any other REST client):

http://mlservice.dstutorial.com:5010/meter_readings/api/v1.0/?primary_use=Office&square_feet=27000&building_age=11&dew_temperature=9.4&monthly_dp01=12.6&monthly_dp10=8.1&monthly_emxp=1.783&monthly_prcp=6.390909&reading_hour=0.9876883405951378

You should get back a JSON document containing the field predicted_meter_reading with a floating-point value.

Let's go on now with defining a data source and a base view on this web service.

Let's log in into the Web Design Studio

User: dstutorial_usr
Password: dstutorial_usr

As a first step, create the following folders:

  • /01_connectivity/01_data_sources/prediction
  • /01_connectivity/02_base_views/prediction
  • /02_integration/prediction/staging
  • /02_integration/prediction/staging/v1

To create the first folder, select /01_connectivity/01_data_sources, click on New->Folder and enter prediction as the name. The folder is created and visible in the folder tree. Do the same for the remaining folders.

Now select /01_connectivity/01_data_sources/prediction, then click on New->Data Source->JSON.

In the new wizard, set the Name to ds_ml_prediction_service_v1 and Data Route to HTTP Client.

Then expand the Configuration menu and paste the following url in the URL field:

http://mlservice.dstutorial.com:5010/meter_readings/api/v1.0/?primary_use=@primary_use&square_feet=@square_feet&building_age=@building_age&dew_temperature=@dew_temperature&monthly_dp01=@monthly_dp01&monthly_dp10=@monthly_dp10&monthly_emxp=@monthly_emxp&monthly_prcp=@monthly_prcp&reading_hour=@reading_hour

As you can see, we are passing each predictor name and its value as a separate parameter (param_name=@param_name).

By prefixing a parameter value with @ we specify that the value is not constant: it must be provided dynamically, either by a user or by another view. In order to let Denodo probe the web service, we need to enter probe values; based on the web service response, Denodo will create the structure of the base view.

If you now click on Test Connection, you will be prompted to specify the probe values for all the predictors. You can enter the values listed in this table:

  • primary_use: Education
  • square_feet: 1000
  • building_age: 10
  • dew_temperature: 10.0
  • monthly_dp01: 10.0
  • monthly_dp10: 10.0
  • monthly_emxp: 10.0
  • monthly_prcp: 10.0
  • reading_hour: 0.0

Then Save.

Go on with creating a base view on ds_ml_prediction_service_v1.

Open the data source, if it is not already open, then click on CREATE BASE VIEW.

Enter the same probe values as above for the parameters. The view is created. Click on EDIT to change:

  • The name of the view to bv_ml_prediction_service_v1.
  • The folder to /01_connectivity/02_base_views/prediction.

Finally, try executing the view with the same parameter values specified above.

The data service preparation goes on now with the creation of a selection view on meter_reading_final. The goal is to prepare all the predictor fields.

So let's select the folder /02_integration/prediction/staging/v1, then click on New->Selection. Drag and drop the view meter_reading_final in the work area, then:

  1. In Where Conditions, enter input_site_id=0 and meter=1.
  2. In the Output menu, select all the fields except timestamp_in, building_id, meter and site_id, and click on Remove.
  3. Create one field per predictor using the COALESCE function:
    COALESCE( field_name, 'null' )
    This is needed because we will join this view with the view defined over the ML web service, to dynamically pass all the values that the ML web service needs to emit the prediction. The Denodo optimizer will push a not-null condition on each field that enters the join condition down the view tree. While this is the expected behaviour in normal situations, it is not what we want here: the web service must be called even if some predictor value is missing. A missing value, if present, will be properly handled by the ML web service.
  4. Set building_id and timestamp_in as primary keys by clicking on the corresponding key icon. The column meter can be left out of the primary key definition as we filtered on it.
  5. Change the view name to iv_prediction_service_ondemand_v1.

To test your view you can run the following query:

SELECT *
FROM dstutorial.iv_prediction_service_ondemand_v1
WHERE CAST('localdate',timestamp_in) = '2017-01-01'
AND building_id = 7;

It should return 24 rows and 13 columns in about 80 seconds.

Now, we have to join the views iv_prediction_service_ondemand_v1 and bv_ml_prediction_service_v1 on all the predictor fields.

So let's select the folder /02_integration/prediction/, then click on New->Join. Drag and drop the views iv_prediction_service_ondemand_v1 and bv_ml_prediction_service_v1 in the work area, then:

  1. From the view iv_prediction_service_ondemand_v1, drag the join arrow on top of bv_ml_prediction_service_v1.
  2. In the bottom menu, link all the predictor fields from both views.
  3. In the Output tab, remove all the predictor fields and rename the view to iv_prediction_service_ondemand_v1_final.
  4. Save.

Test the view iv_prediction_service_ondemand_v1_final with the following query:

SELECT *
FROM dstutorial.iv_prediction_service_ondemand_v1_final
WHERE building_id = 7
AND timestamp_in = TIMESTAMP '2017-08-11 00:00:00';

This query must return a single row in a few seconds, with a non null value for predicted_meter_reading.

We test this query because it is of the same type as the queries that will be generated by the final data service, which we are now going to build and deploy.

Let us now design our building energy consumption service:

  • It will have two mandatory parameters, building_id and timestamp_in, and will return to the requester the prediction (predicted_meter_reading) for that combination of values.
  • The output will be JSON-formatted.
  • It will require HTTP Basic authentication, with mlservice_usr as both user and password.

To create it, follow these instructions:

  • Select the folder 05_data_services, then click on New->Data service->REST Web Service.
  • Enter the name meterreadingpredondemand_v1, then drag and drop iv_prediction_service_ondemand_v1_final into the Resources table.
  • Expand the columns of iv_prediction_service_ondemand_v1_final, then:
  • Remove meter and site_id.
  • Replicate the following setup for the parameters:

  • Switch to the Settings tab, set the default representation to JSON, set the Authentication to HTTP Basic and enter mlservice_usr as both login and password.
  • It is recommended to document how to call the service, as well as its validity conditions, in the Description field of the Metadata panel. For example, you can copy and paste the following text:

This is the Building Energy Consumption Prediction Service.

The predictions are valid for the Chilled Water meter reading at site University of Central Florida, Orlando, FL.

Input parameters:

- timestamp_in: with format YYYY-MM-DDTHH:MM:SS, it must be between 2017-01-01 and 2018-12-31, both included

- building_id: it must be between 0 and 104, both included

  • Save.
  • Click on the deploy button and confirm the password for dstutorial_usr. Notice that while the requester will need to authenticate with the user mlservice_usr, Denodo will run the corresponding query with the user dstutorial_usr.

The Prediction Web Service is now deployed and ready to be consumed by end applications. You can see and change the deployment status of any Denodo REST Web Service by using the Web Service Container menu accessible from the Tools menu, located at the top-right of the screen.

To test the REST Web Service, you can query it with your preferred client, for example the browser.

Here is an example:

http://denodo.dstutorial.com:9090/server/dstutorial/meterreadingpredondemand_v1/views/iv_prediction_service_ondemand_v1_final?building_id=7&timestamp_in=2017-01-20T10:00:00
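The same call can be scripted from any application. For example, with Python requests and the HTTP Basic credentials configured above (JSON is already the default representation we set for the service, so the sketch simply prints the parsed response):

import requests

url = ("http://denodo.dstutorial.com:9090/server/dstutorial/"
       "meterreadingpredondemand_v1/views/iv_prediction_service_ondemand_v1_final")
params = {"building_id": 7, "timestamp_in": "2017-01-20T10:00:00"}

r = requests.get(url, params=params, auth=("mlservice_usr", "mlservice_usr"))
r.raise_for_status()
print(r.json())  # should contain predicted_meter_reading for the requested hour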

As an exercise, you may want to build another REST data service that, instead of the timestamp, requires a date parameter, thus returning all the predictions for a given building and meter for a whole day. To start, you would define a new field in the view iv_prediction_service_ondemand_v1_final that derives the date from the field timestamp_in.

The tutorial has now come to an end. We have built together a building energy consumption prediction system that exposes its predictions through a REST data service.

We began the journey by importing raw data sources, then explored them in the business-oriented Data Catalog and in Apache Zeppelin for Denodo, the notebook-based analytics environment.

Later, we logically combined the data in the Web Design Studio and, after training and saving the model back in Apache Zeppelin for Denodo, we published the REST data service that enables real-time consumption of the predictions from any application.

We hope that, through this hands-on real-world example, we have clearly shown how the Denodo Platform brings agility and value to all stages of machine learning projects, and that we have given you the background needed to start thinking about data virtualization in other analytical initiatives.