Task 4 - Data Modeling in Web Design Studio (part II)
In the previous parts we went through the abstraction and first integration steps of a new data source (Task 1), used the Data Catalog to define a business-friendly way to organize the available data assets through tags and categories (Task 2), and performed an interactive data analysis in Apache Zeppelin for Denodo (Task 3).
With a clear idea of the data preparation steps needed, let us now move back to the Web Design Studio, the main data modeling interface in the Platform, to build the data preparation flow for the machine learning algorithm training and validation.
The data preparation flow we are going to develop consists of the following steps:

- Join the fact table `iv_meter_reading_03_date_building` with the hourly weather data, `iv_weather`, as well as with the GHCN daily and monthly data. We will perform this 3-way join in a single transformation.
- Join the resulting view with the holidays data.
- Define a selection view with the final adjustments, formatting and field transformations.
Open the Web Design Studio URL and log in.

User: dstutorial_usr
Password: dstutorial_usr
Join with weather data
To create the first join transformation, do the following:

1. Select the folder `10_prepared_data_for_ml/staging`, then click New->Join.
2. Drag and drop the views `iv_meter_reading_03_date_building` and `iv_weather` into the work area.
3. Click on the view `iv_meter_reading_03_date_building` and drag the generated arrow on top of `iv_weather`.
4. In the generated column lists at the bottom, link `site_id` and `timestamp_in` from the left entity to the right entity.
5. Drag and drop the view `iv_all_sites_weather_monthly` into the work area.
6. Click on the view `iv_meter_reading_03_date_building` and drag the generated arrow on top of `iv_all_sites_weather_monthly`.
7. In the generated column lists at the bottom, link `site_id` on the left with `site` on the right. Do the same with `timestamp_in` on the left and `date` on the right. The join condition is not valid yet; we will adjust it in a later step.
8. Repeat steps 5 to 7 with `iv_all_sites_weather_daily`.
9. Go to the Join Conditions panel; you will find three join conditions, corresponding to the three views that we are joining. In the first one, change the join type to LEFT OUTER, then click on edit and substitute `iv_weather.timestamp_in` with `ADDHOUR(iv_weather.timestamp_in, 1)`.
10. In the second join condition, change the join type to LEFT OUTER, then click on edit and replace the condition involving `timestamp_in` with:

    ```
    formatdate('yyyy-MM', iv_meter_reading_03_date_building.timestamp_in) = formatdate('yyyy-MM', addmonth(to_localdate('yyyy-MM', iv_all_sites_weather_monthly_grp_by_site.date), 1))
    ```

11. In the third join condition, change the join type to LEFT OUTER, then click on edit and replace the condition involving `timestamp_in` with:

    ```
    formatdate('yyyy-MM-dd', iv_meter_reading_03_date_building.timestamp_in) = formatdate('yyyy-MM-dd', addday(to_localdate('yyyy-MM-dd', iv_all_sites_weather_daily_grp_by_site.date), 1))
    ```

12. Go to the Output panel, and do the following actions:

    - Rename the view to `iv_meter_reading_05_date_building_weather`.
    - Keep and rename (if needed) the fields as per the instructions detailed in this table:

      | Source view | Fields to keep |
      |---|---|
      | `iv_meter_reading_03_date_building` | Keep all the fields except `stations_file` and `additional_weather_info` |
      | `iv_weather` | Keep all the fields except `site_id`, `timestamp_in` and `istrain` |
      | `iv_all_sites_weather_monthly_grp_by_site` | Keep all the fields except `site` and `date`; prefix them with `monthly_` |
      | `iv_all_sites_weather_daily_grp_by_site` | Keep all the fields except `site` and `date`; prefix them with `daily_` |

    - Set `building_id`, `meter` and `timestamp_in` as primary keys.

13. Save the view.

To check that the view is correct, open a VQL Shell (Tools->VQL Shell in the top menu) and run the following query (make sure that you are retrieving all rows):

```sql
SELECT *
FROM dstutorial.iv_meter_reading_05_date_building_weather
WHERE site_id = 0
AND EXTRACT(MONTH FROM timestamp_in) = 2;
```
It must return 245273 records and 41 columns in about 45 seconds. Moreover, visually check that the fields coming from `iv_weather` are mostly non-missing. We will do a more in-depth missing-values analysis later on.
Then run the following query:

```sql
SELECT COUNT(*)
FROM dstutorial.iv_meter_reading_05_date_building_weather
WHERE site_id = 0;
```
You should get 3336742 as the result in about 4 minutes.
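The alignment logic in the three join conditions can be hard to visualize: shifting the weather timestamp by one hour, and the GHCN aggregate dates by one month (or one day), matches each reading with the previous period's weather. The following pandas sketch illustrates the same idea; it is illustrative only, with made-up data, while the real joins are executed by Denodo:

```python
import pandas as pd

# Toy fact table: one meter reading per hour (made-up data)
readings = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp_in": pd.to_datetime(["2016-02-01 10:00", "2016-02-01 11:00"]),
    "meter_reading": [21.5, 22.0],
})

# Hourly weather: ADDHOUR(iv_weather.timestamp_in, 1) in VQL means a reading
# taken at hour H is matched with the weather record of hour H-1
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp_in": pd.to_datetime(["2016-02-01 09:00", "2016-02-01 10:00"]),
    "air_temperature": [3.1, 3.4],
})
weather["join_ts"] = weather["timestamp_in"] + pd.Timedelta(hours=1)

enriched = readings.merge(
    weather.drop(columns="timestamp_in"),
    left_on=["site_id", "timestamp_in"],
    right_on=["site_id", "join_ts"],
    how="left",                                   # LEFT OUTER, as in the tutorial
).drop(columns="join_ts")

# Monthly aggregates: addmonth(date, 1) aligns each reading with the
# aggregate of the previous month (January stats serve February readings)
monthly = pd.DataFrame({"site": [0], "date": ["2016-01"], "monthly_prcp": [80.2]})
monthly["join_month"] = (
    pd.to_datetime(monthly["date"]) + pd.DateOffset(months=1)
).dt.strftime("%Y-%m")

enriched["month"] = enriched["timestamp_in"].dt.strftime("%Y-%m")
enriched = enriched.merge(
    monthly.drop(columns="date"),
    left_on=["site_id", "month"],
    right_on=["site", "join_month"],
    how="left",
).drop(columns=["site", "join_month", "month"])

print(enriched)
```

The daily GHCN join follows the same pattern with `pd.DateOffset(days=1)`.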
Join with holiday data
To create the second join transformation, do the following:

1. Select the folder `10_prepared_data_for_ml/staging`, then click New->Join.
2. Drag and drop the views `iv_meter_reading_05_date_building_weather` and `iv_holidays_state_cd` into the work area.
3. Click on the view `iv_meter_reading_05_date_building_weather` and drag the generated arrow on top of `iv_holidays_state_cd`.
4. In the generated column lists at the bottom, link `location` to `state_cd` and `timestamp_in` to `date`.
5. Go to the Join Conditions panel, change the join type to LEFT OUTER, then click on edit and substitute the condition with the following one:

    ```
    TRIM(SUBSTR(TRIM(location), (LEN(TRIM(location))-2), 4)) = state_cd
    AND FORMATDATE('yyyy-MM-dd', timestamp_in) = date
    ```

6. Go to the Output panel, rename the view to `iv_meter_reading_07_date_building_weather_holidays` and remove the field `iv_holidays_state_cd.state_cd`.
7. Rename the field `date` to `is_holiday`.
8. Set `building_id`, `meter` and `timestamp_in` as primary keys.
9. Save.
To check that the view is correct, open a VQL Shell (Tools->VQL Shell in the top menu) and run the following queries (make sure that you are retrieving all rows):

```sql
SELECT count(*)
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays
WHERE site_id = 0;
```
It must return 3336742 records in about 4 minutes.
Then:

```sql
SELECT DISTINCT FORMATDATE('yyyy-MM-dd', timestamp_in), is_holiday, holiday_description
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays
WHERE site_id = 0
AND FORMATDATE('yyyy-MM', timestamp_in) = '2016-01';
```
You should get 31 rows and 3 columns in about 60 seconds. Visually check that `is_holiday` is empty for all days except 2016-01-01 (New Year's Day) and 2016-01-18 (Martin Luther King Jr. Day). This field will be transformed into a boolean flag later.
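The state-code extraction in the join condition, `TRIM(SUBSTR(TRIM(location), (LEN(TRIM(location))-2), 4))`, pulls the trailing two-letter US state code out of the location string. A minimal Python sketch of the same intent, assuming location values that end with a state code (the example strings below are made up, not tutorial data):

```python
def state_code(location: str) -> str:
    """Extract the trailing two-letter state code from a location string.

    Mirrors the intent of the VQL join-condition expression; the
    example locations are assumptions for illustration only.
    """
    return location.strip()[-2:]

print(state_code("  Orlando, FL "))   # FL
```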
Preparation of the final view
Our data is almost ready to be fed into the machine learning algorithm. However, there are still some fields to be transformed:

- `floor_count`: it is a text column, but it represents an integer.
- `year_built`: we can compute the building age from this column, as age is a more intuitive and meaningful predictor.
- `wind_direction`: we can apply the cosine function to better represent the information that it carries.
- `timestamp_in`: the target variable has a strong dependence on the hour of the day. Let us extract it, so that the learning algorithm can leverage it.
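These transformations can be prototyped outside Denodo as well. The following Python sketch mirrors the VQL expressions used below on a single made-up record (all field values are assumptions for illustration):

```python
import math
from datetime import datetime

def transform_row(row: dict) -> dict:
    """Mirror the VQL field transformations on one record (illustrative only)."""
    out = dict(row)
    # floor_count is a text column that actually holds an integer
    out["floor_count"] = int(row["floor_count"])
    # building age is a more intuitive predictor than the raw construction year
    out["building_age"] = datetime.now().year - row["year_built"]
    # cosine encoding, as in COS(RADIANS(CAST('int', wind_direction)))
    out["wind_direction"] = math.cos(math.radians(int(row["wind_direction"])))
    # hour of day, encoded as in COS(RADIANS(GETHOUR(timestamp_in)))
    out["reading_hour"] = math.cos(math.radians(row["timestamp_in"].hour))
    return out

row = {
    "floor_count": "3",
    "year_built": 2000,
    "wind_direction": "90",
    "timestamp_in": datetime(2016, 2, 1, 0, 0),
}
print(transform_row(row))
```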
Let us apply all this transformation logic in a new selection view. While so far we have defined the views graphically, this time we are going to create the view in the VQL Shell, the interactive editor and command-line environment embedded in the Web Design Studio. To this end, we will build and then execute the appropriate CREATE VIEW statement.
To start with, let us select `iv_meter_reading_07_date_building_weather_holidays`, then click New->VQL Shell->Select.
Let us apply the following modifications to the displayed SELECT clause:

- Overwrite `wind_direction` with the following expression: `COS(RADIANS(CAST('int', wind_direction))) AS wind_direction`
- Transform `is_holiday` into a boolean with the following expression: `CASE WHEN (is_holiday IS null) THEN false ELSE true END is_holiday`
- Create the column `is_weekend` as the boolean form of the column `weekend`, which can then be dropped: `CASE WHEN (weekend = 'Y') THEN true ELSE false END is_weekend`
- Calculate `building_age` with the following expression (`year_built` can be dropped afterwards): `(GETYEAR(CURRENT_DATE()) - year_built) AS building_age`
- Calculate `reading_hour` with the following expression: `COS(RADIANS(GETHOUR(timestamp_in))) AS reading_hour`
- Cast all the columns of type decimal to float.
- Prefix the SELECT statement with: `CREATE VIEW meter_reading_final FOLDER='/10_prepared_data_for_ml' PRIMARY KEY('building_id', 'meter', 'timestamp_in') AS`
The final CREATE VIEW statement will look like the following:

```sql
CREATE OR REPLACE VIEW meter_reading_final FOLDER='/10_prepared_data_for_ml'
PRIMARY KEY ( 'building_id' , 'meter' , 'timestamp_in' )
AS SELECT
timestamp_in,
building_id,
meter,
site_id,
istrain,
CAST('float',meter_reading) AS meter_reading,
day_name,
CASE WHEN (weekend = 'Y') THEN true ELSE false END is_weekend,
primary_use,
square_feet,
(GETYEAR(CURRENT_DATE())-year_built) AS building_age,
CAST('int',floor_count) AS floor_count,
location,
county,
state,
climate_zone,
location_confirmed,
CAST('float',air_temperature) AS air_temperature,
CAST('float',cloud_coverage) AS cloud_coverage,
CAST('float',dew_temperature) AS dew_temperature,
CAST('float',precip_depth_1_hr) AS precip_depth_1_hr,
CAST('float',sea_level_pressure) AS sea_level_pressure,
COS(RADIANS(CAST('int',wind_direction))) AS wind_direction,
CAST('float',wind_speed) AS wind_speed,
CAST('float',monthly_dp01) AS monthly_dp01,
CAST('float',monthly_dp10) AS monthly_dp10,
CAST('float',monthly_dsnw) AS monthly_dsnw,
CAST('float',monthly_emsn) AS monthly_emsn,
CAST('float',monthly_emxp) AS monthly_emxp,
CAST('float',monthly_prcp) AS monthly_prcp,
CAST('float',monthly_snow) AS monthly_snow,
CAST('float',daily_prcp) AS daily_prcp,
CAST('float',daily_snow) AS daily_snow,
CAST('float',daily_tmin) AS daily_tmin,
CAST('float',daily_tmax) AS daily_tmax,
CAST('float',daily_snwd) AS daily_snwd,
CAST('float',daily_tobs) AS daily_tobs,
CAST('float',daily_awnd) AS daily_awnd,
CAST('float',daily_wsf2) AS daily_wsf2,
CAST('float',daily_wsf5) AS daily_wsf5,
CAST('float',daily_tavg) AS daily_tavg,
CASE WHEN (is_holiday IS null ) THEN false ELSE true END is_holiday,
holiday_description,
COS(RADIANS(GETHOUR(timestamp_in))) AS reading_hour
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays;
```
Before running it, make sure you are connected to the database `dstutorial`. Run it, then refresh the Server Explorer (to do that, select the database `dstutorial`, then Refresh); you will find the view `meter_reading_final` in the folder `10_prepared_data_for_ml`.
A final step is needed: since `meter_reading_final` is the view to be exposed to the machine learning environment, we want to protect the backend data sources from excessive load while also ensuring an acceptable data extraction time. For this reason we will make it compulsory to specify at least one of the sites, by defining a view parameter.
To do that:
1. Open the view `meter_reading_final`.
2. Go to EDIT->Model Selection->Edit view parameters.
3. Define a new view parameter called `input_site_id` of type `int`, click Add, then click Ok.
4. Go to the Where Conditions menu and add the condition `site_id = input_site_id`.
5. Save.
Test the view. To this end, run the following query in a VQL Shell (make sure you are retrieving all rows):

```sql
SELECT count(*)
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0;
```
You should get 3336742 as the result in about 4 minutes.
If the obtained result is OK, you can run a second query in the VQL Shell (check the Retrieve all rows box):

```sql
SELECT *
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0
AND building_id = 10
AND EXTRACT(MONTH FROM timestamp_in) = 6
AND EXTRACT(YEAR FROM timestamp_in) = 2017;
```
You should obtain 720 rows and 45 columns in a little less than 2 minutes. Take the time to check that the transformations applied are working as expected. You can scroll right and down in the data preview panel to see all the fields.
Notice that in these queries we specified a value for the parameter `input_site_id`. If a value for this parameter is not specified, the query will fail with a `No search methods ready to be run` error. This is the expected behaviour.
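The mandatory-parameter contract can be summarized with a small client-side sketch. The helper below is hypothetical (it is not a Denodo API); it only illustrates that any query against `meter_reading_final` must carry a value for `input_site_id`:

```python
def build_extraction_query(input_site_id=None):
    """Build a query for meter_reading_final (hypothetical helper, not a Denodo API).

    Denodo rejects queries that omit the view parameter with a
    'No search methods ready to be run' error, so this sketch
    enforces the same contract on the client side.
    """
    if input_site_id is None:
        raise ValueError("input_site_id is mandatory for meter_reading_final")
    return (
        "SELECT * FROM dstutorial.meter_reading_final "
        f"WHERE input_site_id = {int(input_site_id)}"
    )

print(build_extraction_query(0))
```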
It is now time to wrap this part up: we have prepared the data for the machine learning algorithm training by enriching the meter readings table with hourly weather data, GHCN weather data and holidays-related data. All these transformations have been logically defined in the Web Design Studio, meaning that we have generated just the VQL code that implements our modeling logic, with no data replication. When executing the query, Denodo will find the best execution plan and get the results on the fly from the data sources.
It is interesting to open the Tree View of the final view, `meter_reading_final`, to visually grasp the complexity of the data transformations that the input data flows through. We recommend spending some time reviewing the Tree View, collapsing and expanding nodes as needed. Note that zoom in/zoom out and export functionality is also available, as shown in the following screen recording:

Let us move on to learn how to build and run our Machine Learning validation, training and model persistence code in Apache Zeppelin for Denodo, using mainstream Python data science libraries, such as pandas and scikit-learn.