Task 4 - Data Modeling in Web Design Studio (part II)

Data Science and Machine Learning

In the previous parts we went through the abstraction and first integration steps of a new data source (Task 1), used the Data Catalog to define a business-friendly way to organize the available data assets through tags and categories (Task 2) and performed an interactive data analysis in Apache Zeppelin for Denodo (Task 3).

With a clear idea of the data preparation steps needed, let us now move back to the Web Design Studio, the main data modeling interface in the Platform, to build the data preparation flow for the machine learning algorithm training and validation.

The data preparation flow we are going to develop consists of the following steps:

  • Join the fact table iv_meter_reading_03_date_building with the hourly weather data (iv_weather) as well as with the GHCN daily and monthly data. We will do this 3-way join in a single transformation.
  • Join the resulting view with the holidays data.
  • Define a selection view with the final adjustments, formatting and fields transformations.

Open the Web Design Studio URL and log in.

User: dstutorial_usr
Password: dstutorial_usr

Join with weather data

To create the first join transformation, do the following:

  1. Select the folder 10_prepared_data_for_ml/staging, open its contextual menu, then click New->Join.
  2. Drag and drop the views iv_meter_reading_03_date_building and iv_weather in the work area.
  3. Click on the join icon for view iv_meter_reading_03_date_building and drag the generated arrow on top of iv_weather.
  4. In the generated column lists at the bottom, link site_id and timestamp_in, from the left entity to the right entity.
  5. Drag and drop the view iv_all_sites_weather_monthly in the work area.
  6. Click on the join icon for view iv_meter_reading_03_date_building and drag the generated arrow on top of iv_all_sites_weather_monthly.
  7. In the generated column lists at the bottom, link site_id on the left with site on the right, and timestamp_in on the left with date on the right. The join condition is not valid yet; we will adjust it in a later step.
  8. Repeat steps 5 to 7 with iv_all_sites_weather_daily.
  9. Go to the Join Conditions panel; you'll find three join conditions corresponding to the three views that we are joining. In the first one, change the join type to LEFT OUTER, then click on edit and substitute iv_weather.timestamp_in with ADDHOUR(iv_weather.timestamp_in,1).
  10. In the second join condition, change the join type to LEFT OUTER, then click on edit and replace the condition involving timestamp_in with:

formatdate('yyyy-MM', iv_meter_reading_03_date_building.timestamp_in) = formatdate('yyyy-MM', addmonth(to_localdate('yyyy-MM', iv_all_sites_weather_monthly_grp_by_site.date), 1))

  11. In the third join condition, change the type to LEFT OUTER, then click on edit and replace the condition involving timestamp_in with the following one:

formatdate('yyyy-MM-dd', iv_meter_reading_03_date_building.timestamp_in) = formatdate('yyyy-MM-dd', addday(to_localdate('yyyy-MM-dd', iv_all_sites_weather_daily_grp_by_site.date), 1))

  12. Go to the Output panel and do the following actions:
  • rename the view to iv_meter_reading_05_date_building_weather
  • keep and rename the fields as detailed in this table:

Source view                               Fields to keep
iv_meter_reading_03_date_building         All fields except stations_file and additional_weather_info
iv_weather                                All fields except site_id, timestamp_in and istrain
iv_all_sites_weather_monthly_grp_by_site  All fields except site and date; prefix the kept fields with monthly_
iv_all_sites_weather_daily_grp_by_site    All fields except site and date; prefix the kept fields with daily_
  13. Set building_id, meter and timestamp_in as primary keys (each field will show the primary key icon once defined).

TIP

If you want to adjust the fields display order, you can check the field(s) and move them up or down with the arrows at the right of the panel.

TIP

If you inadvertently deleted a field, you can restore it by clicking on Restore.
  14. Save the view.
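The shifted join conditions above pair each meter reading with the weather observed one period earlier, and the Output panel prefixes the GHCN fields. Both pieces of logic can be sketched in pandas (the frames and column names below are simplified illustrations, not the actual views):

```python
import pandas as pd

# Illustrative frames standing in for the Denodo views.
readings = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp_in": pd.to_datetime(["2016-01-01 01:00", "2016-01-01 02:00"]),
    "meter_reading": [10.5, 11.0],
})
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp_in": pd.to_datetime(["2016-01-01 00:00", "2016-01-01 01:00"]),
    "air_temperature": [3.2, 2.8],
})

# Like ADDHOUR(iv_weather.timestamp_in, 1): shift the weather timestamps
# forward one hour, then LEFT OUTER join on site and timestamp, so every
# reading is matched with the weather observed one hour earlier.
weather_shifted = weather.assign(
    timestamp_in=weather["timestamp_in"] + pd.Timedelta(hours=1))
joined = readings.merge(weather_shifted,
                        on=["site_id", "timestamp_in"], how="left")

# Output-panel renaming for the GHCN views: drop the join keys and prefix
# the remaining fields.
monthly = pd.DataFrame({"site": [0], "date": ["2015-12"], "prcp": [40.2]})
monthly_renamed = monthly.drop(columns=["site", "date"]).add_prefix("monthly_")
```

The one-period shift makes each aggregate a lagged feature: the model only sees weather that was already observed when the reading was taken.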

To check that the view is correct open a VQL Shell (Tools->VQL Shell in the top menu) and run the following query (make sure that you are retrieving all rows):

SELECT * FROM
dstutorial.iv_meter_reading_05_date_building_weather
WHERE site_id = 0
AND EXTRACT(MONTH FROM timestamp_in) = 2;

It must return 245273 records and 41 columns in about 45 seconds. Moreover, visually check that the fields coming from iv_weather are mostly non-missing. We will do a more in-depth missing-values analysis later on.

Then run the following query:

SELECT COUNT(*) FROM
dstutorial.iv_meter_reading_05_date_building_weather
WHERE site_id = 0;

You should get 3336742 as the result in about 4 minutes.

Join with holiday data

To create the second join transformation, do the following:

  1. Select the folder 10_prepared_data_for_ml/staging, open its contextual menu, then click New->Join.
  2. Drag and drop the views iv_meter_reading_05_date_building_weather and iv_holidays_state_cd in the work area.
  3. Click on the join icon for view iv_meter_reading_05_date_building_weather and drag the generated arrow on top of iv_holidays_state_cd.
  4. In the generated column lists at the bottom, link location to state_cd and timestamp_in to date.
  5. Go to the Join Conditions panel, change the join type to LEFT OUTER, then click on edit and substitute the condition with the following one:
TRIM(SUBSTR(TRIM(location),(LEN(TRIM(location))-2),4)) = state_cd AND FORMATDATE('yyyy-MM-dd', timestamp_in) = date
  6. Go to the Output panel, rename the view to iv_meter_reading_07_date_building_weather_holidays and remove the field iv_holidays_state_cd.state_cd.
  7. Rename the field date to is_holiday.
  8. Set building_id, meter and timestamp_in as primary keys (each field will show the primary key icon once defined).
  9. Save.
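The location-to-state-code matching in the join condition above can be approximated in Python. This is only a sketch: the helper name is made up, and Denodo's SUBSTR indexing differs slightly from Python slicing. It assumes, as the TRIM/SUBSTR/LEN expression does, that location ends with a two-letter state code:

```python
def extract_state_cd(location):
    """Rough analogue of the join condition's
    TRIM(SUBSTR(TRIM(location), LEN(TRIM(location)) - 2, 4)) expression:
    trim the location, take its trailing characters, and trim again to
    obtain the two-letter state code used to match iv_holidays_state_cd."""
    return location.strip()[-2:].strip()

extract_state_cd("Orlando, FL")  # 'FL'
```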

To check that the view is correct open a VQL Shell (Tools->VQL Shell in the top menu) and run the following queries (make sure that you are retrieving all rows):

SELECT count(*)
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays
WHERE site_id = 0;

It must return 3336742 records in about 4 minutes.

Then:

SELECT DISTINCT FORMATDATE('yyyy-MM-dd', timestamp_in), is_holiday, holiday_description
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays
WHERE site_id = 0
AND FORMATDATE('yyyy-MM', timestamp_in) = '2016-01';

You should get 31 rows and 3 columns in about 60 seconds. Visually check that is_holiday is empty for all days except 2016-01-01 (New Year's Day) and 2016-01-18 (Martin Luther King Jr. Day). This field will be transformed to a boolean flag later.

Preparation of the final view

Our data is almost ready to be fed into the machine learning algorithm. However, there are still some fields to be transformed.

  • floor_count: it is a text column, but it represents an integer.
  • year_built: we can compute the building age on this column as it is a more intuitive and meaningful predictor.
  • wind_direction: we can apply the cosine function to better represent the information that it carries.
  • timestamp_in: the target variable has a strong dependence on the hour of the day. Let us extract it, to let the learning algorithm leverage it.
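The intent of these transformations can be sketched in pandas before wiring them into VQL. The values below are illustrative, and 2016 stands in for GETYEAR(CURRENT_DATE()); as in the final view, the cosine is applied to the extracted hour as well:

```python
import math
import pandas as pd

# Illustrative rows mimicking the joined view's raw fields.
df = pd.DataFrame({
    "floor_count": ["3", None],          # text column holding an integer
    "year_built": [1975, 2001],
    "wind_direction": [180, 90],         # degrees
    "timestamp_in": pd.to_datetime(["2016-06-01 14:00", "2016-06-01 15:00"]),
})

df["floor_count"] = pd.to_numeric(df["floor_count"])     # text -> numeric
df["building_age"] = 2016 - df["year_built"]             # age, not year
df["wind_direction"] = df["wind_direction"].map(
    lambda d: math.cos(math.radians(d)))                 # circular -> cosine
df["reading_hour"] = df["timestamp_in"].dt.hour.map(
    lambda h: math.cos(math.radians(h)))                 # hour-of-day feature
```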

Let us apply all this transformation logic in a new selection view. As we have been defining the views graphically so far, we are going to create it in the VQL Shell, the interactive editor and command line environment embedded in the Web Design Studio. To this end, we will build and then execute the appropriate CREATE VIEW statement.

To start with, let us select iv_meter_reading_07_date_building_weather_holidays, then open its contextual menu and click New->VQL Shell->Select.

Let us apply the following modifications to the displayed SELECT clause:

  • Overwrite wind_direction with the following expression:
    COS(RADIANS(CAST('int',wind_direction))) AS wind_direction
  • Transform is_holiday to a boolean with the following expression:
    CASE WHEN (is_holiday IS null) THEN false ELSE true END is_holiday
  • Create the column is_weekend with the boolean form of column weekend, that can be dropped:
    CASE WHEN(iv_meter_reading_07_date_building_weather_holidays.weekend = 'Y') THEN true ELSE false END is_weekend
  • Calculate building_age with the following expression (year_built can be dropped afterwards):
    (GETYEAR(CURRENT_DATE())-year_built) AS building_age
  • Calculate reading_hour with the following expression:
    COS(RADIANS(GETHOUR(timestamp_in))) AS reading_hour
  • Cast all the columns of type decimal to float
  • Prefix the SELECT statement with:
    CREATE OR REPLACE VIEW meter_reading_final FOLDER='/10_prepared_data_for_ml' PRIMARY KEY('building_id', 'meter', 'timestamp_in') AS

The final CREATE VIEW statement will look like the following:


CREATE OR REPLACE VIEW meter_reading_final FOLDER='/10_prepared_data_for_ml'
PRIMARY KEY ( 'building_id' , 'meter' , 'timestamp_in' )
AS SELECT
timestamp_in,
building_id,
meter,
site_id,
istrain,
CAST('float',meter_reading) AS meter_reading,
day_name,
CASE WHEN (weekend = 'Y') THEN true ELSE false END is_weekend,
primary_use,
square_feet,
(GETYEAR(CURRENT_DATE())-year_built) AS building_age,
CAST('int',floor_count) AS floor_count,
location,
county,
state,
climate_zone,
location_confirmed,
CAST('float',air_temperature) AS air_temperature,
CAST('float',cloud_coverage) AS cloud_coverage,
CAST('float',dew_temperature) AS dew_temperature,
CAST('float',precip_depth_1_hr) AS precip_depth_1_hr,
CAST('float',sea_level_pressure) AS sea_level_pressure,
COS(RADIANS(CAST('int',wind_direction))) AS wind_direction,
CAST('float',wind_speed) AS wind_speed,
CAST('float',monthly_dp01) AS monthly_dp01,
CAST('float',monthly_dp10) AS monthly_dp10,
CAST('float',monthly_dsnw) AS monthly_dsnw,
CAST('float',monthly_emsn) AS monthly_emsn,
CAST('float',monthly_emxp) AS monthly_emxp,
CAST('float',monthly_prcp) AS monthly_prcp,
CAST('float',monthly_snow) AS monthly_snow,
CAST('float',daily_prcp) AS daily_prcp,
CAST('float',daily_snow) AS daily_snow,
CAST('float',daily_tmin) AS daily_tmin,
CAST('float',daily_tmax) AS daily_tmax,
CAST('float',daily_snwd) AS daily_snwd,
CAST('float',daily_tobs) AS daily_tobs,
CAST('float',daily_awnd) AS daily_awnd,
CAST('float',daily_wsf2) AS daily_wsf2,
CAST('float',daily_wsf5) AS daily_wsf5,
CAST('float',daily_tavg) AS daily_tavg,
CASE WHEN (is_holiday IS null ) THEN false ELSE true END is_holiday,
holiday_description,
COS(RADIANS(GETHOUR(timestamp_in))) AS reading_hour
FROM dstutorial.iv_meter_reading_07_date_building_weather_holidays;

Before running it, make sure you are connected to the database dstutorial. Run it, then refresh the Server Explorer (select the database dstutorial, open its contextual menu and click Refresh); you'll find the view meter_reading_final in the folder 10_prepared_data_for_ml.

A final step is needed: as meter_reading_final is the view to be exposed to the machine learning environment, we would like to protect the backend data sources from excessive load while ensuring an acceptable data extraction time. For this reason, we will make it compulsory to specify a site by defining a view parameter.

To do that:

  1. Open the view meter_reading_final.
  2. Go to EDIT->Model Selection->Edit view parameters.
  3. Define a new view parameter called input_site_id of type int, click Add, then click Ok.

  4. Go to the Where Conditions menu and add the condition site_id = input_site_id.
  5. Save.

Test the view. To this end, run in a VQL Shell the following query (make sure you are retrieving all rows):

SELECT count(*)
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0;

You should get 3336742 as the result in about 4 minutes.

If the obtained result is OK, you can run a second query in the VQL Shell (check the Retrieve all rows box):

SELECT *
FROM dstutorial.meter_reading_final
WHERE input_site_id = 0
AND building_id = 10
AND EXTRACT(MONTH FROM timestamp_in) = 6
AND EXTRACT(YEAR FROM timestamp_in) = 2017 ;

You should obtain 720 rows and 45 columns in a little less than 2 minutes. Take the time to check that the applied transformations work as expected. You can scroll right and down in the data preview panel to see all the fields.

Notice that in these queries we specified a value for parameter input_site_id. If a value for this parameter is not specified, the query will fail with a No search methods ready to be run error. This is the expected behaviour.
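Client code can guard against that error up front by refusing to build a query without a site. This is a hypothetical sketch; the helper is illustrative and not part of any Denodo API:

```python
def build_site_query(site_id):
    """Build the count query for meter_reading_final, mirroring the
    compulsory view parameter: fail fast when no site is given rather
    than letting Denodo reject the query at execution time."""
    if site_id is None:
        raise ValueError("input_site_id is required")
    return ("SELECT count(*) FROM dstutorial.meter_reading_final "
            f"WHERE input_site_id = {int(site_id)}")
```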

It is now time to wrap this part up: we have prepared the data for the machine learning algorithm training by enriching the meter readings table with hourly weather data, GHCN weather data and holidays-related data. All these transformations have been logically defined in the Web Design Studio, meaning that we have generated just the VQL code that implements our modeling logic, with no data replication. When executing the query, Denodo will find the best execution plan and get the results on the fly from the data sources.

It is interesting to open the Tree View of the final view, meter_reading_final, to visually grasp the complexity of the data transformations that the input data flows through. We recommend spending some time reviewing it, collapsing and expanding nodes as needed. Note that zoom in/zoom out and export functionality is also available.

Let us move on to learn how to build and run our Machine Learning validation, training and model persistence code in Apache Zeppelin for Denodo, using mainstream Python data science libraries, such as pandas and scikit-learn.