Task 3 - Exploratory Data Analysis with Apache Zeppelin for Denodo
The focus of this part is exploratory data analysis, an essential preparatory step before training the machine learning algorithm.
The Denodo Platform includes a notebook-like application, Apache Zeppelin for Denodo, that enables data scientists to build and run their data queries, perform advanced data analysis, and plot the results with ease.
In Apache Zeppelin for Denodo, you can alternate markdown-formatted text with executable code, using one of the available interpreters (among them the Denodo interpreter itself, Python, R, and Spark). This means that you can define your query in a code cell, run it, graphically build a plot from its results, and then write down notes that explain what you did and your conclusions, to be shared with peers or accessed at a later time.
In this part you will:
- Sign into Apache Zeppelin for Denodo.
- Go through the Exploratory Data Analysis notebook that we have already created.
- Create and run your own VQL queries that leverage the predefined Denodo interpreter.
- Build and customize visualizations using built-in plotting capabilities.
At the end of this part you will have a clear understanding of the data preparation steps that you need to complete to get the data ready for ingestion by the machine learning algorithm.
Let's get started by opening the Apache Zeppelin for Denodo URL in your browser.

Username: dstutorial_usr
Password: dstutorial_usr
In the Welcome page, you will see that some notebooks have been defined.
The notebook that we are going to use for this part is 01-Exploratory-Data-Analysis (if you have doubts, open 01-Exploratory-Data-Analysis-Completed to have a look at a working solution).
Click on 01-Exploratory-Data-Analysis to open it.

Now take the time to explore the main functionalities of the application:
- The notebook is organized in paragraphs (or cells). Each paragraph is associated with an interpreter, which is used to execute the paragraph's content.
- The interpreter for a paragraph is defined by putting the interpreter's name, prefixed with the percent (%) sign, on the paragraph's first line. For example, %denodo indicates that you can write and execute VQL code in that cell.
- Several actions can be triggered from the top menu: among them, running all the cells of the notebook, exporting it, and cloning it.
- Cells can be executed by clicking the run button in the top-right menu of the paragraph. Other options are available in the same menu; you can display them by clicking the settings icon.
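For example, a paragraph like the following runs a VQL query through the Denodo interpreter (the selected fields and the LIMIT clause are just illustrative):

```
%denodo
-- Preview a few rows of the meter readings view through the Denodo interpreter.
SELECT building_id, meter, timestamp, meter_reading
FROM iv_meter_reading
LIMIT 10
```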
Now that you are more familiar with the Zeppelin UI, you can start reading the notebook, executing the queries against the Denodo server, and analyzing the results.
The following short screen recording shows how to run a query, visualize the results in table format and transform the table into a simple line chart.

In this notebook we have explored the data available for the machine learning algorithm.
Thanks to Zeppelin's interactive data analysis capabilities, we now understand the data much better, both from a technical point of view (data types, volumes, join keys) and from a data science perspective (distribution of input variables, correlation with target variables, ...). The analysis can be extended as desired with new queries and/or visualizations.
Let us stress that we performed our analysis against a single source, the Denodo server, which abstracts the underlying data sources, and with a single interpreter, the Denodo SQL-like (VQL) interpreter, whose engine chooses the most performant execution plan for each query.
At this point, let's summarize the data preparation steps needed to get the data ready to be fed to the ML algorithm:
- Join the meter readings data with the time-related data
  - join condition: datepart(iv_meter_reading.timestamp) = iv_dim_date.d_date
  - variables to be kept: see the table below
  - Note: this join view has already been prepared as iv_meter_reading_01_date and must be used for the later joins.

column name | comment | transformations |
---|---|---|
d_day_name | Change name, remove prefix d_ | to be one-hot-encoded |
d_weekend | Change name, remove prefix d_ | to be one-hot-encoded |
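As a reference, the paragraph below sketches this first join in VQL. It is a minimal sketch: the prepared view iv_meter_reading_01_date already implements this step, and the use of FORMATDATE to extract the date part of the timestamp is an assumption.

```
%denodo
-- Sketch of the date join (already implemented by iv_meter_reading_01_date).
SELECT mr.*,
       dd.d_day_name AS day_name,  -- renamed: prefix d_ removed
       dd.d_weekend AS weekend     -- renamed: prefix d_ removed
FROM iv_meter_reading mr
JOIN iv_dim_date dd
  ON FORMATDATE('yyyy-MM-dd', mr.timestamp) = FORMATDATE('yyyy-MM-dd', dd.d_date)
```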
- Join the meter readings data with the building data
  - join condition: iv_building_building_location_01.building_id = iv_meter_reading_01_date.building_id
  - join type: LEFT
  - variables to be kept: see the table below
  - Note: this join view has already been prepared as iv_meter_reading_03_date_building and must be used for the later joins.

column name | comment | transformations |
---|---|---|
site_id | to be used in join | |
building_id | informative variable | |
timestamp_in | to be used in join | |
meter | informative variable | |
row_id | informative variable | |
istrain | separate train and test data | |
square_feet | predictive variable | to be transformed to number |
primary_use | predictive variable | to be one-hot-encoded |
year_built | useful to create age, a potential predictive variable | to be transformed to number |
floor_count | predictive variable | |
location_confirmed | used in filtering validated sites | |
state | used in filtering validated sites | |
meter_reading | target variable | |
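In VQL, this step could be sketched as follows. This is only a sketch: the prepared view iv_meter_reading_03_date_building already implements it, and the field list is abridged to the variables in the table above.

```
%denodo
-- Sketch of the building join (already implemented by iv_meter_reading_03_date_building).
SELECT mr.row_id, mr.building_id, mr.timestamp_in, mr.meter, mr.istrain,
       mr.meter_reading,
       b.site_id, b.square_feet, b.primary_use, b.year_built, b.floor_count,
       b.location_confirmed, b.state
FROM iv_meter_reading_01_date mr
LEFT JOIN iv_building_building_location_01 b
  ON b.building_id = mr.building_id
```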
- Join the meter readings data with the hourly weather data
  - join conditions: iv_weather.site_id = iv_meter_reading_03.site_id AND previous hour(iv_weather.timestamp) = iv_meter_reading_03.timestamp
  - join type: LEFT
  - variables to be kept: see the table below
  - output view: iv_meter_reading_05_date_building_weather

column name | comment | transformations |
---|---|---|
air_temperature | | |
cloud_coverage | | |
dew_temperature | | |
precip_depth_1_hr | | |
sea_level_pressure | | |
wind_direction | | cosine function |
wind_speed | | |
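The "previous hour" condition can be expressed by shifting the weather timestamp one hour forward, as in the sketch below. Assumptions here: the ADDHOUR function for the shift and timestamp_in as the reading timestamp; the actual output of this step is the prepared view iv_meter_reading_05_date_building_weather.

```
%denodo
-- Sketch of the hourly weather join (output: iv_meter_reading_05_date_building_weather).
SELECT mr.*,
       w.air_temperature, w.cloud_coverage, w.dew_temperature,
       w.precip_depth_1_hr, w.sea_level_pressure,
       w.wind_direction,  -- to be transformed with a cosine function
       w.wind_speed
FROM iv_meter_reading_03_date_building mr
LEFT JOIN iv_weather w
  ON w.site_id = mr.site_id
 AND ADDHOUR(w.timestamp, 1) = mr.timestamp_in  -- match the previous hour's weather
```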
- Join the meter readings data with the GHCN weather data (this join is going to be merged with the previous one)
  - join conditions with the daily GHCN files: iv_meter_reading_03_date_building.site_id = iv_all_sites_weather_daily_grp_by_site.site_id AND datepart(iv_meter_reading_03_date_building.timestamp_in) = previous day(iv_all_sites_weather_daily_grp_by_site.date)
  - join conditions with the monthly GHCN files: iv_meter_reading_03_date_building.site_id = iv_all_sites_weather_monthly_grp_by_site.site_id AND monthpart(iv_meter_reading_03_date_building.timestamp_in) = previous month(iv_all_sites_weather_monthly_grp_by_site.date)
  - join type: LEFT
  - variables to be kept: see the table below

column name | comment | transformations |
---|---|---|
daily_grp_by_site.prcp | | |
daily_grp_by_site.snow | | |
daily_grp_by_site.tmin | | |
daily_grp_by_site.tmax | | |
daily_grp_by_site.snwd | | |
daily_grp_by_site.tobs | | |
daily_grp_by_site.awnd | | |
daily_grp_by_site.wsf2 | | |
daily_grp_by_site.wsf5 | | |
daily_grp_by_site.tavg | | |
monthly_grp_by_site.dp01 | | |
monthly_grp_by_site.dp10 | | |
monthly_grp_by_site.dsnw | | |
monthly_grp_by_site.emsn | | |
monthly_grp_by_site.emxp | | |
monthly_grp_by_site.prcp | | |
monthly_grp_by_site.snow | | |
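The daily branch of this join could be sketched as below; the monthly branch follows the same pattern at month granularity. The ADDDAY shift and the FORMATDATE-based date comparison are assumptions.

```
%denodo
-- Sketch of the daily GHCN join; the monthly join is analogous.
SELECT mr.*,
       d.prcp, d.snow, d.tmin, d.tmax, d.snwd,
       d.tobs, d.awnd, d.wsf2, d.wsf5, d.tavg
FROM iv_meter_reading_03_date_building mr
LEFT JOIN iv_all_sites_weather_daily_grp_by_site d
  ON d.site_id = mr.site_id
 -- match each reading with the previous day's aggregates
 AND FORMATDATE('yyyy-MM-dd', mr.timestamp_in) = FORMATDATE('yyyy-MM-dd', ADDDAY(d.date, 1))
```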
- Join the training and test data with the holidays data
  - join keys: last two characters of (iv_meter_reading_05_date_building_weather.location) = iv_holidays_state_cd.state
  - join type: LEFT
  - variables to be kept: see the table below
  - output view: iv_meter_reading_07_date_building_weather_holidays

column name | comment | transformations |
---|---|---|
is_holiday | CASE WHEN iv_holidays.date = FORMATDATE('yyyy-MM-dd', iv_meter_reading_05_date_building_weather.timestamp_in) THEN true ELSE false END | |
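A sketch of this last join and of the is_holiday flag is shown below. The SUBSTR/LEN expression for the last two characters of the location and the date field on the holidays view are assumptions (check the exact indexing semantics in the VQL function reference); the actual output of this step is iv_meter_reading_07_date_building_weather_holidays.

```
%denodo
-- Sketch of the holidays join (output: iv_meter_reading_07_date_building_weather_holidays).
SELECT mrw.*,
       CASE WHEN h.date = FORMATDATE('yyyy-MM-dd', mrw.timestamp_in)
            THEN true ELSE false END AS is_holiday
FROM iv_meter_reading_05_date_building_weather mrw
LEFT JOIN iv_holidays_state_cd h
  -- last two characters of the location (SUBSTR indexing is an assumption)
  ON SUBSTR(mrw.location, LEN(mrw.location) - 2) = h.state
```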
To summarize, our action plan is to construct three joins:
- The fact table iv_meter_reading_03_date_building with the hourly weather data, iv_weather.
- The resulting view with the GHCN daily and monthly data.
- The resulting view with the holidays data.
Along the way, we will apply the field transformations and any other operations that may be needed.
These tasks will be accomplished in the Web Design Studio, which we already know from a previous chapter, so let's head back to that application to get the work done.