Task 3 - Exploratory Data Analysis with Apache Zeppelin for Denodo

Data Science and Machine Learning

The focus of this part is on exploratory data analysis, an essential preparatory activity for the training of the learning algorithm.

The Denodo Platform includes a notebook-like application, Apache Zeppelin for Denodo, that enables the Data Scientists to build and run their data queries, perform advanced data analysis and easily and nicely plot the results.

In Apache Zeppelin for Denodo, you can alternate markdown-formatted text with executable code, by using one of the available interpreters (among them, the Denodo interpreter itself, Python, R and Spark): this implies that you can define your query in a code cell, run it and graphically build a plot with its results and then write down your notes that would explain what you did and/or your conclusions to be shared with peers or to be accessed at a later time.

In this part you will:

  • Sign into Apache Zeppelin for Denodo.
  • Go through the Exploratory Data Analysis notebook that we have already created.
  • Create and run your own VQL queries, that leverage the predefined Denodo interpreter.
  • Build and customize visualizations using built-in plotting capabilities.

At the end of this part you will have a clear understanding of the data preparation steps that you need to complete to get the data ready for machine learning algorithm ingestion.

Let's get started by opening the Apache Zeppelin for Denodo url with your browser.

Username: dstutorial_usr
Password: dstutorial_usr

In the Welcome page, you will see that some notebooks have been defined.

The notebook that we are going to use for this part is 01-Exploratory-Data-Analysis (you can use 01-Exploratory-Data-Analysis-Completed if you have doubts and have a look at a working solution).

Click on 01-Exploratory-Data-Analysis to open it.

Now take the time to explore the main functionalities of the application:

  • The notebook is organized in paragraphs (or cells). An interpreter is associated with each paragraph and the paragraph is executable with the interpreter.
  • The definition of the interpreter for a paragraph is done by putting in the paragraph first line the name of the interpreter prefixed with the percent (%) sign, for example %denodo indicates that you can write and execute VQL code in that cell.
  • Several actions can be triggered in the top menu: among them, to run all the cells of the notebook, to export it and to clone it.
  • Cells can be executed by clicking in the top-far right menu. Other options are available in the menu, you can display them by clicking on .

Now, that you are more familiar with the Zeppelin UI, you can start reading the notebook, executing the queries against the Denodo server and analyse the results.

The following short screen recording shows how to run a query, visualize the results in table format and transform the table into a simple line chart.

In this notebook we have explored the data available for the Machine Learning algorithm.

Thanks to Zeppelin interactive data analysis capabilities, we now understand the data much better both from a technical point of view (data types, volumes, join keys) as well as purely from a data science perspective (distribution of input variables, correlation with target variables …). The analysis can be extended as desired with new queries and/or visualizations.

Let us stress out that we performed our analysis with a single source, the Denodo server, that is taking care of abstracting the underlying data sources and with a single interpreter, the Denodo SQL-like interpreter (VQL) and engine that chooses the most performant plan for queries execution.

At this time, let's summarize what the data preparation steps to get the data ready to be fed to the ML algorithm:

The code specified in the join conditions is pseudo-code. The VQL functions and expressions that implement the trasnformations will be specified in the exercise.

  • Join the meter readings data with the time-related data
    • join conditions: datepart(iv_meter_reading.timestamp) = iv_dim_date.d_date
    • variables to be kept:
    • column name comment transformations
      d_day_name Change name, remove prefix d_ to be one-hot-encoded
      d_weekend Change name, remove prefix d_ to be one-hot-encoded
    • Note: This join view has already been prepared as iv_meter_reading_01_date and must be used for later joins.
  • Join the meter readings data with the building data
    • join conditions: iv_building_building_location_01.building_id = iv_meter_reading_01_date.building_id
    • join type: LEFT
    • variables to be kept:
    • column name comment transformations
      site_id to be used in join
      building_id informative variable
      timestamp_in to be used in join
      meter informative variable
      row_id informative variable
      istrain separate train and test data
      square_feet predictive variable to be transformed to number
      primary_use predictive variable to be one-hot-encoded
      year_built useful to create age, a potential predictive variable to be transformed to number
      floor_count predictive variable
      location_confirmed used in filtering validated sites
      state used in filtering validated sites
      target variable
    • Note: This join view has already been prepared as iv_meter_reading_03_date_building and must be used for later joins.
  • Join the meter readings data with the hourly weather data
    • join conditions: iv_weather.site_id = iv_meter_reading_03.site_id AND previous hour(iv_weather.timestamp) = iv_meter_reading_03.timestamp
    • join type: LEFT
    • variables to be kept:
    • column name comment transformations
      wind_direction cosine function
    • output view: iv_meter_reading_05_date_building_weather
  • Join the meter readings data with the GHCN weather data (this join is going to be merged with the previous one)
    • join conditions with daily GHCN files: iv_meter_reading_03_date_building.site_id = iv_all_sites_weather_daily_grp_by_site.site_id AND datepart(iv_meter_reading_03_date_building.timestamp_in) = previous day(iv_all_sites_weather_daily_grp_by_site.date)
    • join conditions with monthly GHCN files: iv_meter_reading_03_date_building.site_id = iv_all_sites_weather_monthly_grp_by_site.site_id AND monthpart(iv_meter_reading_03_date_building.timestamp_in) = previous month(iv_all_sites_weather_daily_grp_by_site.date)
    • join type: LEFT
    • variables to be kept:
    • column name comment transformations
  • Join the training and test data with the holidays data
    • join keys: last two characters of(iv_meter_reading_05_date_building_weather.location) = iv_holidays_state_cd.state
    • join type: LEFT
    • variables to be kept:
    • column name comment transformations
      is_holiday CASE WHEN iv_holidays.date = FORMATDATE( 'yyyy-MM-dd&', iv_meter_reading_05_date_building_weather.timestamp_in) THEN true ELSE false
    • output view: iv_meter_reading_07_date_building_weather_holidays

To summarize, our action plan is to construct three joins:

  • The fact table iv_meter_readings_03 with hourly weather data, iv_weather.
  • The resulting view with GHCN daily and monthly data.
  • The resulting view with the holidays data.

Along the way, we are going to apply any field transformation and any other operation that may be needed.

These tasks will be accomplished in the Web Design Studio, that we already know from a previous chapter, so let's head back to that application to get the work done.