Introduction and Tasks Outline

Data Science and Machine Learning

Welcome to the Data Science and Machine Learning Tutorial!

In this tutorial you will build and deploy a machine learning model that predicts the energy consumption of buildings, with the help of several components of the Denodo Platform: the Data Catalog, the Web Design Studio, the Virtual DataPort Server, and the notebook environment Apache Zeppelin for Denodo. Most of the data and the scenario are taken from the ASHRAE - Great Energy Predictor III competition, which was hosted on the well-known data science contest site Kaggle in late 2019.

Denodo data virtualization can play a very important role in the development, management and deployment of data science and machine learning initiatives.

Starting with data exploration and analysis of all your internal and external data assets (Apache Zeppelin for Denodo and the Data Catalog), continuing through a powerful data modeling interface (the Web Design Studio) and an advanced, optimized query execution engine (the Virtual DataPort Server), and ending with a flexible data publication layer, Denodo can enter the equation in several phases of your machine learning projects.

Throughout this real-world use case, we hope that you will grasp the full potential of the Denodo platform in:

  • Graphically exploring the data, discovering associations, and labeling views in a business-friendly way in the Data Catalog.
  • Performing advanced data analysis with the Denodo query language or your preferred one (e.g., Python or R) in Apache Zeppelin for Denodo.
  • Logically cleaning, combining, and preparing data from heterogeneous data sources in the Web Design Studio.
  • Fetching data from Denodo views into your machine learning runtime for model validation, training, and persistence.
  • Building a prediction publication system that retrieves the predictor values consistently and in real time.

Tasks Outline

Task 1 - Data Modeling in Web Design Studio (part I)

  • First steps with Web Design Studio.
  • Build a data transformation flow in Denodo: data source abstraction, data modeling tools and techniques.
  • Some considerations on the Denodo optimizer: branch pruning.

Task 2 - Data Tagging And Exploration With Data Catalog

  • Create a category structure that organizes the views in a business-friendly manner.
  • Create a tag to group together all the views related to the project.
  • Associate a category and a tag to one or several views.
  • Build a query that combines different views.

Task 3 - Exploratory Data Analysis in Zeppelin for Denodo

  • First steps with Apache Zeppelin for Denodo.
  • Developing and running VQL queries against the Virtual DataPort Server.
  • Display the results in tabular format or in charts.
  • Understand the data model and sketch an action plan for integrating the data to prepare the training table.

Task 4 - Data Modeling with Web Design Studio (part II)

  • In the Web Design Studio, create the joins between the tables needed to prepare the data for algorithm ingestion.

Task 5 - Machine Learning Modeling: Validation, Training and Persistence

  • Import data from Denodo into Python.
  • Use mainstream Python analytical libraries to perform feature engineering, model validation, and training, as well as a grid search.
  • Persist the model to disk for prediction publication.
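
The Task 5 steps above might be sketched in Python as follows. Everything here is illustrative rather than the tutorial's actual code: the connection string, the view name `training_data`, the target column `meter_reading`, and the choice of estimator are all assumptions. Denodo's ODBC access speaks the PostgreSQL protocol, so a PostgreSQL driver such as psycopg2 is one common way to connect (adapt to your setup).

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split


def fetch_training_data(conn_str, view="training_data"):
    """Fetch the prepared training view from Denodo.

    Assumes the Virtual DataPort ODBC port and a PostgreSQL-compatible
    driver; the view name is hypothetical.
    """
    import psycopg2  # assumed driver; any PG-compatible driver works

    with psycopg2.connect(conn_str) as conn:
        return pd.read_sql(f"SELECT * FROM {view}", conn)


def train_and_persist(df, target="meter_reading", model_path="model.joblib"):
    """Validate, grid-search, train, and persist a regression model."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
    # Small illustrative grid search over tree depth
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=50, random_state=0),
        param_grid={"max_depth": [4, 8]},
        cv=3,
    )
    search.fit(X_train, y_train)
    score = search.score(X_val, y_val)  # R^2 on the held-out split
    # Persist the best model to disk for the prediction service (Task 6)
    joblib.dump(search.best_estimator_, model_path)
    return search.best_estimator_, score
```

The split into `fetch_training_data` and `train_and_persist` keeps the Denodo connection separate from the modeling code, so the same training function can be reused against any DataFrame.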

Task 6 - Making a Prediction Data Service

  • Understand and test the Python-based machine learning prediction service.
  • Build a data transformation flow in the Web Design Studio that joins the view on the Python-based prediction service back with the main data preparation flow built in Task 4.
  • Create a REST data service that serves the predictions to consuming applications in JSON format with some mandatory parameters.
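
As a rough sketch of what such a Python-based prediction service could look like (not the tutorial's actual implementation), the example below uses Flask; the endpoint path, feature names, and model path are all hypothetical, and the model is the one persisted to disk in Task 5.

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request


def create_app(model, features):
    """Build a minimal prediction service around a persisted model.

    `features` lists the mandatory query parameters, one per predictor
    column expected by the model.
    """
    app = Flask(__name__)

    @app.route("/predict")
    def predict():
        try:
            # Each mandatory parameter maps to one predictor column
            row = {f: float(request.args[f]) for f in features}
        except KeyError as missing:
            return jsonify(error=f"missing mandatory parameter {missing}"), 400
        prediction = model.predict(pd.DataFrame([row]))[0]
        return jsonify(prediction=float(prediction))

    return app


if __name__ == "__main__":
    # Model path and predictor names are assumptions for illustration
    app = create_app(
        joblib.load("model.joblib"),
        ["square_feet", "air_temperature"],
    )
    app.run(port=5000)
```

Because `create_app` takes the model as an argument, the service can be exercised in tests with a stub model, and Denodo can then wrap the running endpoint as a data source and join its output back into the data preparation flow.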

Conclusion