Task 5 - Machine Learning Modeling: Validation, Training and Persistence

Data Science and Machine Learning

In the previous parts of the tutorial, we have seen how different Denodo Platform components support the typical phases of a Machine Learning project: advanced data exploration and tagging in the Data Catalog (Task 2) and Apache Zeppelin for Denodo (Task 3), as well as data integration in the Web Design Studio (Tasks 1 and 4).

In particular, in Task 4 we prepared our final training view, called meter_reading_final, through a pipeline that adds potential predictors from the building metadata table (size, number of floors, …), hourly, daily and monthly weather data at each building's location, and information about holidays.
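To make this concrete, here is a minimal sketch of that kind of feature preparation in pandas, using made-up sample data. The column names square_feet and floor_count are illustrative stand-ins for the building metadata fields; the actual columns of meter_reading_final may differ.

```python
import pandas as pd

# Toy meter readings, keyed by building and timestamp.
readings = pd.DataFrame({
    "building_id": [1, 1, 2],
    "timestamp_in": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-02 12:00"]
    ),
    "meter_reading": [12.3, 11.8, 45.0],
})

# Derive calendar features from the timestamp.
readings["hour"] = readings["timestamp_in"].dt.hour
readings["weekday"] = readings["timestamp_in"].dt.dayofweek
readings["month"] = readings["timestamp_in"].dt.month

# Toy building metadata (size, number of floors, ...).
metadata = pd.DataFrame({
    "building_id": [1, 2],
    "square_feet": [50000, 120000],
    "floor_count": [3, 10],
})

# Join the metadata onto the readings, as the Task 4 pipeline does in Denodo.
features = readings.merge(metadata, on="building_id", how="left")
```

In the tutorial this join is performed inside Denodo itself, so the notebook receives a single denormalized view rather than assembling it in Python.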

We are now moving back to Apache Zeppelin for Denodo for the core Machine Learning tasks:

  • Data ingestion from the Denodo view.
  • Final steps of feature engineering.
  • Model choice, cross-validation and training.
  • Model persistence on disk.
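As a self-contained illustration of the first step, the snippet below reads a table into a pandas DataFrame with pd.read_sql. In the notebook the query would run against the Denodo view (typically over a JDBC/ODBC connection); here an in-memory SQLite database stands in so the example can run anywhere, and the table name and columns are only placeholders mirroring the view.

```python
import sqlite3

import pandas as pd

# SQLite stands in for the Denodo connection in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meter_reading_final "
    "(building_id INTEGER, timestamp_in TEXT, meter_reading REAL)"
)
conn.executemany(
    "INSERT INTO meter_reading_final VALUES (?, ?, ?)",
    [(1, "2016-01-01 00:00:00", 12.3), (2, "2016-01-01 00:00:00", 45.0)],
)

# Pull the view into a DataFrame, parsing the timestamp column.
df = pd.read_sql(
    "SELECT * FROM meter_reading_final", conn, parse_dates=["timestamp_in"]
)
```

The same pd.read_sql call works unchanged once conn points at the real data source instead of the SQLite stand-in.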

Open the Apache Zeppelin URL in your browser, then log in with dstutorial_usr as both user name and password.

In the notebook list, click on 02-ML-Algorithm-Training to open it.

Now you can start reading the notebook and executing the code cells. If you need to go back to a known working version, you can switch to the notebook 02-ML-Algorithm-Training-Completed.

We use the Python interpreter exclusively in this notebook. If you need to reinitialize the runtime environment, open the interpreter menu in the top-right corner of the notebook and restart the Python interpreter.

Please refer to Task 3 in this guide for a quick tour of the UI main functionalities.

Some screenshots are reported here for reference.

If the notebook has run successfully up to the final cell, the prediction model is now saved on disk, along with other complementary objects, in the /tmp/ folder.

You are invited to change some model parameters, or modify/remove/add features, and check whether this has an impact on the evaluation metrics. You can also change the model itself, as there are many alternatives that may work better in this scenario, or make the cross-validation and grid search more or less granular by modifying the parameter list and/or the candidate values. Apache Zeppelin for Denodo is an ideal environment for this kind of task, which is part of the day-to-day activity of a machine learning engineer: you can run these experiments interactively, alternating code with comments and charts so that a colleague can easily follow your workflow.
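The kind of grid search described above can be sketched with scikit-learn's GridSearchCV on synthetic data; the model class and parameter grid here are only examples, not necessarily the ones used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Candidate hyperparameter values: widening or narrowing these lists
# makes the search more or less granular.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_model = search.best_estimator_
```

Swapping in another estimator (gradient boosting, a linear model, ...) only requires changing the first argument and the grid, which is what makes this loop convenient for interactive experimentation.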

The final stage of the tutorial will show how Denodo can distribute your model's predictions through a REST data service. Client applications can then ignore the details of the algorithm (i.e., its predictors) and retrieve the predictions they need via variables they are aware of, namely building_id and timestamp_in.