In this article we explain the role of the Denodo Platform in the typical workflow of Data Science and Machine Learning projects. We show how a data virtualization layer in the data architecture can enhance and speed up the typical tasks of a Data Scientist, such as data exploration, data preparation and machine learning model tuning, as well as those of a Data Engineer, who must clean up and formalize data preparation code for production and ensure that the predictions of the Machine Learning algorithms are easily and efficiently consumed by end users.
At a high level, once the business problem is set, the stages of a Data Science project are the following:
Image 1: Typical Data Science and Machine Learning workflow
The sequence is not linear and iterations are very common in real-world projects, but let's go through each stage and understand the value added by the virtual layer that the Denodo Platform provides.
In the early stages of a Data Science and ML project, the Business Analyst typically collaborates with the Data Scientist to identify a use case that brings measurable business value and is also feasible from a technical point of view. To explore and formulate real business cases, we need to identify the data sources to be used for training the Machine Learning algorithm. These data sources may be internal, such as the Data Lake or the Data Warehouse, or external, accessible for example through data services.
If the data sources are present in the virtual layer that Denodo provides, the task of finding the right data and understanding the relationship between the various data entities is greatly simplified as the data is made available in a single location and can be queried with SQL or REST calls no matter the protocol and format of the underlying data source.
If an interesting data source is not yet present in the Denodo virtual layer, defining it is quick: we map the needed metadata (field names, data types, ...) and start exploring it. Keep in mind that since Denodo does not replicate any data, the only cost of this operation is storing the data source metadata in the virtual layer, which typically has a small storage footprint.
To simplify the identification of useful data, the Denodo Platform includes the Data Catalog, a web application that allows Business Users and Data Scientists to access, query and find associations between all the data assets defined in the virtual layer.
These assets can be classified with tags and categories, and queries can be created graphically, saved and then shared with peers for collaboration. In addition, the Data Catalog offers data lineage capabilities and keyword-based search over metadata and data (the latter must be indexed beforehand).
Image 2: Keyword-based search in the Data Catalog
Image 3: Data Lineage in the Data Catalog
At this point there are two possible scenarios, depending on whether the data source we want to get the data from is already available in Denodo.
The data source is not defined in Denodo yet
We need to define the data source in Denodo. We do that in a graphical wizard where we specify the connection protocol and the connection parameters (host name, database or schema name, user or other authentication parameters, ...). Once the data source is defined, we can navigate its objects and import the metadata of those we need, creating what we call base views.
Image 4: Data Source creation in Design Studio
The data source is already defined in Denodo
If the data source is already available, we can start working on data transformation tasks in Denodo. As an advanced SQL engine, Denodo supports table-level transformations, such as joins, unions, selections and filters, as well as column-level transformations, including format and type changes for data coming from delimited text files, flattening of hierarchical data structures (such as those typically found in JSON files), or simply applying a function to a field, for example to extract the month from a date field.
Denodo has a rich library of arithmetic, text-processing and date and time manipulation functions that can be extended with custom functions if needed. While graphical wizards are recommended in general, developers also have an advanced code editor called the VQL Shell that lets them write syntax-highlighted VQL statements (VQL being the Denodo SQL extension), execute them interactively and get both the result set and other relevant technical information, such as the execution trace.
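To make the column-level transformations above concrete, here is a minimal Python sketch of the same two operations, flattening a hierarchical JSON structure and deriving a month from a date field. The record and field names are hypothetical; in Denodo these steps would be expressed in VQL or through the graphical wizards.

```python
import json
from datetime import date

# Hypothetical sample record, mimicking a hierarchical JSON document
raw = '{"order_id": 1, "customer": {"name": "Acme", "country": "US"}, "order_date": "2023-04-15"}'

def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into a single level of columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

row = flatten(json.loads(raw))
# Column-level transformation: derive a new feature from a date field
row["order_month"] = date.fromisoformat(row["order_date"]).month
print(row)
```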
Image 5: Using the VQL Shell in Design Studio
All these data preparation and modeling tasks are done in the Design Studio, a web application targeted at data modelers and developers that serves as a centralized development environment.
We have seen so far that we can explore the organization of data assets with the Data Catalog and perform data transformations and modeling in the Design Studio. Now that we understand what is available and how the various tables are related, we can start exploring and analyzing the data to understand important aspects such as volumes, statistical distributions and data quality issues. We can do that by leveraging Apache Zeppelin for Denodo, a component of the platform that offers a web-based interactive computing environment with one-click built-in plotting capabilities.
One of the advantages of Apache Zeppelin is that it supports many interpreters (Spark, Python and R, to name a few of the most popular), even in the same notebook, so you can use your preferred one. A natural choice for querying a Denodo virtual database is the built-in Denodo interpreter, with which you can write and run VQL queries and get results just as any JDBC client would.
Image 6: Data Exploration in Apache Zeppelin for Denodo
Since the goal of exploratory data analysis is to gain detailed knowledge of the data sets through visualizations, Zeppelin is the right environment for it: we can write our own queries and customize the output as we wish. By querying the virtual layer that Denodo exposes, we stay focused on understanding the meaning of the data, while the Denodo optimizer takes care of executing our queries in the best way, federating and delegating to backend databases as needed.
At the end of the data exploration step, we will know how the various data sets can be joined (the join keys and their transformations, if needed), which data quality issues exist and which are most urgent to fix, and which field transformations we can apply to obtain new features or to convert existing ones into a format suitable for the Machine Learning algorithm.
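The kind of lightweight profiling described above, volumes, distributions and null counts, can be sketched in a notebook cell. Here the result set that a VQL query against the virtual layer would return is simulated with a small in-memory sample; column names and values are hypothetical.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical sample rows, standing in for the result set of a VQL query
rows = [
    {"store": "S1", "region": "North", "sales": 120.0},
    {"store": "S2", "region": "North", "sales": None},
    {"store": "S3", "region": "South", "sales": 95.5},
    {"store": "S4", "region": "South", "sales": 103.0},
]

# Volume
print("rows:", len(rows))

# Null counts per column: a simple data quality check
nulls = {col: sum(1 for r in rows if r[col] is None) for col in rows[0]}
print("nulls:", nulls)

# Distribution of a categorical field
print("regions:", Counter(r["region"] for r in rows))

# Numeric summary over non-null values
sales = [r["sales"] for r in rows if r["sales"] is not None]
print("mean:", round(mean(sales), 2), "median:", median(sales))
```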
Now that we have in-depth knowledge of the data sets, we can proceed with the final step before training the ML algorithm: preparing the training and test datasets. We will need to join the union of the training and test datasets, which we may see as the fact table, with all the dimensions containing fields with potential predictive power, that is, fields that may help the algorithm better model the target variable.
We can do that in the Denodo Design Studio, which we already used earlier in the workflow to connect to data sources and transform input data. There we can apply all the needed table-level transformations (joins, selections, aggregations, unions, ...) and operate at field level too, thanks to a rich and extensible function library.
None of the modeling steps we define will trigger any data loading from the source data sets: data is retrieved and transformed dynamically at runtime.
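Conceptually, enriching the fact table with dimension fields is an ordinary join. As an illustration, here is the same operation in plain Python over hypothetical in-memory tables; in Denodo it would be defined once as a virtual join view and executed at runtime.

```python
# Hypothetical fact table: the union of training and test observations
fact = [
    {"store_id": 1, "date": "2023-01-01", "sales": 100.0},
    {"store_id": 2, "date": "2023-01-01", "sales": 80.0},
]

# Hypothetical dimension with fields of potential predictive power
store_dim = {
    1: {"region": "North", "surface_m2": 450},
    2: {"region": "South", "surface_m2": 300},
}

# Join the fact table with the dimension to enrich each observation
dataset = [{**row, **store_dim[row["store_id"]]} for row in fact]
print(dataset)
```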
Image 7: Graphical join creation in Design Studio
At the end of this step, you’ll have the training and test datasets ready to be imported into your preferred Machine Learning framework.
Our training and test datasets are now prepared as Denodo virtual views, ready to be fed into the Machine Learning algorithms that we want to tune and compare.
We may distinguish two types of machine learning frameworks based on the amount of data we are dealing with: single-machine frameworks, such as scikit-learn, for small to medium-sized data sets that fit in a single machine's memory, and distributed frameworks, such as Spark MLlib, for datasets that do not.
In the single-machine case, we retrieve the training data from the appropriate Denodo virtual view via ODBC or JDBC and load it into memory; once loaded, we do not need to hit Denodo again unless we want to refresh the data. If we are using Python for algorithm tuning and training, there are various options for retrieving data from Denodo into Python (see the reference below for more information).
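Once the rows have been fetched into memory (for example, from a JDBC/ODBC cursor), a reproducible train/test split can be done entirely client-side. This sketch uses hypothetical rows in place of a real fetch; the split logic itself is standard.

```python
import random

# Hypothetical rows, standing in for data fetched from a Denodo virtual
# view through an ODBC/JDBC cursor (e.g. the result of fetchall())
rows = [(i, float(i * 10)) for i in range(100)]

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle once with a fixed seed, then split in memory:
    no need to hit the Denodo source again afterwards."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(rows)
print(len(train), len(test))
```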
In this setup, we will have two options:
Image 8: Graphical wizard to create a remote table in Denodo Design Studio
As in the data exploration step, no matter the framework, we can build the entire Machine Learning pipeline, from data ingestion to model tuning through cross-validation and grid search, in Apache Zeppelin for Denodo, which supports the most popular interpreters in the Data Science ecosystem (Python, R and Spark, to mention a few) in addition to the default Denodo interpreter.
Image 9: Visualizing Training and Validation Errors in Apache Zeppelin for Denodo
Once the best ML algorithm and its optimal parameters are identified, we refit it on the entire training dataset and push it to production. At this stage we have a fitted ML algorithm able to compute predictions from the same input features it was trained on; we now need to make these predictions available to end users and applications.
One of the goals of defining a virtual layer is to offer data consumers simple, uniform and efficient access to the data sources. The same goal applies in Data Science workflows. Once our ML algorithm is fitted, we should design and implement the system that gives access to its predictions. We may think of two different access patterns, and Denodo helps with both.
The ML algorithm is published as a REST web service that takes the values of all the input variables the algorithm was trained on and returns the prediction. Since the final applications and users do not know all these input variables (there may be a huge number of them, and the set can vary over time), we can leverage the fact that all the data transformation steps have already been created in Denodo. We define a data source on this web service; a second web service, published by Denodo, receives the search parameters sent by the application, retrieves the input variables, passes them to the ML web service and delivers the prediction back to the application.
As an example, imagine that our ML algorithm is trained to forecast the sales of the stores of a large retail company. A local sales manager wants the forecast for her stores, so she queries the Denodo web service with a location identifier (a city, a region, ...) and the date range she is interested in. The Denodo engine receives the request, passes the input variables to the ML web service, gets the predictions and delivers them to the end user, in our case the local sales manager.
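The orchestration described above can be sketched as three small functions: resolve the input variables from the user's search parameters, delegate to the ML service, and return the prediction. All names, features and the toy model below are hypothetical stand-ins for the real Denodo views and REST endpoints.

```python
# Stand-in for the feature views defined in Denodo, keyed by the
# search parameters the end user knows (store and month)
FEATURES = {
    ("store_42", "2023-07"): {"surface_m2": 450, "region": "North", "promo": 1},
}

def ml_service_predict(features: dict) -> float:
    """Stand-in for the REST-published ML model: a toy linear formula."""
    return 1000.0 + 50.0 * features["promo"] + 0.5 * features["surface_m2"]

def denodo_prediction_service(store_id: str, month: str) -> float:
    """Receive search parameters, resolve the model's input variables,
    delegate to the ML service and deliver the prediction back."""
    features = FEATURES[(store_id, month)]
    return ml_service_predict(features)

print(denodo_prediction_service("store_42", "2023-07"))
```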
Image 10: Architecture of a prediction publication system with a RESTful web service with data virtualization
Image 11: Querying the Denodo prediction web service from a browser
While the on-demand access sketched above allows applications to get the ML algorithm's predictions in real time, in some cases you need to retrieve a large portion of the predictions in a single query, for example when auditing the quality of past predictions.
In this case we need to persist the algorithm's predictions and make them available to the users who want to work on them. To this end, we can use Denodo's built-in capability to create remote tables in the data sources it knows about: this feature lets us persist the predictions efficiently and, at the same time, keep them virtualized, so that we can combine them with other views already defined in the virtual layer and finally publish them to end users. If we choose this path, we write the predictions table to a temporary file (most commonly a delimited file, though other popular formats such as JSON or XML are also supported), define a base view on it and then create a remote table from a query on this view.
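The first step of that path, writing the predictions to a temporary delimited file, might look like this in Python (the column names and values are hypothetical). The base view and remote table would then be defined on this file in Denodo.

```python
import csv
import os
import tempfile

# Hypothetical predictions to persist before creating the remote table
predictions = [
    {"store_id": 1, "date": "2023-07-01", "forecast": 1275.0},
    {"store_id": 2, "date": "2023-07-01", "forecast": 980.5},
]

# Write the predictions to a temporary delimited file; a Denodo base view
# would then be defined on it and a remote table created from a query on it
path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["store_id", "date", "forecast"])
    writer.writeheader()
    writer.writerows(predictions)

with open(path) as f:
    print(f.read())
```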
Remote tables are available for all JDBC data sources and can be configured to use the sources' respective bulk load utilities for data loading. This means Denodo can load data into remote tables efficiently even for large volumes, including SQL-on-Hadoop systems such as Spark, Impala or Presto as targets.
In this article, we have given an overview of the role of the Denodo Platform in Data Science and Machine Learning projects.
We would like to summarize the key takeaways here: