
Introduction

In this article we will explain the role of the Denodo Platform in the typical workflow of Data Science and Machine Learning projects. We will show how the presence of a data virtualization layer in the data architecture can enhance and speed up some of the most common tasks of a Data Scientist, such as data exploration, data preparation and the tuning of machine learning models, as well as the tasks of a Data Engineer, who must address challenges like cleaning and formalizing data preparation code for production, and ensuring that the predictions of the Machine Learning algorithms are easily and efficiently consumed by end users.

At a high level, once the business problem is defined, the stages of a Data Science project are the following:

  • Stage 1: Identification of relevant data
  • Stage 2: Data connection and format adaptation
  • Stage 3: Data exploration and analysis
  • Stage 4: Data preparation for Machine Learning algorithms (feature engineering)
  • Stage 5: Machine Learning algorithms tuning, deployment and training
  • Stage 6: Results publication and sharing with business users

Image 1: Typical Data Science and Machine Learning workflow

The sequence is not linear, and iterations are very common in real-world projects. In this document we will go through each stage and understand the value that the virtual layer provided by the Denodo Platform brings to it.

Identification of relevant data

In the early stages of a Data Science and ML project, the Business Analyst collaborates with the Data Scientist to identify a use case that brings measurable business value while remaining feasible from a technical implementation standpoint. To investigate and develop genuine business scenarios, it is necessary to identify the data sources to be used for training the Machine Learning algorithm. These data sources may be internal, such as a Data Lake or a Data Warehouse, or external, accessible through data services.

Whenever the data sources are present in the virtual layer that Denodo provides, the task of finding the right data, and understanding the relationship between the various data entities, is greatly simplified, as the data is made available in one single location, and can be queried with SQL or REST calls, regardless of the protocol and format of the underlying data source.

If a potentially interesting data source is not yet present in the Denodo virtual layer, incorporating its definition is a quick and easy process: only the needed metadata (field names, data types, …) is mapped in order to start exploring it. It is important to remark that Denodo does not replicate any data; the only cost of this operation is storing the data source metadata in the virtual layer, which has a minimal storage footprint.

To simplify the identification of useful data, the Denodo Platform includes the Data Catalog, a web application that allows Business Users and Data Scientists to access, query and find associations between all the data assets defined in the virtual layer.

These assets can be classified through tags and categories, and queries can be created graphically, saved and then shared with peers for collaboration. In addition, the Data Catalog offers data lineage capabilities and enables users to perform keyword-based searches on metadata and on data content, the latter provided that the content has been indexed beforehand.

Image 2: Keyword-based search in the Data Catalog

Image 3: Data Lineage in the Data Catalog

Data connection and format adaptation

There are two possible scenarios at this point, depending on whether the data source we want to get the data from is already available or not.

The data source is not defined in Denodo yet 

We need to define the data source in Denodo: we will do that in a graphical wizard, where we specify the connection protocol and the connection parameters (host name, database or schema name, user or other authentication parameters, etc.). Once the data source is defined, we will be able to navigate through the data source objects and import the metadata of those that we need, creating what Denodo calls a base view.

Image 4: Data Source creation in Design Studio

The data source is already defined in Denodo 

If the data source is already available within Denodo, we are ready to start working on data transformation tasks. Denodo supports both table-level transformations, such as joins, unions, selections and filters, and column-level transformations, including format and type changes of data coming from delimited text files, flattening of hierarchical data structures (for example, those found in JSON files), or simply applying functions to a field, for example to extract the month from a date.

In addition, Denodo is equipped with a rich library of arithmetic, text processing and date and time manipulation functions, which can be further extended with custom functions if needed. While graphical wizards are recommended in general, developers also have an advanced code editor called the VQL Shell, which enables them to write syntax-highlighted VQL (Virtual Query Language, the SQL-like declarative language used within Denodo) sentences, execute them interactively, and get both the result set and other relevant technical information, such as the execution trace.

Image 5: Using the VQL Shell in Design Studio

All these data preparation and modeling tasks will be done in Denodo Design Studio, a web application targeted at data modelers and developers that serves as the centralized development environment.

Data exploration and analysis

So far we have seen that we can explore the organization of data assets with the Denodo Data Catalog and perform data transformations and modeling in Denodo Design Studio. Once we understand what is available and what the relationships between the various entities are, we can move on to data exploration and analysis, to determine the most relevant aspects of the data in our hands, such as quantitative volumes, statistical distribution of the data sets, and data quality issues. We can do that by leveraging Apache Zeppelin for Denodo, a component of the platform that offers a web-based interactive computing environment with one-click built-in plotting capabilities.

One of the advantages of Apache Zeppelin is its out-of-the-box support for many interpreters (Spark, Python and R, just to name a few among the most popular), which can be combined within the same notebook, so Data Scientists and developers can use the ones they prefer. A built-in Denodo interpreter is also included, making VQL the natural choice for querying a Denodo virtual database: notebook users can write and run VQL queries and get their results exactly as any JDBC client would, alongside cells written for the other interpreters of their choice.

Image 6: Data Exploration in Apache Zeppelin for Denodo

Since one of the main goals of Exploratory Data Analysis (EDA) is to gain deep knowledge of the data sets through different perspectives and visualizations, Zeppelin is the right tool for it. We will be able to write our own queries and customize the output as we wish. By querying the semantic virtual layer that Denodo exposes, we can focus on understanding the meaning of the data in our environment, while the Denodo optimizer takes care of executing our queries in the best way, federating and delegating to backend databases as needed.
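For instance, a quick profiling pass can be run from a notebook cell (or any Python environment) against the virtual layer. The following is a minimal sketch that assumes a pre-configured ODBC DSN named "denodo" and a hypothetical sales view iv_sales; the connection details, view and column names are illustrative, not Denodo defaults.

# Minimal profiling sketch against a Denodo view over a pre-configured
# ODBC DSN ("denodo"); the view iv_sales and its columns are hypothetical.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=denodo", autocommit=True)

# Row count and basic statistics of a numeric column, computed by Denodo
profile = pd.read_sql(
    "SELECT COUNT(*) AS rows_total, "
    "       AVG(sales_amount) AS avg_sales, "
    "       MIN(sale_date) AS first_date, "
    "       MAX(sale_date) AS last_date "
    "FROM iv_sales",
    conn,
)
print(profile)

# Distribution of a categorical column, ready to plot in the notebook
by_region = pd.read_sql(
    "SELECT region, COUNT(*) AS n FROM iv_sales GROUP BY region ORDER BY n DESC",
    conn,
)
by_region.plot(kind="bar", x="region", y="n")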

At the end of the data exploration phase, we will have gained a critical understanding of the nature of the various data sets and determined how they can best be combined (for instance, which join keys to use and which transformations they need), which data quality issues pose the greatest risk to the final results and are therefore worth fixing, and which field transformations we can apply to obtain new features, or to get the existing ones into a format suitable for the Machine Learning algorithm.

Data preparation for Machine Learning supervised algorithms

In supervised machine learning, where a training phase is necessary, it's crucial to prepare both the training and test datasets. Combining these datasets could be as easy as unioning them, essentially creating a comprehensive fact table. This table incorporates all relevant dimensions, each containing fields with potential predictive power. These fields assist the algorithm in effectively modeling the target variable.

We can do that in Denodo Design Studio, which we already used earlier in the workflow to connect to data sources and transform input data as needed. In Design Studio we can apply all the needed table-level transformations (joins, selections, aggregations, unions, etc.) and operate at the field level too, thanks to a rich and extensible function library.

None of the modeling steps we define will trigger any data loading operation from the source data sets: data will be retrieved and transformed dynamically at runtime.

Image 7: Graphical join creation in Design Studio

At the end of this phase, you’ll have the training and test datasets ready to be used by your preferred Machine Learning framework.
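As an illustration, a common way to prepare these datasets is to assemble the feature view by joining fact and dimension data and then split it with a simple time-based filter; in Design Studio this would normally be built graphically as derived views. The sketch below shows the kind of queries that could sit behind two hypothetical derived views, dv_sales_train and dv_sales_test; all view, column and function names are illustrative (GETYEAR and GETMONTH follow Denodo's VQL date functions, but the exact library should be checked in the documentation).

# Hypothetical queries behind two derived views built in Design Studio:
# join fact and dimension data, derive date fields, split by time period.
feature_query = """
SELECT s.store_id,
       GETYEAR(s.sale_date)  AS order_year,
       GETMONTH(s.sale_date) AS order_month,
       st.region,
       s.promo_flag,
       s.sales_amount
FROM   iv_sales s
JOIN   iv_stores st ON s.store_id = st.store_id
"""

train_query = feature_query + "WHERE GETYEAR(s.sale_date) <= 2022"  # dv_sales_train
test_query  = feature_query + "WHERE GETYEAR(s.sale_date)  = 2023"  # dv_sales_test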

Machine Learning algorithm tuning, deployment and training

Our training and test datasets are prepared as Denodo virtual views, and ready to be fed into the Machine Learning algorithms targeted for tuning and benchmarking.

We may consider two types of machine learning frameworks, depending on the amount of data we are dealing with: single-machine frameworks, such as scikit-learn, for small to medium-sized data sets that fit in the memory of a single machine; or distributed frameworks, such as Spark MLlib, for datasets that don't.

Single-machine frameworks:

In this case we will need to retrieve the training data from the appropriate Denodo virtual view and load it in memory via the ODBC or JDBC protocols; once the data is loaded in memory, we will not need to hit Denodo again, unless we want to refresh the data. In case we are using Python for algorithm tuning and training, there are various options to retrieve data from Denodo into Python (see the references below for more information).
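As a minimal sketch, assuming Python with pyodbc and the same hypothetical DSN and derived views used above, loading the prepared datasets and fitting a first single-machine model could look like this; the feature and target columns are illustrative.

# Sketch: load the prepared training and test views from Denodo over ODBC
# (DSN "denodo" and the view/column names are hypothetical) and fit a
# first single-machine model with scikit-learn.
import pyodbc
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

conn = pyodbc.connect("DSN=denodo", autocommit=True)
train = pd.read_sql("SELECT * FROM dv_sales_train", conn)
test = pd.read_sql("SELECT * FROM dv_sales_test", conn)

features = ["store_id", "order_month", "promo_flag"]  # illustrative predictors
target = "sales_amount"

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(train[features], train[target])

# From here on everything runs in memory; Denodo is only queried again
# if we decide to refresh the data.
print("R^2 on the hold-out set:", model.score(test[features], test[target]))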

Distributed frameworks:

In this setup, we will have two options:

  • The first option is to use Denodo to create a remote table from the derived view in your Massively Parallel Processing (MPP) cluster (a Spark cluster, for instance), and then point your Machine Learning tuning code directly at that data set; for loading data into remote tables, Denodo uses the native Bulk Load API of the target database. In particular, for SQL-on-Hadoop engines it creates Parquet files in chunks and loads them in parallel.

Image 8: Graphical wizard to create a remote table in Denodo Design Studio

  • The second option consists of querying the appropriate Denodo view directly, thus leveraging Denodo's native optimization capabilities of MPP query acceleration and cache co-location to maximize performance (a minimal PySpark sketch of both options follows this list). More information and examples about these capabilities can be found in the references section. In short, MPP query acceleration refers to the ability of the Denodo optimizer to move workload to available MPP systems to improve performance, while cache co-location allows Denodo to maximize query pushdown when combining cached data with other datasets in the same data source. The two techniques can be combined for even greater performance benefits.
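The following PySpark sketch illustrates both options under stated assumptions: the catalog table name, the Denodo JDBC URL, driver class, credentials and view name are placeholders and depend on your environment and driver version.

# Option 1: the training data has been materialized by Denodo as a remote
# table in the cluster's catalog (the table name is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denodo-ml").getOrCreate()
train_remote = spark.table("analytics.sales_features_train")

# Option 2: query the Denodo view directly over JDBC and let the Denodo
# optimizer (MPP query acceleration, cache co-location) do the heavy lifting.
# The URL format, driver class, credentials and view name are placeholders.
train_virtual = (
    spark.read.format("jdbc")
    .option("url", "jdbc:denodo://denodo-host:9999/ml_project")
    .option("driver", "com.denodo.vdp.jdbc.Driver")
    .option("dbtable", "dv_sales_train")
    .option("user", "ds_user")
    .option("password", "********")
    .load()
)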

Similar to the data exploration phase, irrespective of the framework chosen, we can construct our Machine Learning pipeline in Apache Zeppelin for Denodo. This pipeline encompasses various stages from data ingestion to model tuning, including cross-validation and grid search. Apache Zeppelin supports a wide array of interpreters commonly used in the Data Science ecosystem, such as Python, R, and Spark, in addition to the default Denodo interpreter.
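Continuing the single-machine sketch above (it reuses the train and test DataFrames, features and target loaded from Denodo earlier), a minimal cross-validated grid search could look like this; the preprocessing step, estimator and parameter grid are illustrative choices, not prescriptions.

# Sketch: tune a model with cross-validation and grid search on the
# training set loaded earlier from Denodo; grid values are illustrative.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),
])

param_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [2, 3, 4],
    "model__learning_rate": [0.05, 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(train[features], train[target])

print("Best parameters:", search.best_params_)
print("Best cross-validation score:", search.best_score_)

# GridSearchCV refits the best model on the whole training set by default;
# evaluate it on the hold-out set.
print("Hold-out score (neg. MAE):", search.score(test[features], test[target]))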

Image 9: Visualizing Training and Validation Errors in Apache Zeppelin for Denodo

Once the best ML algorithm is determined and its optimal parameters are identified, we will need to refit it on the entire training dataset and push it to production. At this stage we have a fitted ML algorithm, able to compute predictions from the same input features it was trained on. The next step is to make these predictions available to end users and applications.

Publication and sharing of results with business users

One of the goals of defining a virtual layer is to offer data consumers simple, uniform and efficient access to the data sources. The same general goal applies in Data Science workflows. Once our ML algorithm is fitted, we should design and implement the access system for its predictions. We may think of two different access patterns, both of which Denodo will help you with.

On-demand access

The machine learning algorithm is made accessible through a REST web service, which accepts input values corresponding to the variables it was trained on and returns a prediction. Since the final applications and users do not know all these input variables (their number may be huge, and it can vary over time), we can leverage the fact that all the data transformation steps have already been created in Denodo. We define a data source on this web service, and a second web service, published by Denodo, will receive the search parameters sent by the application, retrieve the input variables, pass them to the ML web service, and deliver the prediction back to the application.

As an example, imagine that our ML algorithm is trained to forecast the sales of a store in a large retail company. The local sales manager is interested in knowing the forecast for her stores, so she queries the Denodo web service with a location identifier (it can be a city, a region, …) and the range of dates she is interested in. The Denodo engine receives the request, passes the input variables to the ML web service, gets the predictions and delivers them to the end user, in our case the local sales manager.
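The article does not prescribe how the ML web service itself is implemented; purely for illustration, a minimal sketch of the prediction endpoint that the Denodo data source could point to might look like the following, using Flask, a pickled model file and the hypothetical feature names from the earlier sketches.

# Sketch of the ML prediction service that the Denodo data source calls.
# The framework (Flask), endpoint, model file and feature names are
# illustrative assumptions, not part of the Denodo Platform.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("sales_model.pkl", "rb") as f:  # hypothetical fitted model
    model = pickle.load(f)

@app.route("/predict", methods=["GET"])
def predict():
    # Denodo forwards the input variables as query parameters
    features = [[
        float(request.args["store_id"]),
        float(request.args["order_month"]),
        float(request.args["promo_flag"]),
    ]]
    return jsonify({"forecast": float(model.predict(features)[0])})

if __name__ == "__main__":
    app.run(port=5000)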

Image 10: Architecture of a prediction publication system with a RESTful web service with data virtualization

Image 11: Querying the Denodo prediction web service from a browser

Batch access

While the on-demand access sketched above allows applications to get the ML algorithm's predictions in real time, there are cases where you need to retrieve a large set of predictions in a single query, for example when auditing the quality of past predictions.

In this case we need to persist the predictions of the algorithm and make them available to the users who may want to work on them. To this end, we can use the built-in Denodo capability to create remote tables in the data sources it is connected to. With this feature we can persist the predictions efficiently, combine them with other views already defined in the virtual layer, and finally publish them or make them available to end users. If we choose this approach, we will need to write the predictions table to a temporary file (the most common choice would be a delimited file, but many other popular formats are available, such as JSON or XML), define a base view on it, and then create a remote table from a query on this view.
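Reusing the hold-out DataFrame and the tuned model from the earlier sketches, writing the predictions to a delimited file (on which a Denodo base view, and then a remote table, can be defined) could be as simple as the following; the file path and column names are hypothetical.

# Sketch: persist batch predictions to a delimited file that Denodo can
# read through a delimited-file data source; path and columns are examples.
predictions = test[["store_id", "order_month"]].copy()
predictions["forecast"] = search.predict(test[features])

# Denodo can define a base view on this file and, from a query on that
# view, create a remote table in the target data source.
predictions.to_csv("/data/exchange/sales_forecasts.csv", index=False)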

Remote tables are available for all JDBC data sources and can be configured to use the respective bulk load utilities for data loading operations. This means that Denodo can load data into remote tables efficiently, even for large volumes, with SQL-on-Hadoop systems such as Spark, Impala or Presto as the target.

Conclusion

In this article, we have given an overview of the role of the Denodo Platform in Data Science and Machine Learning projects.

We would like to summarize here the key takeaways:

  • Data Scientists and Business Analysts, thanks to the Data Catalog, will be able to collaborate and explore all of their organization's data assets; create, run and share queries; and analyze table relationships and data lineage.
  • Data Scientists, thanks to Apache Zeppelin for Denodo, will be able to perform their interactive data analysis in the language they prefer: SQL (Denodo VQL), Python, R and Spark, just to mention the most popular. They will do that by hitting only one data source, the Denodo virtual database, thus eliminating time-consuming and complex multi-source connectivity configuration and code.
  • Data Scientists, in their preferred development environment, will be able to run their Machine Learning pipeline either fetching training data directly from Denodo, or from their distributed framework, with data efficiently loaded from a Denodo remote table.
  • Data Scientists and Data Engineers will enjoy the Denodo data modeling capabilities: familiar SQL syntax, table-level and field-level transformations, graphical modeling, visualization of data lineage and tree view of the data transformation flow.
  • Data Engineers will rapidly fulfill requests for new data availability thanks to the Denodo Platform's broad choice of connectors and the absence of data replication.
  • Data Engineers will optimize performance for highly interactive query patterns commonly encountered during data exploration using the advanced query optimizer available in the Denodo Platform.
  • Data Engineers will leverage the Denodo Platform's data publication capabilities to provide flexible and efficient access for data consumers to ML-algorithm predictions.

References

Query optimization in Denodo

Test Drive: Data Catalog and Data Science on AWS

Using Notebooks for Data Science with Denodo

Apache Zeppelin for Denodo - User Manual

How to connect to Denodo from Python - a starter for Data Scientists

Advanced Analytics and Machine Learning with Data Virtualization (Webinar)

Parallel Processing

How to configure MPP Query Acceleration in Denodo

Remote Tables

Disclaimer
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.
