Task 1 - Data Modeling in Web Design Studio (part I)

Data Science and Machine Learning

Across this tutorial, you will use the Web Design Studio for all the data preparation tasks.

In order to execute hands-on this tutorial, you need to provision a local execution environment as specified in the Environment Setup section.
Nevertheless, it is detailed step-by-step so can be followed just by reading it.

NOTE

If you prefer, you can use the Virtual Data Port Administration Tool whenever we use the Web Design Studio.

NOTE

The Web Design Studio is a web application that allows the developer to connect to data sources, define logical views to their data (base views in the Denodo nomenclature), transform them with relational operations (projections, unions, joins ...) and organize them in a meaningful way in databases and folder-based hierarchies. It is very similar in functionality to other popular SQL clients such as DBeaver, SQLDeveloper or MySQL Workbench. To access the Web Design Studio, open your browser and go to the following link:

http://denodo.dstutorial.com:9090/denodo-design-studio/?uri=//localhost:9999/#/

Login: dstutorial_usr
Password: dstutorial_usr

Once logged-in, In the left-hand side you will see the Server Explorer with two databases, dstutorial and dstutorial_sample_completed . You will work exclusively in the former and use the latter only if you want to go directly to the solution for a given task. If you click on the ( ) icon you will expand (collapse) the database. Do that for dstutorial and you’ll see that the folder structure is already created for you.

If you drill-down to 01_connectivity/01_data_sources you will see the following folders:

  • building: this folder stores the data source definition for the building locations (ds_building_location) and other building metadata (ds_building)
  • date: this folder stores the data source definition pointing to the holidays web service (ds_holidays)
  • meter_reading: this folder stores the data source definition for the meter readings, the target variable that we want to predict (ds_meter_reading)
  • weather: this folder stores:
    • the data source definition for weather data provided in the contest (ds_weather)
    • an additional folder, ghcn, that stores the data sources to files downloaded from the National Oceanic and Atmospheric Administration ds_site{00,02,04,13,15}_{monthly,daily}.

The same folder structure is found also under 01_connectivity/02_base_views, but instead of containing the data sources, the sub-folders contain the base views, that is metadata-only views on the data we are interested in.

If you open for example the view bv_meter_reading_test in folder /01_connectivity/02_base_views/meter_reading (you can open it with a double click), you’ll see the view definition (column names, types, descriptions), and Denodo-related metadata (creation and last modification time, folder, data source, database …).

One of the most basic actions you can perform on Denodo views is to see their data: you can do that by clicking on Execution panel () then Execute ():

The Denodo engine generates the SQL-like query (VQL, Virtual Query Language in Denodo), and sends to the datasource the corresponding query that enables it to provide the result set that you want to obtain in the virtual layer. The data source then sends back the results to Denodo, which displays a sample for the user.

Open the view bv_building_location, have a look at its metadata then execute it.

  • How many records does bv_building_location have?
  • What is the value of site_id corresponding to Charlottesville county?
EXERCISE

Once the data is abstracted in Denodo, that is a data source and a base view are defined, you can query it with a unified query language, the already introduced VQL, no matter the underlying data source supports SQL, partially or at all, or supports other types of query languages. Again with the bv_building_location view, we may be interested to know how many buildings per state are there. To do that, you can open a VQLShell (Tools->VQLShell) and write down your VQL query:

SELECT COUNT(*) AS n_buildings, state FROM bv_building_location GROUP BY state ORDER BY n_buildings DESC;

We should highlight that what we just showed is that we are analysing data in a remote Excel file using SQL-like syntax. That's powerful!

Let's now focus on integrating into the data preparation workflow the GHCN (Global Historical Climatology Network) data files. The ratio is that, even if we already have some weather information (to get an idea of what we already have you can open and run the views bv_weather_test and bv_weather_train), we may find that GHCN data can improve the accuracy of our prediction model by adding some more advanced meteorological indicator (ex. weather type, snowfall, snow depth) measured in the closest stations to the building and with two different aggregation periods (daily and monthly). We have downloaded the data from the NOAA website and stored them in CSV files in a SFTP server hsoted at sftp.dstutorial.com.

Out of the five sites for which the location is known (you can get this list by running: SELECT site_id,county FROM iv_building_location WHERE location_confirmed = 'Y' AND state = 'United States'), we just miss the monthly data for Cornell University (site 15).

To integrate this file in the data pipeline, we are going to perform the following steps:

  • Create a data source called ds_site15_monthly that points to a csv file.
  • Create a base view called bv_site15_monthly on top of ds_site15_monthly.
  • Create a derived view called iv_site15_monthly as a projection from bv_site15_monthly to add a field indicating the origin of the records, as we are going to append this data to all other sites data.
  • Create a derived view iv_site15_monthly_01_partitioned as a selection of iv_site15_monthly with filter site = '15'. This allows the Denodo optimizer to prune this branch when appropriate.
  • Add this view iv_site15_monthly_01_partitioned to the union view called iv_all_sites_weather_monthly that already contains the monthly data from the other sites.
  • Generate the statistics of iv_all_sites_weather_monthly. Discuss and assess data quality.
  • Create an aggregation view on iv_all_sites_weather_monthly called iv_all_sites_weather_monthly_grp_by_site. This view will be used in the subsequent steps of the data integration flow.

Let’s get started with the creation of the data source. Go to /01_connectivity/01_data_sources/weather/ghcn>/, click on , then New->Datasource->Delimited File.

Fill in the data source parameters:

Name ds_site15_monthly
Data route FTP / SFTP / FTPS Client
Login sftpuser
Password sftpuser
URL sftp://sftp.dstutorial.com//data/dstutorial/ghcn_weather/site15_monthly.csv
Column delimiter ,
Header YES
Ignore matching errors NO

After saving, test the connection. You should get a success message.

We are now able to create the base view bv_site15_monthly by clicking on Create Base View () in the datasource ds_site15_monthly just created. When doing so, Denodo creates a base view called ds_site15_monthly in the same folder as the datasource. Let’s modify its name to bv_site15_monthly and move it to /01_connectivity/02_base_views/weather/ghcn. To do that, click on edit and change the property View name to bv_site15_monthly. Also, starting from field cdsd downwards, change the type of all fields without the suffix _attributesto decimal. This is done in the View Schema panel, and changing the Field Type. When you are done, move to the Metadata tab and change the folder as specified above.

One last action is needed before saving, that is to compute the statistics of the view. The statistics of a view consist in a few important indicators about its data: number of records, minimum and maximum values for numerical fields, number of distinct values and number of missing values. These numbers are essential for the cost-based optimizer to identify the most efficient execution plan. It is in general recommended to compute and maintain statistics for all the base views, so let’s do that for bv_site15_monthly. In the view configuration panel, click on OPTIONS->Statistics, then Enable Statistics. All the fields are automatically selected, now click on Gather statistics for selected fields. The statistics are calculated and displayed.

Now, back to the SUMMARY tab, you can inspect the metadata and execute the view, as we did before.

Next step is to define a projection view, with the definition of a constant field site = '15'. Select the base view just created, then New->Selection. A new tab will be displayed for the projection view creation. Specify the following values for the required properties:

Name iv_site15_monthly
Folder /02_integration/weather/ghcn/monthly

To define a new field, go to the Output tab, click on New->New Field, then define the new field in the pop-up window.

Then save. The projection view is defined and executable.

In a very similar way, let's go on with the creation of view iv_site15_monthly_01_partitioned.

Select the view iv_site15_monthly then New->Selection. Now, no fields will be added; instead, we will define a filter on the newly created field, site. This filter is defined in the Where Conditions tab of the view.

In the Output tab you can set the name iv_site15_monthly_01_partitioned, then save.

It is time now to append the GHCN data from site 15, contained in the views just created, to the union that contains the data from all other sites.

This union view is called iv_all_sites_weather_monthly and is defined as the union of the following views:

  • iv_site00_monthly_01_partitioned
  • iv_site02_monthly_01_partitioned
  • iv_site04_monthly_01_partitioned
  • iv_site13_monthly_01_partitioned

Now, we are going to add iv_site15_monthly_01_partitioned to the union.

Let us open iv_all_sites_weather_monthly and move to the Model (extended union) panel, where we can drag and drop the newly created view iv_site15_monthly_01_partitioned. You'll notice that Denodo will automatically create the associations for fields with matching names. If a field does not exist in one of the input views, that will be nonetheless included in the union view, with missing values where appropriate.

After dragging and dropping the view in the model area, you can now save the modified union view.

We recommend now running a quick data check, that is a count on the union view that should return 6558 records and 71 columns.

To run this check, go to Execution Panel of the view iv_all_sites_weather_monthly, check the box Retrieve all rows, then execute the query:

As the final step of our GHCN files data transformation flow, we are going to aggregate the data in the view that we have just modified by site and date, with average as the aggregation measure. However, we are not blindly selecting all the numerical indicators as aggregation candidates. Instead, we will take into account only the indicators that show a percentage of missing values lower than a given threshold. The aggregation and numerical indicators selection operations are motivated mainly by two reasons:

  • Aggregation ensures preserving the cardinality of the training table, to be enriched with GHCN data. As we are going to join the GHCN data with the training table by site and date, we must take care of avoiding duplicates. Aggregating on these fields is the simplest solution to achieve that.
  • If you had a look at the data, you would have seen that there a lot of missing values in GHCN weather indicators. By aggregating and selecting the indicators with a low percent of missing values, we are hoping to increase the overall quality of the data while keeping the number of indicators as low as possible (low complexity). The intuitive justification would be that it is sufficient to have a data record from just a weather station, instead of requiring all or most of them to have a valid value to keep it.

To this end, let's calculate the statistics for iv_all_sites_weather_monthly. Open the view, then Options->Statistics. On the top-left corner you will enable the switch Enable Statistics and then launch the statistics gathering process by clicking on Gather statistics for selected fields

If you have a look at the last column, it seems that there are a lot of missing values for the numerical indicators, so that we shouldn't set a very low value for the cutoff missing percentage, let's say 70%. The indicators with at least 30% of not-null values will survive.

To get the list of these indicators, we can leverage a Denodo stored procedure called GET_VIEW_STATISTICS that enables access to views statistics via SQL syntax:

SELECT field_name FROM GET_VIEW_STATISTICS() WHERE input_database_name = 'dstutorial' AND input_name = 'iv_all_sites_weather_monthly' AND null_values/cast(rows_number as decimal) >0.7 AND field_name NOT LIKE '%_attributes' AND field_name NOT IN ('site', 'station','name' , 'latitude' , 'longitude', 'date', 'elevation') ORDER BY field_name

Copy and paste it in the VQLShell, then run it. It will return the following list of indicators:


dp01
dp10
dsnw
emsn
emxp
prcp
snow

To create the aggregation view, select the view iv_all_sites_weather_monthly, then New->Selection. In the wizard, go to the Group By tab and select the fields site and date. Then switch to Output tab, rename the view to iv_all_sites_weather_monthly_grp_by_site and create a new aggregate field (New->New Aggregate Field)

Repeat this action with all the fields in the list above. Your aggregated view will have a total of 9 columns.

Now, let us review what we have done so far. We have virtualized data of a specific building site coming from a remote CSV reachable via SFTP, we have then merged this data with the data coming from all other sites and applied an aggregation operation to put the data in a format that can be joined later with the rest of the data available for the analysis. The reasons for applying the aggregation are outlined above and all these data preparation steps are motivated by the hope of finding powerful predictors for the modeling task in GHCN data.

The data transformation flow that we have built can be seen graphically in the Tree View of the final aggregation view. To display it, open the view and click on Tree View(): in the displayed tree the nodes represent either views or transformations. The nodes can be expanded and collapsed and the tree can be zoomed and zoomed out.

The Tree View is available for all the views and it is a great way to visualize the data transformation logic in a single diagram.

One additional feature that we would like to highlight is the advanced query rewriting capabilities of the Denodo optimizer. More specifically, let us see how the Denodo engine rewrites the query in the case it includes a filter on specific sites, say site in (0,15) (these correspond to University of Central Florida, Orlando and Cornell University, Ithaca, respectively; recall that site-to-location mapping is always available in view iv_building_location). To do that let's prefix that query with the DESC QUERYPLAN instruction, that simply allows generating the execution plan without actually executing the query, and then inspect the plan generated.

Open a VQLShell, write and run the following query:

DESC QUERYPLAN SELECT * FROM iv_all_sites_weather_monthly_grp_by_site WHERE site IN (0,15);

If you click on the top node, you'll see that the Denodo optimizer has applied Branch Pruning. This optimization ensures that the files where the application of the condition site in (0,15) will surely filter out all the records, are not read at all, leading to better execution plans and thus better performance.

As an example compare the above execution plan and the execution plan of an equivalent not-prunable query

DESC QUERYPLAN SELECT * FROM dstutorial_sample_completed.iv_all_sites_weather_monthly_grp_by_site_noopt WHERE site IN (0,15);

We have now ended our data transformation flow that represented the missing branch to complete the GHCN monthly data transformation pipeline. We have accomplished that using exclusively the Web Design Studio, the platform interface for data modeling and development, and leveraging built-in graphical wizard for the different operations (selections, filter, union and aggregation). The transformation logic, that has been applied in stacked layers of derived views, can be graphically displayed at each transformation node by opening the corresponding Tree View. We have also mentioned the branch pruning optimization that the Denodo engine applies when some filtering conditions are met.

Let us now move on to the Data Catalog, a component of the Denodo platform that exposes all the views and web services defined in the Denodo virtual databases in a web application. In this web application business users will be able to graphically search, real-time query, combine and organize with tags and categories the organization data assets.