Throughout this tutorial, you will use the Web Design Studio for all the data preparation tasks.
The Web Design Studio is a web application that allows the developer to connect to data sources, define logical views over their data (base views in the Denodo nomenclature), transform them with relational operations (projections, unions, joins, ...), and organize them in a meaningful way into databases and folder-based hierarchies. It is very similar in functionality to other popular SQL clients such as DBeaver, SQL Developer, or MySQL Workbench. To access the Web Design Studio, open your browser and go to the following link:
Once logged in, on the left-hand side you will see the Server Explorer with two databases, dstutorial and dstutorial_sample_completed. You will work exclusively in the former and use the latter only if you want to go directly to the solution for a given task. Clicking on the expand (collapse) icon expands (collapses) a database. Do that for dstutorial and you'll see that the folder structure is already created for you.
If you drill down to 01_connectivity/01_data_sources you will see several folders: one for the building data, containing the data source with the building locations (ds_building_location) and other building metadata, and a folder ghcn that stores the data sources for the files downloaded from the National Oceanic and Atmospheric Administration (NOAA).
The same folder structure is also found under 01_connectivity/02_base_views, but instead of containing the data sources, the sub-folders contain the base views, that is, metadata-only views on the data we are interested in.
If you open, for example, the view bv_meter_reading_test in the folder /01_connectivity/02_base_views/meter_reading (you can open it with a double click), you'll see the view definition (column names, types, descriptions) and Denodo-related metadata (creation and last modification time, folder, data source, database ...).
One of the most basic actions you can perform on Denodo views is inspecting their data: you can do that by clicking on the Execution panel and then on Execute:
The Denodo engine generates the SQL-like query (VQL, Virtual Query Language, in the Denodo nomenclature) and sends the corresponding query to the data source, so that it can produce the result set to be exposed in the virtual layer. The data source then sends the results back to Denodo, which displays a sample to the user.
Once the data is abstracted in Denodo, that is, once a data source and a base view are defined, you can query it with a unified query language, the already introduced VQL, regardless of whether the underlying data source supports SQL fully, partially, or not at all, or uses another type of query language. Returning to the bv_building_location view, suppose we want to know how many buildings there are per state. To do that, open a VQL Shell (Tools->VQL Shell) and write your VQL query:
SELECT COUNT(*) AS n_buildings, state FROM bv_building_location GROUP BY state ORDER BY n_buildings DESC;
Note what we have just shown: we analysed data stored in a remote Excel file using SQL-like syntax. That's powerful!
Let's now focus on integrating the GHCN (Global Historical Climatology Network) data files into the data preparation workflow. The rationale is that, even though we already have some weather information (to get an idea of what is available, you can open and run views such as bv_weather_train), GHCN data may improve the accuracy of our prediction model by adding some more advanced meteorological indicators (e.g. weather type, snowfall, snow depth) measured in the stations closest to the buildings, with two different aggregation periods (daily and monthly). We have downloaded the data from the NOAA website and stored it in CSV files on an SFTP server.
Out of the five sites for which the location is known (you can get this list by running SELECT site_id, county FROM iv_building_location WHERE location_confirmed = 'Y' AND state = 'United States'), we are only missing the monthly data for Cornell University (site 15).
To integrate this file in the data pipeline, we are going to perform the following steps:
1. Create a data source ds_site15_monthly that points to a CSV file.
2. Create a base view bv_site15_monthly on top of it.
3. Create a derived view iv_site15_monthly as a projection from bv_site15_monthly that adds a field indicating the origin of the records, as we are going to append this data to the data from all other sites.
4. Create a view iv_site15_monthly_01_partitioned as a selection of iv_site15_monthly with the condition site = '15'. This allows the Denodo optimizer to prune this branch when appropriate.
5. Add iv_site15_monthly_01_partitioned to the union view called iv_all_sites_weather_monthly, which already contains the monthly data from the other sites.
6. Run a data check on iv_all_sites_weather_monthly. Discuss and assess data quality.
7. Create an aggregation view iv_all_sites_weather_monthly_grp_by_site. This view will be used in the subsequent steps of the data integration flow.
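For orientation, steps 3 and 4 correspond, in VQL terms, to derived-view definitions along these lines. This is only a sketch: in this tutorial the views are created through the graphical wizards, and the wizard generates the actual VQL.

```sql
-- Sketch only: the Web Design Studio wizards generate the real definitions.
-- Step 3: projection adding a constant field that records the origin of the rows.
CREATE OR REPLACE VIEW iv_site15_monthly AS
  SELECT *, '15' AS site
  FROM bv_site15_monthly;

-- Step 4: selection on the new field, enabling branch pruning in the optimizer.
CREATE OR REPLACE VIEW iv_site15_monthly_01_partitioned AS
  SELECT *
  FROM iv_site15_monthly
  WHERE site = '15';
```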
Let’s get started with the creation of the data source. Go to /01_connectivity/01_data_sources/weather/ghcn/, then click New->Datasource->Delimited File.
Fill in the data source parameters:
After saving, test the connection. You should get a success message.
We can now create the base view bv_site15_monthly by clicking on Create Base View in the data source ds_site15_monthly just created. When doing so, Denodo creates a base view called ds_site15_monthly in the same folder as the data source. Let's rename it to bv_site15_monthly and move it to /01_connectivity/02_base_views/weather/ghcn. To do that, click on Edit and change the View name property to bv_site15_monthly. Also, starting from the field cdsd downwards, change the type of all fields without the _attributes suffix to decimal. This is done in the View Schema panel by changing the Field Type. When you are done, move to the Metadata tab and change the folder as specified above.
One last action is needed before saving: computing the statistics of the view. The statistics of a view consist of a few important indicators about its data: number of records, minimum and maximum values for numerical fields, number of distinct values, and number of missing values. These numbers are essential for the cost-based optimizer to identify the most efficient execution plan. It is in general recommended to compute and maintain statistics for all base views, so let's do that for bv_site15_monthly. In the view configuration panel, click on OPTIONS->Statistics, then Enable Statistics. All the fields are automatically selected; now click on Gather statistics for selected fields. The statistics are calculated and displayed.
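If you prefer the command line, the gathered statistics can also be inspected from a VQL Shell with the GET_VIEW_STATISTICS stored procedure (used again later in this tutorial), assuming the view lives in the dstutorial database:

```sql
-- Inspect the freshly gathered statistics for bv_site15_monthly.
SELECT field_name, rows_number, null_values
FROM GET_VIEW_STATISTICS()
WHERE input_database_name = 'dstutorial'
  AND input_name = 'bv_site15_monthly';
```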
Now, back to the SUMMARY tab, you can inspect the metadata and execute the view, as we did before.
The next step is to define a projection view that adds a constant field site = '15'. Select the base view just created, then New->Selection. A new tab will be displayed for the creation of the projection view. Specify the following values for the required properties:
To define a new field, go to the Output tab, click on New->New Field, then define the new field in the pop-up window.
Then save. The projection view is defined and executable.
In a very similar way, let's continue with the creation of the view iv_site15_monthly_01_partitioned. Select the view iv_site15_monthly, then New->Selection. This time no fields will be added; instead, we will define a filter on the newly created field site. This filter is defined in the Where Conditions tab of the view. In the Output tab, set the name iv_site15_monthly_01_partitioned, then save.
It is time now to append the GHCN data from site 15, contained in the views just created, to the union that contains the data from all other sites.
This union view is called iv_all_sites_weather_monthly and is defined as the union of the following views:
Now, we are going to add iv_site15_monthly_01_partitioned to the union.
Let us open iv_all_sites_weather_monthly and move to the Model (extended union) panel, where we can drag and drop the newly created view iv_site15_monthly_01_partitioned. You'll notice that Denodo automatically creates the associations between fields with matching names. If a field does not exist in one of the input views, it will nonetheless be included in the union view, with missing values where appropriate.
After dragging and dropping the view in the model area, you can now save the modified union view.
We recommend running a quick data check now: a count on the union view, which should return 6558 records and 71 columns. To run this check, go to the Execution Panel of the view iv_all_sites_weather_monthly, check the box Retrieve all rows, then execute the query:
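A plain count over the union view is enough here; for example:

```sql
-- Should return 6558 if the union has been extended correctly.
SELECT COUNT(*) AS n_records
FROM iv_all_sites_weather_monthly;
```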
As the final step of our GHCN files data transformation flow, we are going to aggregate the data in the view we have just modified by site and date, using the average as the aggregation measure. However, we will not blindly select all the numerical indicators as aggregation candidates. Instead, we will only take into account the indicators whose percentage of missing values is lower than a given threshold. The aggregation and the selection of numerical indicators are motivated mainly by two reasons:
To this end, let's calculate the statistics for iv_all_sites_weather_monthly. Open the view, then Options->Statistics. In the top-left corner, enable the switch Enable Statistics, then launch the statistics gathering process by clicking on Gather statistics for selected fields.
If you have a look at the last column, you'll see that the numerical indicators have a lot of missing values, so we shouldn't set the cutoff on the missing-value percentage too low; let's set it to 70%. The indicators with at least 30% non-null values will survive.
To get the list of these indicators, we can leverage a Denodo stored procedure called GET_VIEW_STATISTICS that gives access to view statistics via SQL syntax:
SELECT field_name FROM GET_VIEW_STATISTICS() WHERE input_database_name = 'dstutorial' AND input_name = 'iv_all_sites_weather_monthly' AND null_values/CAST(rows_number AS decimal) < 0.7 AND field_name NOT LIKE '%_attributes' AND field_name NOT IN ('site', 'station', 'name', 'latitude', 'longitude', 'date', 'elevation') ORDER BY field_name
Copy and paste it into the VQL Shell, then run it. It will return the following list of indicators:
To create the aggregation view, select the view iv_all_sites_weather_monthly, then New->Selection. In the wizard, go to the Group By tab and select the fields site and date. Then switch to the Output tab, rename the view to iv_all_sites_weather_monthly_grp_by_site, and create a new aggregate field (New->New Aggregate Field). Repeat this action for all the fields in the list above. Your aggregated view will have a total of 9 columns.
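In VQL terms, the aggregation built by the wizard corresponds to a GROUP BY query along these lines. This is a sketch only: cdsd stands in for the surviving indicators, whose actual list comes from the statistics query shown earlier.

```sql
-- Sketch of the aggregation; repeat AVG(...) for each surviving indicator.
SELECT site, date, AVG(cdsd) AS cdsd
FROM iv_all_sites_weather_monthly
GROUP BY site, date;
```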
Now, let us review what we have done so far. We have virtualized the data of a specific building site, coming from a remote CSV file reachable via SFTP; we have then merged this data with the data from all the other sites and applied an aggregation to put the data in a format that can later be joined with the rest of the data available for the analysis. The reasons for applying the aggregation are outlined above, and all these data preparation steps are motivated by the hope of finding powerful predictors for the modeling task in the GHCN data.
The data transformation flow that we have built can be seen graphically in the Tree View of the final aggregation view. To display it, open the view and click on Tree View: in the displayed tree, the nodes represent either views or transformations. The nodes can be expanded and collapsed, and the tree can be zoomed in and out.
The Tree View is available for all the views and it is a great way to visualize the data transformation logic in a single diagram.
One additional feature that we would like to highlight is the advanced query rewriting capability of the Denodo optimizer. More specifically, let us see how the Denodo engine rewrites a query that includes a filter on specific sites, say site IN (0,15) (these correspond to the University of Central Florida, Orlando and Cornell University, Ithaca, respectively; recall that the site-to-location mapping is always available in the view iv_building_location). To do that, let's prefix the query with the DESC QUERYPLAN instruction, which generates the execution plan without actually executing the query, and then inspect the generated plan.
Open a VQLShell, write and run the following query:
DESC QUERYPLAN SELECT * FROM iv_all_sites_weather_monthly_grp_by_site WHERE site IN (0,15);
If you click on the top node, you'll see that the Denodo optimizer has applied Branch Pruning. This optimization ensures that the files for which the condition site IN (0,15) would filter out all the records are not read at all, leading to better execution plans and thus better performance.
As an example, compare the execution plan above with that of an equivalent, non-prunable query:
DESC QUERYPLAN SELECT * FROM dstutorial_sample_completed.iv_all_sites_weather_monthly_grp_by_site_noopt WHERE site IN (0,15);
We have now completed our data transformation flow, the missing branch of the GHCN monthly data transformation pipeline. We accomplished this using exclusively the Web Design Studio, the platform interface for data modeling and development, leveraging the built-in graphical wizards for the different operations (projection, filter, union, and aggregation). The transformation logic, applied in stacked layers of derived views, can be graphically displayed at each transformation node by opening the corresponding Tree View. We have also seen the branch pruning optimization that the Denodo engine applies when certain filtering conditions are met.
Let us now move on to the Data Catalog, the component of the Denodo Platform that exposes all the views and web services defined in the Denodo virtual databases through a web application. There, business users can graphically search, query in real time, combine, and organize the organization's data assets with tags and categories.