In the previous session we have walked through the creation of a data transformation flow to read and integrate the GHCN monthly data from a specific site.
The focus of this session is to understand on how we can leverage the Data Catalog to:
You start by signing in to the Data Catalog application. In your browser open the following url:
Then enter the username and password as specified below, then click on Sign In
Virtual DataPort Server:
localhost - dstutorial
As first step, define the categories
ghcn - monthly and
ghcn - daily with parent category
weather. In the top menu bar, click on Administration->Catalog Management->Categories, the Category Management window is opened. There are already two categories defined,
weather is the parent category of the ones to be defined.
Click on the New Category icon () then fill in the information as shown:
Proceed similarly with the category
ghcn - monthly.
At the end of this step you should have the following categories structure:
Now go on with the creation of a tag, called
predict-building-consumption. Click on Administration->Catalog Management->Tags
to open the Tag Management menu. Once there, click on the New Tag icon (), then fill in the information as shown below (as you can see the description can be formatted in HTML).
As no tags were already defined, the Tag Management menu should look as the following after the tag creation:
Now you can associate the views to the categories and tag. The process is the same for tags and categories, so we are showing how to complete it for a category and you can complete the remaining ones on your own.
Go back to the Category Management menu (Administration->Content Management->Categories) then click on the edit icon () of the category that you want to modify, then click on the button
Add from Database/Folders tree. A new window will be displayed in which you can search and select the items that you want to associate to the category: for example for category
ghcn - daily, looking for
daily allows you to browse for the relevant views and other objects.
You can associate the remaining category and tag with the following search logic:
ghcn - monthly: search
monthlyin search dialog and select all the matching views
predict-building-consumption: select everything in database
You can now use your categories and tags in searching your data assets. In the example below, we look for views with tag
ghcn - daily and the token
site00 in their name, fields names or descriptions.
As already mentioned, the Data Catalog not only allows you searching the data assets defined in the Denodo virtual layer, but enables graphically querying those assets with no SQL skills required.
As an example let's say that you want to extract the meter reading (
meter_reading) values, just for the first quarter 2016 and for electricity meter (
meter=0), with some building information such as the build year (
built_year) and the size (
square_feet). Notice that although the information is not in a single view, you will be able to do your extract thanks to the existence of view associations defined in the virtual layer that permit multi-views queries.
So, to build your query head to the Search panel, clean all the filtering criteria that may be already defined and then search for
iv_meter_reading. By clicking on the
iv_meter_reading view, you will be redirected to the view summary page, in which you will find the view description, all the tags and categories it may be associated with and the schema of the view with field names, descriptions and types. If you jump to the Relationships tab, you will see that this view has two relationships with views
iv_weather. You can display associations details by hovering on the arrow.
Relationships, also called Associations, play an important role in different contexts of data modeling in Denodo, among the benefits they bring is the fact of giving the ability to perform multi-view queries to the Data Catalog users.
You can now move on and build your query. Go to the Query tab, then:
square_feetin the Output Columns zone. Note that
iv_building, that you find in Relationship Fields area.
getyear(timestamp_in) = 2016 and getquarter(timestamp_in) = 1 and meter = 0
Your query tab should now look like the following:
It is quite handy to have the ability to inspect the VQL code (just click on the icon), for example for sharing with a Denodo developer or just to understand query details, such as the join key in multi-view queries. You can now run the query to display the results in the same browser tab or export it for further analysis in a third party tool. You will need to choose the export format (CSV, Excel and Tableau) as well as the number rows to be extracted.
Once you have exported the data, click on Save() and give the query meaningful name and description.
There are several actions that you may want to do now on the saved query:
You will now deploy the view to the Virtual DataPort Server. To do that, click on the My Query button on the top menu bar. In the Query Explorer, locate the target query and click on .
In the deploy menu fill in the required details (Virtual DataPort Server, credentials, view name and description and folder), then click Deploy.
Electricity meter readings for all sites in first quarter 2016 (deployed from Data Catalog)
Replace if exists
The query is now deployed, you may want to open, execute and reuse the view as per your needs in the Web Design Studio.
With the query deployment, you have completed the proposed tasks in the Data Catalog. We hope that the exercises have made you understand its potential as a data exploration and discovery tool, featuring an advanced graphical query building tool, keyword-based search capabilities and categories and tags as flexible concepts for data assets governance.
The next session will be focused on Apache Zeppelin for Denodo which adds to the Apache Zeppelin community version a built-in Denodo interpreter and integrated authentication with the platform.
In Apache Zeppelin for Denodo, data scientists will leverage a notebook-based environment in which they can alternate code paragraphs, to perform a computation or to generate a plot, with free markdown-formatted text paragraphs, to write down notes or explanations of what they are doing. Thus, it is the ideal environment to perform exploratory data analysis, a core phase to prepare for machine learning algorithm training and validation.