Task 2 - Data Tagging And Exploration With Data Catalog

Data Science and Machine Learning

In the previous session we have walked through the creation of a data transformation flow to read and integrate the GHCN monthly data from a specific site.

The focus of this session is to understand on how we can leverage the Data Catalog to:

  • Defining tags and categories structures that matches the functional meaning of data assets.
  • Tagging and categorizing the data assets defined in the virtual layer, with the freshly defined tags and categories.
  • Graphically build and run multi-view queries thanks to relational associations
  • Deploy the query in the Virtual DataPort Server.

You start by signing in to the Data Catalog application. In your browser open the following url:

http://denodo.dstutorial.com:9090/denodo-data-catalog/Login

Then enter the username and password as specified below, then click on Sign In


Username: dstutorial_usr
Password: dstutorial_usr
Virtual DataPort Server: localhost - dstutorial

As first step, define the categories ghcn - monthly and ghcn - daily with parent category weather. In the top menu bar, click on Administration->Catalog Management->Categories, the Category Management window is opened. There are already two categories defined, weather is the parent category of the ones to be defined.

Click on the New Category icon () then fill in the information as shown:

Proceed similarly with the category ghcn - monthly.

At the end of this step you should have the following categories structure:

Now go on with the creation of a tag, called predict-building-consumption. Click on Administration->Catalog Management->Tags to open the Tag Management menu. Once there, click on the New Tag icon (), then fill in the information as shown below (as you can see the description can be formatted in HTML).

As no tags were already defined, the Tag Management menu should look as the following after the tag creation:

Now you can associate the views to the categories and tag. The process is the same for tags and categories, so we are showing how to complete it for a category and you can complete the remaining ones on your own.

Go back to the Category Management menu (Administration->Content Management->Categories) then click on the edit icon () of the category that you want to modify, then click on the button Add from Database/Folders tree. A new window will be displayed in which you can search and select the items that you want to associate to the category: for example for category ghcn - daily, looking for daily allows you to browse for the relevant views and other objects.

You can associate the remaining category and tag with the following search logic:

  • ghcn - monthly: search monthly in search dialog and select all the matching views
  • predict-building-consumption: select everything in database dstutorial

You can now use your categories and tags in searching your data assets. In the example below, we look for views with tag predict-building-consumption, category ghcn - daily and the token site00 in their name, fields names or descriptions.

As already mentioned, the Data Catalog not only allows you searching the data assets defined in the Denodo virtual layer, but enables graphically querying those assets with no SQL skills required.

As an example let's say that you want to extract the meter reading (meter_reading) values, just for the first quarter 2016 and for electricity meter (meter=0), with some building information such as the build year (built_year) and the size (square_feet). Notice that although the information is not in a single view, you will be able to do your extract thanks to the existence of view associations defined in the virtual layer that permit multi-views queries.

So, to build your query head to the Search panel, clean all the filtering criteria that may be already defined and then search for iv_meter_reading. By clicking on the iv_meter_reading view, you will be redirected to the view summary page, in which you will find the view description, all the tags and categories it may be associated with and the schema of the view with field names, descriptions and types. If you jump to the Relationships tab, you will see that this view has two relationships with views iv_building and iv_weather. You can display associations details by hovering on the arrow.

Relationships, also called Associations, play an important role in different contexts of data modeling in Denodo, among the benefits they bring is the fact of giving the ability to perform multi-view queries to the Data Catalog users.

You can now move on and build your query. Go to the Query tab, then:

  • Drag and drop timestamp_in, meter_reading, year_built and square_feet in the Output Columns zone. Note that year_built and square_feet belong to iv_building, that you find in Relationship Fields area.
  • In the Filters zone, click on the Add icon () then enter the following expression:

getyear(timestamp_in) = 2016 and getquarter(timestamp_in) = 1 and meter = 0

Your query tab should now look like the following:

It is quite handy to have the ability to inspect the VQL code (just click on the icon), for example for sharing with a Denodo developer or just to understand query details, such as the join key in multi-view queries. You can now run the query to display the results in the same browser tab or export it for further analysis in a third party tool. You will need to choose the export format (CSV, Excel and Tableau) as well as the number rows to be extracted.

Once you have exported the data, click on Save() and give the query meaningful name and description.

There are several actions that you may want to do now on the saved query:

  • Modify and enhancing it at a later moment.
  • Share it with peers, by generating a share url.
  • Deploy it back to the underlying Virtual DataPort Server. This is useful when you wish to trigger the industrialisation phase of the query or just make it available as-is to peers as a standard Denodo view.

You will now deploy the view to the Virtual DataPort Server. To do that, click on the My Query button on the top menu bar. In the Query Explorer, locate the target query and click on .

In the deploy menu fill in the required details (Virtual DataPort Server, credentials, view name and description and folder), then click Deploy.

Server URI

//localhost:9999/dstutorial

User

dstutorial_usr

Password

dstutorial_usr

Target Folder

(dstutorial)/08_deployed_queries

View name

readings_electricity_2016Q1

View description

Electricity meter readings for all sites in first quarter 2016 (deployed from Data Catalog)

Replace if exists

LEAVE BLANK

The query is now deployed, you may want to open, execute and reuse the view as per your needs in the Web Design Studio.

With the query deployment, you have completed the proposed tasks in the Data Catalog. We hope that the exercises have made you understand its potential as a data exploration and discovery tool, featuring an advanced graphical query building tool, keyword-based search capabilities and categories and tags as flexible concepts for data assets governance.

The next session will be focused on Apache Zeppelin for Denodo which adds to the Apache Zeppelin community version a built-in Denodo interpreter and integrated authentication with the platform.

In Apache Zeppelin for Denodo, data scientists will leverage a notebook-based environment in which they can alternate code paragraphs, to perform a computation or to generate a plot, with free markdown-formatted text paragraphs, to write down notes or explanations of what they are doing. Thus, it is the ideal environment to perform exploratory data analysis, a core phase to prepare for machine learning algorithm training and validation.