DATA DISCOVERY

The Data Catalog is a web-based, self-service tool included in the Denodo Platform that lets both technical and business users query, search and browse information and metadata stored in a Virtual DataPort server. With this tool, users can generate new knowledge and make better-informed decisions.

Scenario

In this tutorial, we are going to show this use case:

The IT / Data department of our company receives frequent requests for access to data. The business users making these requests usually do not know what types of data the company holds or where that data is located, so the requests take much longer to process than necessary.

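To solve that use case, this tutorial shows you how to search the metadata, explore views, enable content search, manage descriptions, tags and categories, and use the recommendation and collaboration features of the Data Catalog.

If you have followed the previous tutorials, your Virtual DataPort server will contain something similar to this: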

Launching the Data Catalog

The Data Catalog is a web application included in Denodo 8.0. It gives data analysts, business users and application developers a business-friendly way to search and browse data and metadata for self-service exploration and analytics.

To start this web tool, open the Denodo Platform Control Center and start the Data Catalog. Once its status changes to "Running", click the Data Catalog link to open the web tool (by default, https://127.0.0.1:9090/denodo-data-catalog).

Now log in to the Data Catalog using the default credentials (admin/admin):

The first time you log in to the Data Catalog, you will see the Synchronize Metadata popup window. Run the synchronization the first time you open the Data Catalog so that it reflects the latest state of the Denodo 8.0 server you are connected to.

Run the VDP Synchronization as follows:

  1. Click the Synchronize the metadata now link.
  2. Click Continue on each Synchronization step.
  3. The views are now synchronized so you can start exploring!

Using the Metadata Search

Our first example is from the Data Catalog home screen.

Let's take the role of a Business Analyst and explore a simple use case: search for clients by typing client and pressing Enter.

Here we have the results of our search. In Data Catalog 8.0, this search looks for views or web services that contain the query terms in their metadata, such as:

  • Its name.
  • Its description.
  • The names of its fields.
  • The descriptions of its fields.
  • The values of any custom properties it has assigned.
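
As a rough illustration of what such a metadata search covers, the following Python sketch matches a query term against the five kinds of metadata listed above. It is a toy model and not the Data Catalog's implementation: the ViewMetadata class, the search function and the sample descriptions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ViewMetadata:
    """Hypothetical, simplified model of the metadata of a view."""
    name: str
    description: str = ""
    # field name -> field description
    fields: dict = field(default_factory=dict)
    # custom property name -> custom property value
    properties: dict = field(default_factory=dict)


def matches(view, term):
    """Return True if the term occurs in any searchable piece of metadata."""
    term = term.lower()
    searchable = [view.name, view.description]
    searchable += list(view.fields.keys())        # field names
    searchable += list(view.fields.values())      # field descriptions
    searchable += list(view.properties.values())  # custom property values
    return any(term in text.lower() for text in searchable)


def search(views, term):
    """Return the names of the views whose metadata contains the term."""
    return [v.name for v in views if matches(v, term)]


# Made-up sample metadata, for illustration only.
catalog = [
    ViewMetadata("client", "Company customers",
                 {"client_id": "Identifier", "surname": "Family name"}),
    ViewMetadata("address", "Postal addresses of each client",
                 {"state": "State of residence"}),
    ViewMetadata("product", "Items for sale", {"price": "Unit price"}),
]

print(search(catalog, "client"))
```

In this sketch the address view is returned as well, because the term appears in its description even though it does not appear in its name.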


For example, let's click on the view client to be taken to the summary of the selected view:

So far, we have only searched the Virtual DataPort metadata. In the next section, we will investigate more advanced functions of the Data Catalog!

We are now going to explore the features that allow a more in-depth examination of a view in the Data Catalog. This includes:

  • Querying a view and filtering results
  • Exporting results to a file
  • Creating new fields
  • Saving queries
  • Exploring view relationships
  • Exploring data lineage
  • Querying views with relationship fields

Data Catalog View Exploration

From the previous section, we have selected our client view. We can now explore the contents of this view.

Summary Tab

Under the Summary tab, we can see a summary of the selected view. It shows metadata such as the database name, the list of categories, the list of tags, and collaboration information like the endorsements and warnings provided by users. Click the Edit button beside Description to edit the description of the view. If the view is deprecated, an indication appears at the top of the Summary tab.

Additionally, the Summary tab includes the Add Tags/Categories buttons (covered later in this tutorial), Collaboration options to annotate the view, and buttons such as Connection URLs and Tableau that show different ways to connect to the view or data source.

Schema Tab

Under the Schema tab, we can see the schema of the view: the view description and all the fields with their types. Click the Edit button beside a field to add its description. We can also search for fields, data types and descriptions using the search option at the top of each section.

Query Tab

The next tab is the Query tab. Here, ad hoc queries can be run against the view (the query is created graphically).

For our view, select the following fields and drag them into the Output columns area.

  • client_id
  • name
  • surname
  • client_type

Now click Execute, to get the results:


Of course, the Data Catalog lets you export the results! Click the Export button and select CSV, HTML, Excel or Tableau as the output format.

More options are available when querying a view

If we want to filter the results of the view and order them by one of its fields, we can easily do so. Click the Definition bar to bring back the query options.

Begin by dragging the field we want to filter by, for example client_type, to the Filters section.

We now need to complete the filter expression: add = and '02'. We also drag the surname field to the Order By section and click the arrow to change the order to descending.


Now click Execute. The results are now filtered to include only rows with client_type = '02', ordered by the surname field in descending order.
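
The Filters and Order By sections play the role of a filter condition and a sort on the result set. As a minimal sketch of the same logic on in-memory rows (the sample rows and the run_query helper are made up for illustration; the real query is executed by the Data Catalog against Virtual DataPort):

```python
# Made-up sample rows with the four output columns selected above.
rows = [
    {"client_id": "C001", "name": "Anne", "surname": "Baker", "client_type": "02"},
    {"client_id": "C002", "name": "John", "surname": "Smith", "client_type": "01"},
    {"client_id": "C003", "name": "Mary", "surname": "Young", "client_type": "02"},
]


def run_query(rows, client_type, descending=True):
    """Keep rows with the given client_type, then order them by surname."""
    # client_type is compared as text: '02' is a string, not the number 2.
    filtered = [r for r in rows if r["client_type"] == client_type]
    return sorted(filtered, key=lambda r: r["surname"], reverse=descending)


for r in run_query(rows, "02"):
    print(r["client_id"], r["name"], r["surname"], r["client_type"])
```

With the sample rows above, only the two rows of type '02' remain, with Young listed before Baker because the order is descending.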

We can further manipulate the resulting set by using the Add feature!

Let us consider the scenario where we want to combine the name and surname fields into a new full_name field. We can do this by concatenating the name with the surname following these steps:

  1. In the Output columns section, click the three dots and then click the Add option.

  2. In the "New output field" dialog, click the Edit button beside the Field name column, enter full_name as the field name and concat(name,' ',surname) as the Expression.

  3. Our results now include the newly created full_name field.

  4. If we would like to save this query for later use, we can click Save. This saves the query under the My Queries section.
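
To make the effect of the new output field concrete, here is a small Python sketch of a derived field computed from two existing ones. The concat and add_output_field helpers and the sample row are hypothetical stand-ins for the expression entered in the dialog, and the sketch does not model how NULL values are handled.

```python
def concat(*values):
    """Join the values into one string (NULL handling is not modelled)."""
    return "".join(str(v) for v in values)


def add_output_field(row, field_name, expression):
    """Return a copy of the row with a new field computed by `expression`."""
    new_row = dict(row)
    new_row[field_name] = expression(row)
    return new_row


# Made-up sample row.
row = {"client_id": "C001", "name": "Anne", "surname": "Baker"}

with_full_name = add_output_field(
    row, "full_name", lambda r: concat(r["name"], " ", r["surname"])
)
print(with_full_name["full_name"])  # Anne Baker
```

The original row is left untouched, in the same way that adding an output field to a query does not change the underlying view.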

Relationships Tab

The next tab is the Relationships tab, which shows the associations created between views.

This is useful for the business user to understand how certain views are related. You can click on the 'i' icon to see the related view information.

Queries involving views with relationships

It is possible to run simple queries that join related views in the Data Catalog by using the Relationship Fields option. These relationships are the same ones shown in the Relationships tab, and they are defined in the Virtual DataPort server.

Let's return to the Query tab of the client view. In the Relationship Fields section, we see address. This is due to the relationship defined in the Linked Data tutorial. Now you can add the field address / state:

If we execute this query, we will see that the result set contains the newly added address / state field.

Data Lineage Tab

The lineage tab displays a tree graph with all the data sources and views used to build the current view.

If we click on one of the fields under View fields, we will be able to see the lineage of a specific field. This is especially useful when dealing with complicated derived views, as we will explore later.

By clicking on a node, you can see the details of the corresponding data source or view (e.g. Name, Type, Description, Projected fields, Join conditions, etc).

Lineage of Complex Views

Let us now view the lineage of a more complex view.

Return to the Search page and search for client_with_bills. Open this view and navigate to the Data lineage tab and select the primary_phone field.

We can now see the value of the Data lineage tab, where we can identify the lineage of the primary_phone field including all of the operations involved with the field.
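
Conceptually, a lineage is a tree whose root is the current view and whose leaves are data sources. The Python sketch below walks such a tree. The dependency graph and all element names in it are placeholders, not the real structure of client_with_bills, and the code is only a way to picture what the Data lineage tab draws.

```python
# Hypothetical dependency graph: each view maps to the elements it is built from.
# Elements that do not appear as keys (the data sources) are the leaves.
DEPENDENCIES = {
    "derived_view": ["base_view_1", "base_view_2"],
    "base_view_1": ["data_source_1"],
    "base_view_2": ["data_source_2"],
}


def lineage(element, depth=0):
    """Return the lineage tree of an element as indented lines, root first."""
    lines = ["  " * depth + element]
    for dependency in DEPENDENCIES.get(element, []):
        lines.extend(lineage(dependency, depth + 1))
    return lines


def data_sources(element):
    """Return the set of leaf elements (data sources) an element depends on."""
    dependencies = DEPENDENCIES.get(element)
    if not dependencies:
        return {element}
    found = set()
    for dependency in dependencies:
        found |= data_sources(dependency)
    return found


print("\n".join(lineage("derived_view")))
print(sorted(data_sources("derived_view")))
```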

In the next section we will explore Indexing data to enable the Content search functionality (note that until now, the Search form was only searching in the Metadata but not in the data returned by those views!).

Note: the next section is aimed at technical users who want to know how to enable this functionality. If you only need to learn how to use it, you can skip ahead to Indexed View Exploration.

In this section we will explore the Content Search feature of the Data Catalog. With this feature you can use Denodo Scheduler to index the content of your views using either Elasticsearch or the Denodo Scheduler Index Server. You can then let your users perform Google-like searches on that content and customize how they see the search results.

In our example we are going to index the fields of the client view, to allow more rapid discovery of client details.

Index Creation & Configuration

Our first step is to configure an Index. Let's see how to do that using Denodo Scheduler.

Creating an Index in the Denodo Scheduler

Start the Scheduler Server, the Scheduler Index Server and the Scheduler Administration tool from the Denodo Control Center. Once these are all running, open the Scheduler Administration Tool by clicking on the link (by default: http://127.0.0.1:9090/webadmin/denodo-scheduler-admin).


Create the Index following these steps:

  1. In the login screen of the Scheduler Administration Tool, provide the login details admin / admin and the URI of the Scheduler Server. The URI of the server has the format //<host>:<port>.

  2. In the Denodo Scheduler we need to create a new job to create and maintain the Index. Click Add Job > VDPIndexer.

  3. Give the job a suitable name, in this case index_clients.

  4. Under the Extraction section, choose the following settings and leave the rest as default/blank:
  • Data Source: vdp
  • Database: tutorial
  • View: tutorial.client
  • Indexing process name: tutorial.client

  5. Under the Exporters section, click Add Exporter > Scheduler-Index and choose the following settings, leaving the rest as default/blank:
  • Data Source: Scheduler-Index
  • Index name: ix_client

  6. Save the Scheduler job. Once the job is saved, you can execute it by clicking the three dots under the Processed (Tuples/Errors) column and then the start option.

  7. The job will execute and, once it completes successfully, the Result status will change to COMPLETE, indicating that the Index has been populated.

Configuring the Index in the Data Catalog

We now need to configure the newly created Index in the Data Catalog, in order to ensure that the Data Catalog includes the Index as part of the searchable content.

  1. Open the Data Catalog and navigate to Administration > Set-Up > Content Search.
  2. In the Administration window, click on the Content Search option.

  3. Click on the + Add server option under the Index Servers tab.

  4. Add the following details in the Add New Index Server screen:
  • Name: TutorialIndex
  • Type: Scheduler Index
  • Description: Tutorial Index
  • Host: localhost
  • Port: 9000
  • Login: admin
  • Password: admin
  5. Click Ok.

  6. Go to the Configuration tab and click the Pencil Icon.

  7. In the Search Index Path screen, add the following details:
  • Index Type: Scheduler Index
  • Index Server: TutorialIndex
  • Index Name: ix_client
  8. Click Ok.

  9. The Index will display a green checkmark under the Configured column to indicate that it was added successfully.

DONE! In the next section we will see our new Index in action!

We can now use the Index feature to explore data using the Content Search function.

Indexed View Exploration

  1. In the Data Catalog, navigate to the Search page, and select the following options:
  • Data type: Content (this option appears only after configuring the index following the steps of the previous section)
  • Database: tutorial
  • View: client

  2. Type James into the search field and hit Enter to run the search:

  3. The search will return all content that includes the string James. Click the Plus Icon (+) next to the Preview results to expand the results and show the field that matches the search.

  4. You can also click the client view name and see the filtered data. Using the Search tab, you can search the index directly.

  5. For example, we can now search for Jack, and the results from the Index are returned.
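
To picture why content search needs an index while metadata search does not, the following Python sketch builds a toy inverted index over the rows of a view and looks up a term in it. It is only an illustration of the idea: it is not how the Scheduler Index Server works internally, and the sample rows and function names are made up.

```python
from collections import defaultdict


def build_index(rows):
    """Map each lowercase token to the positions of the rows that contain it."""
    index = defaultdict(set)
    for position, row in enumerate(rows):
        for value in row.values():
            for token in str(value).lower().split():
                index[token].add(position)
    return index


def content_search(index, rows, term):
    """Return the rows whose content contains the term as a whole token."""
    positions = index.get(term.lower(), set())
    return [rows[p] for p in sorted(positions)]


# Made-up sample content of the client view.
rows = [
    {"client_id": "C001", "name": "James", "surname": "Baker"},
    {"client_id": "C002", "name": "Jack", "surname": "Smith"},
    {"client_id": "C003", "name": "Mary", "surname": "James"},
]
ix_client = build_index(rows)

for row in content_search(ix_client, rows, "James"):
    print(row)
```

Searching this toy index for client returns nothing, because the term only appears in the view's metadata and not in its data. That is the difference between the two search types.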

Completed! In the next section we will explore the features of the Data Catalog View metadata.

In this section we will explore the metadata features of the Data Catalog. With these features you can add tags and categories to views, as well as update the view and field descriptions.

In our example we are going to: (1) add descriptions to the client fields, to allow more specific discovery of this view, (2) add tags and categories and (3) apply them to our client view.

Data Catalog Metadata

A useful feature of the Data Catalog is the ability to display view metadata, such as the View Description, as well as the Field Descriptions. Let's see how to modify that information.

Editing View and Field Descriptions

  1. Navigate to the Summary page of the Client view and click the Edit option beside Description.
  2. Add an appropriate description for the view and click Ok.

  3. Similarly, add descriptions to the fields by navigating to the Schema tab and clicking the Edit option beside each field.

  4. The view now displays the added descriptions. These descriptions are saved in the Data Catalog metadata.
    Note: It is recommended to synchronize the metadata with the Virtual DataPort server in order to keep the Data Catalog up to date. To do so, use the option Administration > Sync with VDP.
    Please note that metadata changes made in the Virtual DataPort server can be synchronized into the Data Catalog, but the Tags & Categories created in the Data Catalog cannot be synchronized back to the Virtual DataPort server.

Adding Categories and Tags to the Data Catalog metadata

Tags & Categories help users search the Data Catalog more accurately. While the number of Data Sources and Views in our tutorial is small, maintaining good categorization and tagging habits pays off over the long term by letting users navigate the Data Catalog more easily.

Adding Categories

  1. Navigate to Administration > Set-up and Management.
  2. In the Administration window under Catalog Management, click on the Categories option.

  3. Click the + Add Category icon.

  4. Create a category with the following details:
  • Name: Customer
  • Description: Data sources relating to customer

  5. Create another category with the following details:
  • Name: CRM
  • Description: Acme_crm System
  • Parent: Customer

  6. Create a final category with the following details:
  • Name: Billing
  • Description: Customer Billing
  • Parent: Customer

We now have a useful set of categories to link to our Views.
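
The three categories just created form a small hierarchy through their Parent setting. As an illustrative Python sketch (the data structure and helper names are assumptions, not how the Data Catalog stores categories), the hierarchy and the Customer > CRM style path can be modelled like this:

```python
# Category name -> parent category name (None for a top-level category).
PARENTS = {
    "Customer": None,
    "CRM": "Customer",
    "Billing": "Customer",
}


def category_path(name):
    """Return the full path of a category, from the top level down."""
    parts = []
    while name is not None:
        parts.append(name)
        name = PARENTS[name]
    return " > ".join(reversed(parts))


def subcategories(name):
    """Return the direct subcategories of a category, sorted by name."""
    return sorted(child for child, parent in PARENTS.items() if parent == name)


print(category_path("CRM"))       # Customer > CRM
print(subcategories("Customer"))  # ['Billing', 'CRM']
```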

Adding Tags

  1. Navigate to Administration > Set-up and Management.
  2. In the Administration window under Catalog Management, click on the Tags option.
  3. Click the + Add Tag icon and create a new Tag with the following details:
  • Name: JDBC
  • Description: JDBC data sources

  4. Create another tag with the following details:
  • Name: SOAP
  • Description: SOAP Data Sources

We now have a useful set of tags to link to our Views.

Modify views for adding Categories and Tags

  1. We can now navigate to the Client view and click on the Add Category button in the Summary tab.
  2. Click CRM and then click Ok.

  3. Now select the Add Tag button in the Summary tab and select JDBC. Click Ok.

  4. We have now added this view to the Customer > CRM category and have tagged it with the JDBC tag.

Browse using Tags & Categories

  1. To start browsing your views and web services by tags, go to Browse > Tag.
  2. In the sidebar, you will see the list of tags available in the Data Catalog.

  3. Click the tag JDBC to see the elements that have been assigned this tag.

  4. Similarly, to browse by categories, go to Browse > Category.
  5. From the list of categories in the sidebar, expand the category Customer to see its subcategories.

  6. Click CRM to explore its views and web services.

We have now seen how the effective use of the Descriptions, Categories and Tags can enable powerful data exploration.

In the next section, we will explore the recommendation and collaboration options in the Data Catalog.

In this section, we are going to explore the new features of Data Catalog 8.0, offered as two feature packs:

  • The "AI Feature Pack" provides AI-driven recommendations of datasets to users.
  • The "Semantics Feature Pack" allows for collaboration among users by adding endorsements, warnings and deprecation notes to views and web services.

The Feature Packs are licensed separately from the Denodo Platform. To begin using a Feature Pack you do not need to install a new component; you only need to install a new license file.

In our example, we will focus on these feature packs using the client view.

AI Feature Pack

The AI Feature Pack includes automatic recommendation of datasets in the Data Catalog, to help you discover new elements among the data resources of your company.

Automatic recommendation of datasets in the Data Catalog

With this feature, the Data Catalog displays personalized recommendations to users based on past activity in the Data Catalog, for example which datasets are most used or recently used.

  1. To see the recommendations, go to the homepage of the Data Catalog.
  2. The homepage presents you with a selection of items organized by different topics including a topic named Recommended to you. This recommendation of datasets is only available with the AI Feature Pack.

Semantics Feature Pack

The Semantics Feature Pack includes Collaboration in Data Catalog to allow Data Stewards to better communicate with their business users.

Collaboration in Data Catalog

In this section, we will see how to add the following collaborative annotations to the Client view. They can be added to views and web services:

  • Endorsements
  • Warnings
  • Deprecation notes

Endorsements

Endorsements are comments that users leave on a view or web service to show their support. A user can endorse a view or web service only once: when the same user writes a new comment, it replaces the previous endorsement.

  1. To create an endorsement, navigate to the Summary tab of the Client view and click on the Collaboration > Endorse option.

  2. In the Endorse dialog, provide the details which you would like other users to see. For example, add the details as follows:
    "This Client view is a key component of our model. It is associated with Address view to give expanded information about each client."
  3. Click Ok to save the endorsement.

  4. In the Summary tab, the Endorsed by label displays the number of endorsements on this view and their authors. Hover over an author, for example 'admin', to see that author's endorsement comment.
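
The "one endorsement per user" rule described in this section can be pictured as a mapping keyed by author, where writing again replaces the earlier comment. The Python sketch below is a hypothetical model for illustration only: the Endorsements class and the analyst user are assumptions, not part of the Data Catalog.

```python
class Endorsements:
    """Hypothetical model: at most one endorsement per author on an element."""

    def __init__(self):
        self._by_author = {}  # author -> latest comment

    def endorse(self, author, comment):
        # A second endorsement by the same author replaces the first one.
        self._by_author[author] = comment

    def count(self):
        return len(self._by_author)

    def authors(self):
        return sorted(self._by_author)

    def comment_of(self, author):
        return self._by_author[author]


client_view = Endorsements()
client_view.endorse("admin", "Key component of our model.")
client_view.endorse("analyst", "Reliable source for client data.")
client_view.endorse("admin", "Key component; see also the Address view.")

print(client_view.count())              # 2
print(client_view.comment_of("admin"))  # Key component; see also the Address view.
```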

Warnings

Warnings let users write and display "advise against" messages on views and web services. A user can write only one warning per view or web service.

  1. To create a warning message, go to the Summary tab of the Client view and click on the Collaboration > Warn option.

  2. In the Warn dialog, add the following warning information:
    "This view will be updated with delta records once in a week"
  3. Click Ok to save the warning message.

  4. In the Summary tab, the Warning by label displays the number of warnings on this view and their authors. Hover over an author, for example 'admin', to see that author's warning.

Deprecation

Deprecation notes inform users that a view or web service is obsolete and should not be used anymore. A user can write only one deprecation note per view or web service.

  1. To deprecate a view, go to the Summary tab of the Client view and click on the Collaboration > Deprecate option.

  2. In the Deprecate dialog, we will add the following deprecation notes:
    "This view will be deprecated from next cycle. Users will be notified about the latest view by the end of this month."
  3. Click Ok to save the deprecation note.

  4. In the Summary tab of the view, you will see the ⚠ icon in the toolbar, and a notification will pop up every time you click on the icon or access the view.

GREAT! We have now seen how the recommendations and collaborative features help users in Data Catalog.

In this tutorial, we have only had a limited number of Views, Data Sources, Tags and Categories, but it is clear that through the use of the Data Catalog, business users will be able to explore the company's data easily and quickly, with minimal overhead on the IT team. We have also learnt how the feature packs included in the Data Catalog can be used and how they help users in a collaborative environment.

Thanks!