Accessing HCatalog

HCatalog is a table and storage management layer for Hadoop that makes it easier to read and write data. HCatalog presents users with a relational view of the data in HDFS and ensures that users do not need to worry about where or in what format their data is stored.

In this section of the Big Data tutorial we will learn how to use the HCatalog REST API to access information about the tables created in a Hadoop installation. This API allows users to send HTTP requests to access Hadoop MapReduce, Pig, Hive, and HCatalog DDL resources. These resources are accessed using URLs of the following format: http://<servername>/templeton/v1/<resource>, where <resource> is the name of the HCatalog resource. You can see the list of HCatalog resources here.

In addition to the base URL, you also have to specify the username running the request. With the default security configuration, this is done by adding the user.name parameter to the URL. For instance, to get the list of databases in HCatalog you would send a request to the following URL: http://<servername>/templeton/v1/ddl/database?user.name=<your_username>
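The URL pattern above can be sketched with a small helper. The server name, port, and username below are placeholder values chosen for illustration (WebHCat, the service behind the templeton path, listens on port 50111 by default):

```python
from urllib.parse import urlencode

def hcatalog_url(server, resource, username):
    """Build a WebHCat (Templeton) URL for an HCatalog resource,
    appending the user.name parameter required by the default
    security configuration."""
    return "http://{}/templeton/v1/{}?{}".format(
        server, resource, urlencode({"user.name": username}))

# Hypothetical server and username, for illustration only:
print(hcatalog_url("hadoop-master:50111", "ddl/database", "vdpuser"))
# http://hadoop-master:50111/templeton/v1/ddl/database?user.name=vdpuser
```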

REST API Request

As you can see if you send this request from a web browser, the output format of the REST API is JSON, so we are going to import the HCatalog resources into Denodo using JSON data sources.
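For instance, the database list endpoint returns a JSON document containing a databases array. The snippet below parses a response of that shape; the database names in the sample payload are made up for illustration:

```python
import json

# Sample payload with the shape returned by GET /templeton/v1/ddl/database;
# the database names themselves are illustrative.
payload = '{"databases": ["default", "sales", "weblogs"]}'

databases = json.loads(payload)["databases"]
print(databases)  # ['default', 'sales', 'weblogs']
```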

To import a new JSON data source that gets all the databases in the system:

  1. Right-click on the Big Data folder and select New > Data source > JSON.
  2. Enter hcatalog_db_list_ds as Name and select HTTP Client as Data Route.
  3. Click on Configure.
    New JSON data source
  4. In the Edit HTTP Connection window, set http://<servername>/templeton/v1/ddl/database?user.name=<your_username> as the URL, replacing <servername> and <your_username> with the appropriate values.
    Edit HTTP Connection
  5. Click Ok and then Create base view.
    Base view creation with list of databases
  6. Rename the view hcatalog_db_list.
  7. Click Ok again to confirm the creation of the base view.

Because of the hierarchical structure of the JSON returned by the HCatalog REST API, you can see that the new base view has only one top-level field of type array. This array field contains a list of items with one single field (called field_0 by default) that holds the database name.

In the Basics Tutorial / Advanced Operations / Flatten section we learned how to flatten these fields with complex data types. Why don't you give it a try with this new view?
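As a rough sketch of what the Flatten operation does here: the base view exposes an array of items, each holding the database name in a single field, and flattening projects that nested structure into one row per database. The dictionary below only mimics that structure in plain Python; it is not Denodo's internal representation:

```python
# Nested structure similar to the base view: an array whose items
# each have a single field (field_0) holding the database name.
raw = {"databases": [{"field_0": "default"}, {"field_0": "sales"}]}

# Flattening yields one row (here, one value) per array item.
flattened = [item["field_0"] for item in raw["databases"]]
print(flattened)  # ['default', 'sales']
```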

TIP

Now that we have a view with the databases in our Big Data deployment, it would be useful to know the tables created in each database. To get the list of tables using the HCatalog API you can send a request to the following URL: http://<servername>/templeton/v1/ddl/database/<databasename>/table?user.name=<your_username> where you have to replace <databasename> with the actual name of the database. If we wanted to import this into Denodo, we would have to create a different data source for each database in the system. But wouldn't it be easier if we could provide the database name as an input parameter when querying the data source? Certainly, and it is possible to do this by using interpolation variables when defining the data source.

Interpolation variables start with the @ character followed by the variable name, and they can be included as part of the URL when defining a data source that uses HTTP connections. For instance, we can enter the URL above as: http://<servername>/templeton/v1/ddl/database/@param_database/table?user.name=<your_username>
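The substitution Denodo performs at query time can be sketched like this. The expand helper is not part of Denodo; it only illustrates how an @-prefixed variable in the URL is replaced by the value bound to it, and the server name and username are placeholders:

```python
def expand(template_url, **variables):
    """Replace each @<name> interpolation variable in the URL
    with the value bound to it at query time."""
    for name, value in variables.items():
        template_url = template_url.replace("@" + name, value)
    return template_url

# Hypothetical server/user; "default" is the value bound to param_database:
url = expand("http://hadoop-master:50111/templeton/v1/ddl/database/"
             "@param_database/table?user.name=vdpuser",
             param_database="default")
print(url)
# http://hadoop-master:50111/templeton/v1/ddl/database/default/table?user.name=vdpuser
```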

To import a new JSON data source that gets all the tables in a database entered as input parameter:

  1. Right-click on the Big Data folder and select New > Data source > JSON.
  2. Enter hcatalog_tables_ds as Name and select HTTP Client as Data Route. Then, click on Configure.
  3. In the Edit HTTP Connection window, set http://<servername>/templeton/v1/ddl/database/@param_database/table?user.name=<your_username> as the URL, replacing <servername> and <your_username> with the appropriate values.
  4. If you click Ok, a new Edit variable values dialog will ask for an example value for each interpolation variable used in the data source definition. In this case, you will have to provide the name of a database that exists in the system. This value will be used to run a test request against the data source to verify that the connection is working.
    Edit Variable Value
  5. Click Ok to confirm the data source creation and then Create base view.
  6. After doing this, you will be asked again for an example value for the interpolation variable. In this case, the value will be used to run a request against the data source; the format of the REST API response is then processed to produce the output schema of the new view. You should enter the same value as in the data source creation.
  7. Click Ok and rename the new view hcatalog_tables.
  8. Click Ok to confirm the creation of the view.
    Base view creation

As with the previous data source, the list of tables is contained under a field of type array called tables. If you try to query the new base view, you will see that it has a mandatory input parameter param_database, and that you will have to provide a value for it as a WHERE condition any time you run a query on the base view.
Query with mandatory input parameter
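Conceptually, the base view now behaves like a function of its mandatory parameter: every query must bind param_database, much as a WHERE param_database = '<db>' condition does in Denodo. A minimal sketch of that behavior, where the table names and the canned response are made up for illustration:

```python
def hcatalog_tables(param_database):
    """Mimic a base view with a mandatory input parameter: a value
    for param_database must be supplied with every query."""
    if param_database is None:
        raise ValueError("param_database is a mandatory input parameter")
    # In the tutorial this triggers an HTTP request to the /table
    # endpoint; here we return a canned, illustrative response.
    sample = {"default": ["orders", "customers"]}
    return {"database": param_database, "tables": sample.get(param_database, [])}

print(hcatalog_tables("default"))
# {'database': 'default', 'tables': ['orders', 'customers']}
```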

If you actually run a query, you will see that it returns the tables for the database that you specified as the input parameter.

To navigate to the list of tables you can just double-click on the tables field that is displaying the [Array]... value.
Query execution with mandatory input parameter

With that, you are finished with the Big Data Tutorial!

Great Job!