Accessing Apache HCatalog

HCatalog is a table and storage management layer for Hadoop that makes it easier to read and write data. HCatalog presents users with a relational view of the data in HDFS and ensures that users do not need to worry about where or in what format their data is stored.

In this section of the Big Data tutorial we will learn how to use the HCatalog REST API to access information about the tables created in a Hadoop installation. This API allows you to send HTTP requests to access Hadoop MapReduce, Pig, Hive, and HCatalog DDL resources. These resources are accessed using the following URL format: http://<servername>/templeton/v1/resource, where resource is the name of the HCatalog resource. You can see the list of HCatalog resources here.

In addition to the base URL, you will also have to specify the user name running the request. With the default security settings, you just have to add the user.name parameter to the URL. For instance, to get the list of databases in HCatalog you would send a request to the following URL: http://<servername>/templeton/v1/ddl/database?user.name=<your_user_name>
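As a sketch of how such a request URL is put together, the following Python snippet builds the database-list URL described above. The server name and user name are placeholders you must replace with values from your own installation:

```python
from urllib.parse import urlencode

def hcatalog_url(servername, resource, user_name):
    """Build a WebHCat (Templeton) REST URL for an HCatalog resource.

    'servername' and 'user_name' are placeholders for values from
    your own Hadoop installation.
    """
    query = urlencode({"user.name": user_name})
    return f"http://{servername}/templeton/v1/{resource}?{query}"

# Example: the database-list request described above
url = hcatalog_url("hadoop.example.com", "ddl/database", "tutorial_user")
print(url)
# http://hadoop.example.com/templeton/v1/ddl/database?user.name=tutorial_user
```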

REST API Request

If you send this request from a web browser, you can see that the output format of the REST API is JSON, so we are going to import the HCatalog resources into Denodo using JSON data sources.
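For this endpoint, the response body is a JSON document listing the database names. The sample below assumes a response of the shape {"databases": [...]}; the actual database names will depend on your installation:

```python
import json

# A sample response body from GET /templeton/v1/ddl/database
# (hypothetical database names for illustration)
body = '{"databases": ["default", "sales", "weblogs"]}'

doc = json.loads(body)
for name in doc["databases"]:
    print(name)
```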

To import a new JSON data source that gets all the databases in the system:

  1. Right-click on the Big Data folder and select New > Data source > JSON.
  2. Enter hcatalog_db_list_ds as Name and select HTTP Client as Data Route. Then, click on Configure.
    New JSON data source
  3. In the Edit HTTP Connection window set http://<servername>/templeton/v1/ddl/database?user.name=<your_user_name> as URL, replacing <servername> and <your_user_name> with appropriate values.
    Edit HTTP Connection
  4. Click Ok twice and then click on Create base view.
    Base view creation with list of databases
  5. Click Ok again to confirm the creation of the base view.
  6. Rename the view to hcatalog_db_list and click Ok.

Because of the hierarchical structure of the JSON returned by the HCatalog REST API, you can see that the new base view has only one top-level field of type array. This array field contains a list of items with a single field (called field_0 by default) that will actually contain the database name.

In the Basics Tutorial we learned how to flatten these fields with complex data types. Why don't you give it a try with this new view?
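In Denodo you would use the flatten operation from the Basics Tutorial; as a mental model, the transformation looks roughly like this (the field names follow the defaults described above, and the sample row is hypothetical):

```python
# The base view returns one row whose array field holds a list of
# registers, each with the single default field 'field_0'.
row = {"databases": [{"field_0": "default"}, {"field_0": "sales"}]}

# Flattening promotes each element of the array to its own row
flat = [{"field_0": item["field_0"]} for item in row["databases"]]
print(flat)
# [{'field_0': 'default'}, {'field_0': 'sales'}]
```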

TIP

Now that we have a view with the databases in our Big Data deployment, it would be interesting to know the tables created in each database. To get the list of tables using the HCatalog API you can send a request to the following URL: http://<servername>/templeton/v1/ddl/database/<databasename>/table?user.name=<your_user_name>, where you have to replace <databasename> with the actual name of the database. If we wanted to import this into Denodo, we would have to create a different data source for each database in the system. But wouldn't it be easier if we could provide the database name as an input parameter when querying the data source? This is possible using interpolation variables when defining the data source.

Interpolation variables start with the @ character followed by the variable name, and they can be included as part of the URL when defining a data source that uses HTTP connections. For instance, we can enter the URL above as: http://<servername>/templeton/v1/ddl/database/@param_database/table?user.name=<your_user_name>
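Conceptually, Denodo substitutes the value supplied at query time into the @-prefixed placeholder. A minimal sketch of that substitution, with a placeholder server name and user name:

```python
import re

# URL template using a Denodo-style interpolation variable (@param_database);
# server name and user name are placeholders for this sketch.
template = ("http://hadoop.example.com/templeton/v1/ddl/database/"
            "@param_database/table?user.name=tutorial_user")

def interpolate(url_template, **params):
    """Replace each @name variable with the value supplied for 'name'."""
    return re.sub(r"@(\w+)", lambda m: params[m.group(1)], url_template)

print(interpolate(template, param_database="default"))
# http://hadoop.example.com/templeton/v1/ddl/database/default/table?user.name=tutorial_user
```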

To import a new JSON data source that gets all the tables in a database entered as input parameter:

  1. Right-click on the Big Data folder and select New > Data source > JSON.
  2. Enter hcatalog_tables_ds as Name and select HTTP Client as Data Route. Then, click on Configure.
  3. In the Edit HTTP Connection window set http://<servername>/templeton/v1/ddl/database/@param_database/table?user.name=<your_user_name> as URL, replacing <servername> and <your_user_name> with appropriate values.
  4. If you click Ok you will see a new Edit variable values dialog that will ask for an example value for the interpolation variables used in the data source definition. In this case, you will have to provide the name of a database that exists in the system. This value will be used to run a test request against the data source to verify that the connection is working.
    Edit Variable Value
  5. Click Ok twice to confirm the data source creation and then click on Create base view. After doing this, you will be asked again for an example value for the interpolation variable. This time the value will be used to run a request against the data source, and the format of the REST API response will be processed to obtain the output schema of the new view. You should enter the same value as in the data source creation.
  6. Click Ok twice, rename the new view to hcatalog_tables and click Ok to confirm the creation of the view.
    Base view creation
  7. As with the previous data source, the list of tables is contained under a field of type array, in this case called tables.
  8. Now, if you try to query the new base view you will see that it has a mandatory input parameter, param_database, and that you will have to provide a value for this field using a WHERE condition every time you run a query on the base view.
    Query with mandatory input parameter
  9. And if you actually run the query you will see that it returns the tables for the database that you specified as the input parameter.
  10. To navigate to the list of tables you can just double-click on the tables field that is displaying the [Array]... value.
    Query execution with mandatory input parameter
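The behavior of the mandatory input parameter in the steps above can be pictured with a small sketch; the function and sample data are hypothetical stand-ins for the base view and the rows Denodo would fetch from the WebHCat endpoint:

```python
# Hypothetical responses keyed by database name (in Denodo these rows
# come from the ddl/database/<db>/table endpoint shown earlier)
TABLES_BY_DB = {
    "default": ["customers", "orders"],
    "weblogs": ["hits"],
}

def query_hcatalog_tables(param_database=None):
    """Mimic the base view: param_database is a mandatory input parameter,
    so a query without a WHERE condition on it must fail."""
    if param_database is None:
        raise ValueError("a WHERE condition on param_database is mandatory")
    return TABLES_BY_DB.get(param_database, [])

print(query_hcatalog_tables(param_database="default"))
# ['customers', 'orders']
```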

With this section, you have finished the Big Data Tutorial.

Congratulations!