Accessing hierarchical JSON data in Hive from Denodo

Applies to: Denodo 7.0 , Denodo 6.0
Last modified on: 06 Mar 2018
Tags: Big Data Hive JDBC data sources

Download original document

You can translate the document:

Goal

This document describes how to access to hierarchical data represented in JSON format in Hive from Denodo Virtual DataPort.

Content

Hierarchical data is the data that is organized into some kind of tree structure with different information levels related between them. As an example we can use the model to represent an employee that can contain data about its id, name, telephone, address and department. At the same time we can split the field address into new fields as city, street, number,... as well as the field department could be split into new fields as name, director, budget, address…

For an employee it is possible to represent using JSON this hierarchical information as follows:

{ "id" : "1234",

  "name" : "Christopher C Laird",

  "telephone" : "212-618-9176",

  "date" : "04/17/2013",

  "address" : { "city" : "New York",

                "street" : "3204 Oakwood Avenue",

                "zipcode" : "10007"

               },

  "department" : {"name" : "Sales",

                  "director" : "Ellis L Williams",

                  "budget" : "100000",

                  "address" : {"city" : "New York",

                               "street" : "3687 Pallet Street",

                               "zipcode" : "10007"

                              }

                 }

}

Hierarchical data can be managed in Hive using different data types:

  • String. Example: JSON encoded string
  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
  • union: UNIONTYPE<data_type, data_type, ...>

In this document we will focus on the JSON encoded strings and how to access to this information from Virtual DataPort leveraging the capabilities of Hive managing JSON format.

Starting from the example presented above, the JSON example models an employee entity in a usual company. It has many fields that are relevant for the company as the name, address and telephone of the employee. In addition, it contains information about the department of the employee as well.

For our example queries, we are going to use a JSON document with three employees, the first two have information for all the fields whereas the last one does not include the department information, simulating the scenario where an employee has recently arrived to the company and not all of his information is complete yet. We will use this employee without department to demonstrate how to combine information that it is not available in all the registers of a table. Moreover, we will assume that this JSON table is called employee and that it exists in our Hive environment.

All the examples below are built using the “Create from query” option available for JDBC data sources. It is also possible to create the base view through introspection, in that case the JSON would be obtained as a single text column and then we would need to create an additional JSON data source in order to extract the fields from the plain text column using the “From variable” option. Using the “Create from query” option when querying Hive we are going to use the LATERAL VIEW clause and the json_tuple UDTF (Built-in Table-Generating Function).

In a first example, the view employee is created from a very simple query. We are going to get the fields id, name, telephone and hiring date of the employee. These fields are top level fields in the JSON so the query that we will use for this will include just one LATERAL VIEW clause:

SELECT employee_id, employee_name, employee_telephone, employee_date

FROM employee

LATERAL VIEW json_tuple (employee.json, 'id', 'name', 'telephone', 'date') emp AS employee_id, employee_name, employee_telephone, employee_date

employee view

This query works as follows: the json_tuple function extracts the attributes passed as parameter from the json string and then, the LATERAL VIEW clause appends this information to the results of the query acting as a join.

In the following example we are going to add the address information of the employee to the results of the query. For this, we will use a second LATERAL VIEW clause in order to join the employee fields with the address information:

SELECT employee_id, employee_name, employee_telephone, employee_date, addr_city, addr_street, addr_zipcode

FROM employee

LATERAL VIEW json_tuple (employee.json, 'id', 'name', 'telephone', 'date', 'address') emp AS employee_id, employee_name, employee_telephone, employee_date, employee_address

LATERAL VIEW json_tuple (emp.employee_address, 'city', 'street', 'zipcode') addr AS addr_city, addr_street, addr_zipcode

employee_address view

Now, an example to show that LATERAL VIEW can be used with JSON attributes that only appear in some of the registers. In this case we are going to add the department information to the employee and address fields. Remember that the data for one of the employees does not include department data. We will see how the department fields for this register will contain null values, allowing the query to be executed successfully.

SELECT employee_id, employee_name, employee_telephone, employee_date, addr_city, addr_street, addr_zipcode, dept_name, dept_director, dept_budget

FROM employee

LATERAL VIEW json_tuple (employee.json, 'id', 'name', 'telephone', 'date', 'address', 'department') emp AS employee_id, employee_name, employee_telephone, employee_date, employee_address, employee_department

LATERAL VIEW json_tuple (emp.employee_address, 'city', 'street', 'zipcode') addr AS addr_city, addr_street, addr_zipcode

LATERAL VIEW json_tuple (emp.employee_department, 'name', 'director', 'budget') dept AS dept_name, dept_director, dept_budget

employee_department view

As a conclusion, we have learned how to use LATERAL VIEW and json_tuple to query JSON tables through Hive from Denodo to extract the required fields and to avoid post-processing the original JSON string column on the Virtual DataPort side.

References

LanguageManual LateralView

LanguageManual UDF

Appendix

JSON file used for the examples:

{ "id" : "1234", "name" : "Christopher C Laird", "telephone" : "212-618-9176", "date" : "04/17/2013", "address" : { "city" : "New York", "street" : "3204 Oakwood Avenue", "zipcode" : "10007" }, "department" : {"name" : "Sales", "director" : "Ellis L Williams", "budget" : "100000", "address" : {"city" : "New York", "street" : "3687 Pallet Street", "zipcode" : "10007" } } }

{ "id" : "2345", "name" : "Ashley G Ramos", "telephone" : "212-841-8695", "date" : "09/21/2014", "address" : { "city" : "New York", "street" : "191 Small Street", "zipcode" : "10019" }, "department" : {"name" : "Sales", "director" : "Ellis L Williams", "budget" : "100000", "address" : {"city" : "New York", "street" : "3687 Pallet Street", "zipcode" : "10007" } } }

{ "id" : "3456", "name" : "Jeremy J Blackwell", "telephone" : "347-312-6195", "date" : "03/07/2016", "address" : { "city" : "New York", "street" : "3509 My Drive", "zipcode" : "10036" } }

Questions

Ask a question
You must sign in to ask a question. If you do not have an account, you can register here

Featured content

DENODO TRAINING

Ready for more? Great! We offer a comprehensive set of training courses, taught by our technical instructors in small, private groups for getting a full, in-depth guided training in the usage of the Denodo Platform. Check out our training courses.

Training