Goal
This document describes how to access hierarchical data represented in JSON format in Hive from Denodo Virtual DataPort.
Content
Hierarchical data is data that is organized into a tree structure that has information levels related between them. For example, we can use the model to represent an employee that can contain data about their ID, name, telephone, address and department. We can also split the field address into the fields city, street, and number, while the field department could be split into the fields name, director, budget, address…
For an employee, we can represent the hierarchical data using the JSON data format:
{ "id" : "1234", "name" : "Christopher C Laird", "telephone" : "212-618-9176", "date" : "04/17/2013", "address" : { "city" : "New York", "street" : "3204 Oakwood Avenue", "zipcode" : "10007" }, "department" : {"name" : "Sales", "director" : "Ellis L Williams", "budget" : "100000", "address" : {"city" : "New York", "street" : "3687 Pallet Street", "zipcode" : "10007" } } } |
Hierarchical data can be managed in Hive using different data types:
- String. Example: JSON encoded string
- arrays: ARRAY<data_type>
- maps: MAP<primitive_type, data_type>
- structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
- union: UNIONTYPE<data_type, data_type, ...>
In this document, we will focus on JSON encoded strings and how to access this information from the Virtual DataPort, leveraging the capabilities of Hive JSON format management.
Starting with the example above, the JSON example models an employee entity in a given company. It has many fields that are relevant for the company, like the name, address, department and telephone of the employee.
For our example queries, we are going to use a JSON document with three employees; the first two have information for all the fields whereas the last one does not include the department information, simulating the scenario where an employee has recently arrived at the company meaning not all of his information is complete yet. We will use this employee without a department to demonstrate how to combine information that is not available in all the registers of a table. Moreover, we will assume that this JSON table is called employee and that it exists in our Hive environment.
All the examples below are built using the “Create from query” option available for JDBC data sources. It is also possible to create the base view through introspection, in which case the JSON would be obtained as a single text column and then we would need to create an additional JSON data source in order to extract the fields from the plain text column using the “From variable” option. Using the “Create from query” option when querying Hive we are going to use the LATERAL VIEW clause and the json_tuple UDTF (Built-in Table-Generating Function).
In a first example, the view employee is created from a very simple query. We are going to get the fields id, name, telephone and hiring date of the employee. These fields are top level fields in the JSON so the query that we will use for this will include just one LATERAL VIEW clause:
SELECT employee_id, employee_name, employee_telephone, employee_date FROM employee LATERAL VIEW json_tuple (employee.json, 'id', 'name', 'telephone', 'date') emp AS employee_id, employee_name, employee_telephone, employee_date |
employee view
This query works as follows: the json_tuple function extracts the attributes passed as parameters from the json string and then, the LATERAL VIEW clause appends this information to the results of the query acting as a join.
In the following example, we are going to add the address information of the employee to the results of the query. For this, we will use a second LATERAL VIEW clause in order to join the employee fields with the address information:
SELECT employee_id, employee_name, employee_telephone, employee_date, addr_city, addr_street, addr_zipcode FROM employee LATERAL VIEW json_tuple (employee.json, 'id', 'name', 'telephone', 'date', 'address') emp AS employee_id, employee_name, employee_telephone, employee_date, employee_address LATERAL VIEW json_tuple (emp.employee_address, 'city', 'street', 'zipcode') addr AS addr_city, addr_street, addr_zipcode |
employee_address view
Now, an example to show that LATERAL VIEW can be used with JSON attributes that only appear in some of the registers. In this case we are going to add the department information to the employee and address fields. Remember that the data for one of the employees does not include department data. We will see how the department fields for this register will contain null values, allowing the query to be executed successfully.
SELECT employee_id, employee_name, employee_telephone, employee_date, addr_city, addr_street, addr_zipcode, dept_name, dept_director, dept_budget FROM employee LATERAL VIEW json_tuple (employee.json, 'id', 'name', 'telephone', 'date', 'address', 'department') emp AS employee_id, employee_name, employee_telephone, employee_date, employee_address, employee_department LATERAL VIEW json_tuple (emp.employee_address, 'city', 'street', 'zipcode') addr AS addr_city, addr_street, addr_zipcode LATERAL VIEW json_tuple (emp.employee_department, 'name', 'director', 'budget') dept AS dept_name, dept_director, dept_budget |
employee_department view
As a conclusion, we have learned how to use LATERAL VIEW and json_tuple to query JSON tables through Hive from Denodo to extract the required fields and to avoid post-processing the original JSON string column on the Virtual DataPort side.
References
Appendix
JSON file used for the examples:
{ "id" : "1234", "name" : "Christopher C Laird", "telephone" : "212-618-9176", "date" : "04/17/2013", "address" : { "city" : "New York", "street" : "3204 Oakwood Avenue", "zipcode" : "10007" }, "department" : {"name" : "Sales", "director" : "Ellis L Williams", "budget" : "100000", "address" : {"city" : "New York", "street" : "3687 Pallet Street", "zipcode" : "10007" } } } { "id" : "2345", "name" : "Ashley G Ramos", "telephone" : "212-841-8695", "date" : "09/21/2014", "address" : { "city" : "New York", "street" : "191 Small Street", "zipcode" : "10019" }, "department" : {"name" : "Sales", "director" : "Ellis L Williams", "budget" : "100000", "address" : {"city" : "New York", "street" : "3687 Pallet Street", "zipcode" : "10007" } } } { "id" : "3456", "name" : "Jeremy J Blackwell", "telephone" : "347-312-6195", "date" : "03/07/2016", "address" : { "city" : "New York", "street" : "3509 My Drive", "zipcode" : "10036" } } |
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.