Connecting to Apache Hive

Using Apache Hive

Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL.
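
HiveQL looks very much like standard SQL. As a small illustrative example (it uses the prospect table that is created later in this tutorial and a purely hypothetical filter value), a HiveQL query could look like this:

  hive> SELECT name, last_name FROM prospect WHERE client_type = 1;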

Configuring Hive

Before we start with the Big Data integration using Hive, we are going to add a sample file with a list of clients to our Apache Hadoop distribution. This file represents the data that is obtained by the marketing department. You can find the file under <tutorial_directory>/Datasources/BigData/newClients.csv.
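
The exact contents of the file depend on your tutorial download, but a hypothetical excerpt with the four comma-separated columns used later (name, last name, client type and area code) would look like this; the names and values below are purely illustrative:

  Mary,Smith,1,212
  John,Doe,2,345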

Feel free to use any Apache Hadoop distribution to follow this tutorial.

For this tutorial, we will assume that Apache Hive is already installed as part of the system.

NOTE

To create a Hive table, log in to the system and follow these steps:

  1. Copy the newClients.csv file to some location in the local file system.
  2. Add the file to the Hadoop Distributed File System (HDFS): $ hadoop fs -copyFromLocal /path/newClients.csv /home/denodo/
  3. Open the Hive command line shell and create a Hive table called prospect for the new client data: hive> CREATE TABLE prospect ( name STRING, last_name STRING, client_type INT, area_code INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  4. Load the data from the newClients.csv file into the new table: hive> LOAD DATA INPATH '/home/denodo/newClients.csv' OVERWRITE INTO TABLE prospect;
  5. Now that you have created the table and loaded the data, you can check that everything is working as expected by running the following query: hive> SELECT * FROM prospect;

It should return a few records.
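
For reference, this is the complete sequence of commands from the steps above (using /path/newClients.csv as the local location of the file and /home/denodo/ as the HDFS target directory, exactly as in step 2):

  # copy the file from the local file system into HDFS
  $ hadoop fs -copyFromLocal /path/newClients.csv /home/denodo/

  -- create the prospect table and load the new client data into it
  hive> CREATE TABLE prospect (
          name STRING,
          last_name STRING,
          client_type INT,
          area_code INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE;
  hive> LOAD DATA INPATH '/home/denodo/newClients.csv' OVERWRITE INTO TABLE prospect;

  -- verify that the data is in place
  hive> SELECT * FROM prospect;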

Hive Setup

Creating the JDBC Hive data source

Once the data is incorporated into the Hive table, it is very easy to access this table from the Denodo server using Hive's JDBC drivers.

  1. Create a new folder called ‘Big Data’.
  2. Right-click on the ‘Big Data’ folder and select “New > Data source > JDBC”.
  3. Name the data source, for example, hive_ds.
  4. Select "Hadoop / Hive 0.10.0" as Database Adapter (from the different Driver versions available (0.7.1, 0.11.0, ...) make sure that you select the one that matches your Hive installation. Otherwise, the data source will not work).

    If your version is not listed you will have to use any of the available Hive options as Database adapter and then configure the Driver class path in the JDBC data source to point to the directory where your Hive driver and the additional libraries needed are stored.

    TIP
  5. Change the Database URI to match your Hive installation, for example, jdbc:hive://localhost:10000/default.
  6. Fill in the login and password fields.
  7. Click "Ok" twice.
  8. The Administration Tool will show the introspected schema of your installation.
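
    The Database URI in step 5 assumes a Hive server listening on port 10000 of the local machine. If the connection fails, check that the server is actually running; with the older Hive releases targeted by the hive:// protocol (HiveServer1), it can usually be started with a command like the one below (it listens on port 10000 by default), although many Hadoop distributions already manage this service for you:

    $ hive --service hiveserver &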

To incorporate some of the tables into the Denodo virtual schema, check the box next to the tables or views you want to import. In this case, check prospect and then click the Create selected base views button.

You can later query this base view or combine its data with data from other views.

When the import process is finished, you will see the new views in the elements tree panel. If you double-click on a view name, the schema of the base view is shown in the workspace.

If you execute a query on the recently created view, Denodo will delegate the query to the Hive data source. Hive will translate the query into the necessary MapReduce jobs and it will return the results to Denodo. The results returned should be the same as the ones that were returned when executing the same query from the Hive command line.
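
For example, running the same query that was used earlier in the Hive shell, this time against the prospect base view from the Denodo Administration Tool, should produce identical results (and, depending on the capabilities of the selected adapter, conditions such as WHERE clauses may also be pushed down to Hive):

  SELECT * FROM prospect;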