Connecting to Apache Hive

Apache Hive is software that facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL.
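
As an illustration of that SQL-like syntax, a HiveQL query against the prospect table created later in this tutorial could look like the following (the filter value 1 is purely illustrative):

  hive> SELECT name, last_name FROM prospect WHERE client_type = 1;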

Configuring Hive

Before we start with the Big Data integration using Hive, we are going to add a sample file with a list of clients to our Apache Hadoop distribution. This file represents the potential client data obtained from the marketing department. You can find the file under <tutorial_directory>/Datasources/BigData/newClients.csv.

NOTE

Feel free to use any Apache Hadoop distribution to follow this tutorial. For this tutorial, we will assume that Apache Hive is already installed.

To create a Hive table, log in to the system and follow these steps:

  1. Save the newClients.csv file to a location in the local file system.
  2. Add the file to the Hadoop Distributed File System (HDFS):

     $ hadoop fs -copyFromLocal /path/newClients.csv /home/denodo/

  3. Open the Hive command-line shell and create a Hive table called prospect for the new client data:

     hive> CREATE TABLE prospect (name STRING, last_name STRING, client_type INT, area_code INT)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

  4. Load the data from the newClients.csv file into the new table:

     hive> LOAD DATA INPATH '/home/denodo/newClients.csv' OVERWRITE INTO TABLE prospect;

  5. Now that you have created the table and loaded the data, you can check that everything is working as expected by running the following query:

     hive> SELECT * FROM prospect;

It should return a few records.
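
To make the comma-delimited format above concrete, each line of newClients.csv is expected to hold the four fields declared in the prospect table. A hypothetical row (the values are illustrative, not taken from the actual tutorial file) would be read as follows:

  John,Smith,1,914   ->   name='John', last_name='Smith', client_type=1, area_code=914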

Hive Set Up

Creating the JDBC Hive data source

Once the data is incorporated into the Hive table, it is very easy to access it from the Denodo server using Hive's JDBC driver. Download the JDBC driver for your Hive database from the vendor's website, then copy its JAR files into the folder DENODO_HOME/lib-external/jdbc-drivers. There you will see a folder for each adapter; paste the driver's JAR files into the appropriate one.
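
As a rough sketch, assuming you downloaded the Hive 2.0.0 standalone JDBC driver and that the matching adapter folder is named hive-2.0.0 (check the actual folder names under jdbc-drivers in your installation), the copy step could look like this:

  $ cp hive-jdbc-2.0.0-standalone.jar $DENODO_HOME/lib-external/jdbc-drivers/hive-2.0.0/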

  1. Create a new folder called Big Data.
  2. Right-click on the Big Data folder and select New > Data source > JDBC.
  3. Name the data source hive_ds.
  4. Select Hive 2.0.0 (HiveServer2) as the Database Adapter, making sure that you select the one that matches your Hive installation; the data source will not work otherwise.

    TIP

    If your version is not listed, you will have to use one of the available Hive options as the Database adapter and then configure the Driver class path in the JDBC data source to point to the directory where your Hive driver and any additional libraries it needs are stored.
  5. Change the Database URI to match your Hive installation, for example, jdbc:hive2://localhost:10000/default.
  6. Fill in the login and password fields, as needed.
  7. Click Save and then Create base view.
  8. The Administration Tool will show the introspected schema of your installation.

To incorporate some of the tables into the Denodo Virtual Schema, you have to check the box next to the tables or views you want to import. In this case, check prospect and then click the Create selected button.

You can later query this base view, or combine its data with data from other views.
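
For example, assuming you also had a base view called clients imported from another data source (a hypothetical view, not part of this tutorial), you could join it with prospect to find prospects in areas where you already have clients:

  SELECT p.name, p.last_name, p.area_code
  FROM prospect p JOIN clients c ON p.area_code = c.area_code;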

When the importing process is finished, you will see the new views in the elements tree panel. If you double-click on the view name, the schema of the base view is shown in the workspace.

If you execute a query on the recently created view, Denodo will delegate it to the Hive data source. Hive will translate the query into the necessary MapReduce jobs and return the results to Denodo. The results should be the same as those returned when executing the same query from the Hive command line.
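
For instance, a filtered query such as the following (the filter value is illustrative) would be delegated to Hive, executed there as MapReduce jobs, and should return the same rows as it does in the Hive shell:

  SELECT name, last_name FROM prospect WHERE area_code = 914;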

There you go, your own Hive data source in Denodo!