Importing data using Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

In this tutorial, we will learn how to configure Sqoop to access the Denodo Platform, and how to import data coming from a Virtual DataPort view into an Apache Hadoop installation.

First of all, we will have to add the VDP JDBC driver to the Sqoop installation as Sqoop does not ship with third-party JDBC drivers. You will just have to copy the Denodo VDP JDBC driver file (<DENODO_HOME>/lib/vdp-jdbcdriver-core/denodo-vdp-jdbcdriver.jar) to the lib folder of your Sqoop installation. For instance, in a Linux-based Hadoop distribution this will typically be the /usr/lib/sqoop/lib/ directory.

Once the VDP driver has been added to Sqoop, you can use Sqoop to transfer data between the Denodo Platform and Hadoop. For instance, let's suppose that the billing department needs access to the amount_due_by_client view that we created in the Basics Tutorial from the company's Big Data system.

To add this data to the Hadoop FileSystem you will just have to run the following command: $ sqoop import --connect jdbc:vdb://hostname:9999/tutorial --table amount_due_by_client --username admin --password admin --driver com.denodo.vdp.jdbc.Driver -m 1 where:

  • The --connect parameter is the URL where the VDP server is listening, including the virtual database name.
  • --table is the name of the VDP view to transfer to HDFS.
  • --username and --password are the VDP authentication credentials.
  • --driver is the name of the class that implements the JDBC driver, always com.denodo.vdp.jdbc.Driver for connections to VDP.
  • and -m 1 is an option to perform a sequential import. This option is needed because we did not set a primary key in the view that we are importing. (If the view has a primary key this option can be removed.)
Sqoop Import

This command will create a folder in HDFS with the same name as the view that we are importing, in this case amount_due_by_client. This folder will contain the part files with the data imported from the VDP view. If what you want to do is to create a Hive table with this data, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line:

$ sqoop import --connect jdbc:vdb://hostname:9999/tutorial --table amount_due_by_client --username admin --password admin --driver com.denodo.vdp.jdbc.Driver -m 1 --hive-import

Once you run the command adding the Hive option, Sqoop will automatically add a table called amount_due_by_client to the Hive metastore and, if you query this table using the Hive command line tool, you will see the data that you just imported.

Sqoop Hive Import