
Goal

This document provides an overview of the various ways to connect Denodo with Hadoop.

Overview

Hadoop is an open source ecosystem of technologies and tools for processing large volumes of data on commodity hardware. Hadoop is built around the Hadoop Distributed File System (HDFS) and the MapReduce approach to data processing, but a range of tools has been layered on top of HDFS, for instance HCatalog and Hive. Various other tools have been added by vendors (e.g. Impala) or are loosely associated with Hadoop (like Apache Spark).

Since Hadoop is neither vendor-supported software nor governed by a single well-defined standard, Denodo supports the most commonly deployed technologies for connecting to Hadoop.

Technologies covered in this article

Hive

Hive is a SQL engine and data warehouse infrastructure designed to run SQL queries on HCatalog data through MapReduce jobs. Hive comes with a JDBC driver, which Denodo can readily use to connect.
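For illustration, a minimal Java sketch of such a connection through the HiveServer2 JDBC driver might look as follows; the host name, port (10000 is the HiveServer2 default), database and credentials are placeholders for your environment:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC URL; host, port and database are placeholders.
            String url = "jdbc:hive2://hadoop-master.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

Denodo itself only needs the same driver jar, URL and credentials when the Hive JDBC data source is created.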

Impala

Cloudera Impala is a SQL engine shipped with the Cloudera Hadoop distribution that runs fast, interactive SQL queries directly on Hadoop data stored in HDFS or HBase. Impala provides a JDBC driver, which Denodo can readily use to connect.
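The connection pattern is the same as for Hive; only the driver and URL change. A minimal sketch, assuming the Cloudera Impala JDBC driver (the exact driver class name varies by driver version) and the default Impala daemon JDBC port 21050:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            // Driver class name varies by driver version; this is the JDBC 4.1 variant.
            Class.forName("com.cloudera.impala.jdbc41.Driver");
            // 21050 is the default Impala daemon JDBC port; host and database are placeholders.
            String url = "jdbc:impala://impala-daemon.example.com:21050/default";
            try (Connection conn = DriverManager.getConnection(url)) {
                System.out.println("Connected to Impala");
            }
        }
    }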

Sqoop

Sqoop is a connectivity tool for moving data from non-Hadoop data stores into Hadoop. Sqoop can access data prepared in Denodo through the standard Denodo JDBC driver for movement into Hadoop.
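As a sketch, the same Denodo driver class and URL that a Sqoop import would receive through its --driver and --connect options can first be verified from plain Java; the host, port (9999 is the default Virtual DataPort port), database and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class DenodoJdbcCheck {
        public static void main(String[] args) throws Exception {
            // Denodo Virtual DataPort JDBC driver and URL; Sqoop would be pointed
            // at the same values via its --driver and --connect options.
            Class.forName("com.denodo.vdp.jdbc.Driver");
            String url = "jdbc:vdb://denodo-server.example.com:9999/admin";
            try (Connection conn = DriverManager.getConnection(url, "admin", "admin")) {
                System.out.println(conn.getMetaData().getDatabaseProductName());
            }
        }
    }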

PIG

PIG is a simple scripting language and command-line tool (its shell is commonly called Grunt) for performing various common tasks in a Hadoop environment. PIG is commonly used for data preparation, import, export and maintenance tasks. PIG has no remote calling interface, so Denodo calls PIG scripts through the Denodo Connect SSH Custom Wrapper and consumes the results of those scripts by other means.

Denodo Distributed File System Custom Wrapper and Hadoop

HDFS is the core of Hadoop. It is a highly fault-tolerant distributed file system designed to run on commodity hardware. Denodo can read directly from HDFS files in a Hadoop cluster through the Denodo Distributed File System Custom Wrapper. Data in the following common file formats is parsed into a relational format (a sketch of reading such a file directly from HDFS follows the list):

  • Delimited text files
  • Sequence files
  • Map files
  • Avro files
  • Parquet files
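As a minimal sketch of what the wrapper does under the hood, the snippet below reads a delimited text file straight from HDFS with the Hadoop FileSystem API; the NameNode address (8020 is a common default port) and the file path are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS points at the NameNode; host, port and path are placeholders.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            try (FileSystem fs = FileSystem.get(conf);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(fs.open(new Path("/data/sales/part-00000.csv"))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Each line is a delimited record; the custom wrapper performs
                    // the equivalent split into relational columns.
                    System.out.println(line);
                }
            }
        }
    }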

HCatalog

HCatalog is a metadata and table management system for Hadoop. It enables interoperability across data processing tools such as PIG, MapReduce, Streaming, and Hive. Denodo can access HCatalog through the HCatalog REST interface.
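For illustration, a metadata lookup against the WebHCat (Templeton) REST interface can be sketched as a plain HTTP GET; the host, default port 50111, database, table and user name are placeholders:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HCatalogRestExample {
        public static void main(String[] args) throws Exception {
            // WebHCat DDL endpoint describing a table; 50111 is the default port.
            URL url = new URL("http://hadoop-master.example.com:50111/templeton/v1/"
                    + "ddl/database/default/table/sales?user.name=hcatuser");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line); // JSON description of the table's columns
                }
            }
        }
    }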

Pivotal HAWQ

HAWQ is a parallel, distributed SQL query engine built on top of Hadoop and based on PostgreSQL 8.2.15. Denodo can connect to Pivotal HAWQ using standard PostgreSQL JDBC drivers.
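Because HAWQ speaks the PostgreSQL wire protocol, a minimal connection sketch only needs the stock PostgreSQL JDBC driver; the host, port (5432 is the usual default), database and credentials are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HawqJdbcExample {
        public static void main(String[] args) throws Exception {
            // Standard PostgreSQL JDBC URL pointed at the HAWQ master.
            String url = "jdbc:postgresql://hawq-master.example.com:5432/postgres";
            try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT version()")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }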

Hadoop MapReduce jobs

MapReduce jobs have no remote calling interface. Denodo calls MapReduce scripts through the Denodo Connect SSH and Distributed File System Custom Wrappers and consumes the results directly.

HBase

HBase is a column-oriented NoSQL data storage environment designed to support large, sparsely populated tables in Hadoop. Denodo will connect to HBase through the Denodo HBase Custom Wrapper.
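As background on what the wrapper connects to, a minimal Java sketch using the native HBase client API (HBase 1.x style) might look as follows; the ZooKeeper quorum, table and column names are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseScanExample {
        public static void main(String[] args) throws Exception {
            // The HBase client locates the cluster through ZooKeeper.
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk.example.com");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("customers"));
                 ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    byte[] name = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                    System.out.println(Bytes.toString(row.getRow()) + " -> " + Bytes.toString(name));
                }
            }
        }
    }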

SparkSQL

SparkSQL is a SQL engine built on top of Spark. It is largely Hive-compatible but offers shorter response times. Denodo connects to SparkSQL through the Hive JDBC driver.

Note that SparkSQL and Hive may coexist on the same Hadoop cluster but listen on different ports. When browsing Hive and SparkSQL metadata on the same Hadoop cluster you will see the same tables. Although the same JDBC driver is used, Denodo needs to know which technology it is connecting to, since SparkSQL is not 100% Hive-compatible.
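A minimal sketch: the Spark Thrift Server speaks the HiveServer2 protocol, so the connection code is identical to the Hive example above apart from the port (10015 is a common choice in some distributions, but check your cluster configuration):

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class SparkSqlJdbcExample {
        public static void main(String[] args) throws Exception {
            // Same Hive JDBC driver, different port: this URL targets the Spark
            // Thrift Server rather than HiveServer2. Host and port are placeholders.
            String url = "jdbc:hive2://hadoop-master.example.com:10015/default";
            try (Connection conn = DriverManager.getConnection(url, "spark", "")) {
                System.out.println("Connected to SparkSQL");
            }
        }
    }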

Spark

Spark is a fast, parallel, general-purpose data processing and analytics engine with streaming capabilities. Spark is not SQL-compatible and has no native remote interface; it is generally used to run complex analytic scripts on Hadoop clusters. Denodo can call Spark scripts through the Denodo Connect SSH Custom Wrapper and consume the resulting output, usually written to a Hive table, by other means.

Spark and SparkSQL

Note that SparkSQL is not identical to Spark. SparkSQL makes use of the Spark framework, but has different capabilities.

  • SparkSQL provides SQL capabilities and remote access, but does not allow access to Spark scripts.
  • Spark scripts allow access to a wide range of analytical libraries written in Java, and Spark can internally use SparkSQL to pre-process data.

Accessing Hadoop data as a relational data source

The following technologies/tools can be integrated as JDBC data sources into Denodo:

Hive

  • Fully supported standard product feature

Impala

  • Fully supported standard product feature

SparkSQL

  • Fully supported standard product feature

HAWQ

  • Standard product feature (PostgreSQL-compatible)

Other relational data sources

HDFS

  • Provided through the Denodo Distributed File System Custom Wrapper

HBase

  • Provided through the Denodo HBase Custom Wrapper

MapReduce

  • Provided through the Distributed File System and SSH Custom Wrappers.

Accessing Denodo from Hadoop

Denodo should be accessed through the standard Denodo JDBC driver.

Sqoop

  • Access to Denodo through the Denodo JDBC driver has been tested with Sqoop.

Other ways of accessing Hadoop

PIG

  • Access through the Denodo Connect SSH Custom Wrapper

Spark

  • Access through the Denodo Connect SSH Custom Wrapper

HCatalog

  • Access through the standard REST wrapper

Certifications

Cloudera

The Denodo Platform is certified with the following Cloudera 5 Hadoop features for both Kerberos-secured and unsecured environments:

  • Apache Hive
  • Cloudera Impala
  • Hadoop MapReduce
  • Apache HBase
  • HDFS (Hadoop Distributed File System)

Hortonworks

The Denodo Platform is certified with the following Hortonworks Data Platform 2.1 Hadoop features for both Kerberos-secured and unsecured environments:

  • Apache Hive
  • Apache Avro
  • Hadoop MapReduce
  • Apache HBase
  • HDFS (Hadoop Distributed File System)

Disclaimer

The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.
