Using Notebooks for Data Science with Denodo

Applies to: Denodo 7.0
Last modified on: 10 Sep 2018
Tags: Jupyter Notebooks Zeppelin External clients ODBC driver


Introduction

Notebooks can be saved as files, checked into revision control just like code, and freely shared. They run anywhere, thanks to their browser-based user interface. Though the most influential notebook, Jupyter, has its origins in the Python programming language, it now supports many other programming languages, including R, Scala, and Julia. The popular Apache Spark analytics platform has its own notebook project, Apache Zeppelin, which includes Scala, Python, and SparkSQL capabilities, as well as providing visualization tools.

Querying Denodo with Zeppelin Notebook

Zeppelin is an open-source notebook that is primarily used with Spark. Its back end is written in Java and its front end in Angular.

Configuring Denodo as an Interpreter in Zeppelin

  1. Download Zeppelin and run it. It also ships with standard distributions such as HDP and AWS EMR.
  2. Open Zeppelin’s home page (default URL: http://localhost:8080), click the user-name drop-down at the top, and select Interpreter.

  3. Click Create and select jdbc as the Interpreter group. Use denodo as the Interpreter Name. Fill in the values for the JDBC driver class name, URL, user, and password.

  4. Add the location of Denodo’s JDBC driver in the Dependencies > artifact field.

  5. In the notebook, make sure denodo is selected as the interpreter; you can then start a paragraph with %denodo to run queries against the Denodo interpreter.

  6. Run your queries.
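As a sketch of step 3, the interpreter properties for a local Denodo installation might look like the following, assuming Zeppelin's standard JDBC interpreter property names; the database name, user, and password are placeholders to adjust to your environment:

```
default.driver    com.denodo.vdp.jdbc.Driver
default.url       jdbc:vdb://localhost:9999/machine_learning
default.user      admin
default.password  your_password
```

With the driver jar registered under Dependencies > artifact, a notebook paragraph can then query any Denodo view (text_prediction here is just an example name):

```
%denodo
SELECT * FROM text_prediction
```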

Denodo as a source in Spark using DataFrames

A common way to use Denodo data from Spark is to move the results of a Denodo query into a remote table in Spark.

However, in many cases it is also useful to ingest data directly from Denodo, without the need for replication. You can do so with Spark’s DataFrames API via JDBC.

To connect to Denodo from a Spark interpreter, first place Denodo’s JDBC driver jar in the $SPARK_HOME/jars folder and restart Zeppelin. Once that is done, you can query a view in Denodo with code like the following:

```python
%spark.pyspark

# Denodo connection parameters
url = "jdbc:vdb://localhost:9999/machine_learning"
user = "admin"
password = "your_password"
dbtable = "text_prediction"

# Load the Denodo view into a Spark DataFrame over JDBC
df = sqlContext.read.format("jdbc").option("url", url).option("user", user) \
    .option("password", password).option("dbtable", dbtable).load()

# Preview the first rows
df.show(5)
```

Querying Denodo with Jupyter Notebook

Installation

  1. The easiest way to start working with Jupyter is to install Anaconda, a widely used Python distribution that bundles many libraries used by data analysts.
  2. Start Jupyter from the command line: jupyter notebook

Creating a Denodo DSN

Jupyter will use ODBC to access Denodo, so you need to download and install the Denodo ODBC driver and create a DSN.
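On Windows, the DSN is created in the ODBC Data Source Administrator. On Linux or macOS, it can be defined in odbc.ini; the following is a minimal sketch, assuming the default Denodo ODBC port (9996) and placeholder values for the driver path, database, and credentials:

```
[DenodoODBC]
Driver     = /path/to/denodo-odbc-driver.so
Servername = localhost
Port       = 9996
Database   = machine_learning
Username   = admin
Password   = your_password
```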

Creating a notebook

A browser window will open automatically, listing the contents of the current working directory. Select New > Python to create a notebook.

Querying Denodo

The recommended way to connect to a database from Jupyter is through ODBC with the pyodbc library. Once you have created the DSN, you can use a standard pyodbc program to query Denodo. For example:

```python
# pyodbc library for the ODBC connection
import pyodbc

# pandas for table display
import pandas

# The DenodoODBC DSN is defined in the Windows DSN Manager
connection = pyodbc.connect("DSN=DenodoODBC")

# Execute the query and display the result using pandas
df = pandas.read_sql("SELECT * FROM dual()", connection)
df
```

Output

You can also use pandas to draw charts easily:
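As a sketch, assuming a hypothetical query result (in the notebook, df would come from pandas.read_sql as above, and running %matplotlib inline first makes the chart render in the cell):

```python
import pandas

# Stand-in for a query result; in the notebook this would come from
# pandas.read_sql("SELECT ...", connection)
df = pandas.DataFrame({"region": ["north", "south", "east"],
                       "sales": [120, 95, 143]})

# Draw a bar chart of sales per region (requires matplotlib)
ax = df.plot.bar(x="region", y="sales", title="Sales by region")
```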
