How to connect to Denodo from Python - a starter for Data Scientists

Applies to: Denodo 7.0
Last modified on: 08 Apr 2020
Tags: Pandas Turbodbc Dataframe JDBC driver Jaydebeapi Numpy Pyodbc ODBC driver Data Science

Download document

You can translate the document:

Introduction

Python is one of the most widely used programming languages in the Data Scientists community for its flat learning curve and a very rich analytical stack, that spans from data preparation (e.g. pandas, numpy), through machine learning (e.g. scikit-learn) to data visualization (e.g. bokeh, seaborn, Apache Superset). Denodo can get its spot in the Data Scientist workflow as a powerful data preparation and governance tool, that offers the most optimized and widest access both to internal and external data sources.

For this reason, in this article we will discuss the different options available for establishing the connection and importing data from Denodo to Python and how we can transform them into pandas dataframes, the data structure that represents the starting point of data exploration, analysis and visualization as well as training of machine learning algorithms.

As we are addressing pandas as a computing environment, we are limiting this analysis to single-machine environments, with data batches that must be loadable in memory.

Establishing the connection

There are several Python libraries that we can use to establish the connection between Python and Denodo, we can get an overview of the main ones in the following table:

Library

Protocol

Project Website

Comment

pyodbc

ODBC

https://mkleehammer.github.io/pyodbc/

The most popular and most mature ODBC-based library to connect to relational databases.

turbodbc

ODBC

https://turbodbc.readthedocs.io/en/latest/

Turbodbc developers explicitly target Data Scientists. This library has built-in numpy and Apache Arrow optimizations

jaydebeapi

JDBC

https://github.com/baztian/jaydebeapi

The main JDBC-based library, it makes use of JPype to bridge Python and Java code

To all these libraries, Denodo is exposed as a standard relational database. All these libraries comply with the Python standard DB-API v2.0 .

Connection through ODBC

Prerequisites

To be able to establish the connection between Denodo and Python through ODBC you must install the ODBC driver in the machine where your Python interpreter is installed.

If your machine runs Windows, follow the instructions here.

If your machine runs Linux or other Unix OS, follow the instructions here.

Example Code: pyodbc with DSN

In the following example, we connect to a Denodo server using the pyodbc library and by specifying a DSN.

When choosing to use a DSN-based or a DSN-less connection, take into account that by using a DSN-based connection we keep the connection script cleaner as we hide connection parameters and complexity to the final users. However DSN definition, maintenance and modification are typically a sysadmin task, meaning that their use may slow down development iterations in early phases of the projects.

## Script name: pyodbc-dsn-connection.py

## Importing the main library used to connect to Denodo via ODBC
import pyodbc as dbdriver

## Importing the gethostname function from socket to
## put the hostname in the useragent variable
from socket import gethostname

## ODBC DSN. It must be configured beforehand. See instructions
## https://community.denodo.com/docs/html/browse/7.0/vdp/developer/access_through_odbc/access_through_odbc
denodoserver_dsn =
"Denodo_DSN_20200310_tpcds"

## Create the useragent has the concatenation of
## the client hostname and the python library used
client_hostname = gethostname()
useragent =
"%s-%s" % (dbdriver.__name__,client_hostname)

## Establishing a connection
cnxn = dbdriver.connect(
"DSN=%s;UserAgent=%s" % ( denodoserver_dsn , useragent ) )

## Query to be sent to the Denodo VDP Server
query =
"select * from bv_store_returns"

## Define a cursor and execute the results
cur = cnxn.cursor()
cur.execute(query)

## Finally fetch the results. `results` is a list of pyodbc.Row,
## that is a datatype similar to a tuple
## If you don't want to load all the records in memory, you may want to use cur.fetchone() or
## cur.fetchmany()
results = cur.fetchall()

# >> len(results)
# 287514
# >> type(results)
# list

# >> type(results[0])
# Out[5]: pyodbc.Row

# >> results[0]
# (2451794, 40096, 1, 7157, 910283, 6421, 37312, 8, 16, 9750, 41, Decimal('72.57'), Decimal('6.53'), ...)

Example Code: DSN-less pyodbc

In the following example, we connect to a Denodo server with a DSN-less connection using the pyodbc library.

## Script name: pyodbc-dsnless-connection.py

## Importing the main library used to connect to Denodo via ODBC
import pyodbc as dbdriver

## Importing the gethostname function from socket to
## put the hostname in the useragent variable
from socket import gethostname

## Connection parameters to the Denodo VDP Server
denodoserver_name =
"ubuntu1804-denodo7"
## ODBC driver. It must be installed beforehand. See instructions
## https://community.denodo.com/docs/html/browse/7.0/vdp/developer/access_through_odbc/access_through_odbc
odbcdriver =
"DenodoODBCDriver-20200310"
## Default port for ODBC connections
denodoserver_odbc_port =
"9996"
denodoserver_database =
"distributed_tpcds"
denodoserver_uid =
"admin"
denodoserver_pwd =
"admin"

## Create the useragent as the concatenation of
## the client hostname and the python library used
client_hostname = gethostname()
useragent =
"%s-%s" % (dbdriver.__name__,client_hostname)

## Establishing a connection
cnxn = dbdriver.connect(
                        driver = odbcdriver,
                        server = denodoserver_name,
                        port = denodoserver_odbc_port,
                        database = denodoserver_database,
                        uid = denodoserver_uid,
                        pwd = denodoserver_pwd,
                        useragent = useragent,
                    )

## Query to be sent to the Denodo VDP Server
query =
"select * from bv_store_returns"

## Define a cursor and execute the results
cur = cnxn.cursor()
cur.execute(query)

## Finally fetch the results. `results` is a list of pyodbc.Row,
## that is a datatype similar to a tuple
## If you don't want to load all the records in memory, you may want to use cur.fetchone() or
## cur.fetchmany()
results = cur.fetchall()

# >> len(results)
# 287514
# >> type(results)
# list

# >> type(results[0])
# Out[5]: pyodbc.Row

# >> results[0]
# (2451794, 40096, 1, 7157, 910283, 6421, 37312, 8, 16, 9750, 41, Decimal('72.57'), Decimal('6.53'), ...)

Example Code: DSN-less turbodbc

In the following example, we connect to a Denodo server using the turbodbc library in DSN-less mode. No example will be offered here with the DSN as it does not add anything new with respect to the examples above.

## Script name: turbodbc-dsnless-connection.py

## Importing the main library used to connect to Denodo via ODBC
import turbodbc as dbdriver

## Importing the gethostname function from socket to
## put the hostname in the useragent variable
from socket import gethostname

## Connection parameters to the Denodo VDP Server
denodoserver_name =
"ubuntu1804-denodo7"
## ODBC driver. It must be installed beforehand. See instructions
## https://community.denodo.com/docs/html/browse/7.0/vdp/developer/access_through_odbc/access_through_odbc
odbcdriver =
"DenodoODBCDriver-20200310"
## Default port for ODBC connections
denodoserver_odbc_port =
"9996"
denodoserver_database =
"distributed_tpcds"
denodoserver_uid =
"tpcds_usr"
denodoserver_pwd =
"tpcds_usr"

## Create the useragent as the concatenation of
## the client hostname and the python library used
client_hostname = gethostname()
useragent =
"%s-%s" % (dbdriver.__name__,client_hostname)

## Establishing a connection
## The connect function has an additional 'turbodbc_options' parameter that may
## be used to tune performance parameters. See the documentation for further
## information (to read the doc. from a Python client, you can issue:
## turbodbc.make_options? )
cnxn = dbdriver.connect(
                       driver = odbcdriver,
                       server = denodoserver_name,
                       port = denodoserver_odbc_port,
                       database = denodoserver_database,
                       uid = denodoserver_uid,
                       pwd = denodoserver_pwd,
                       useragent = useragent,
                       turbodbc_options = dbdriver.make_options(use_async_io=
True)
                       )

## Query to be sent to the Denodo VDP Server
query =
"select * from bv_store_returns"

## Define a cursor and execute the results
cur = cnxn.cursor()
cur.execute(query)

## Finally fetch the results. `results` is a list of lists.
## If you don't want to load all the records in memory, you may
## want to use cur.fetchone() or cur.fetchmany()
results = cur.fetchall()

# >> len(results)
# 287514
# >> type(results)
# list

# >> type(results[0])
# list

# >> results[0]
# [2451794,40096,1,...]

## An interesting feature of turbodbc is that it can return the data
## already in numpy MaskedArray, one per column. This will:
## 1. speed up the data fetch process
## 2. dramatically speed up further calculations on the input data, in
##    particular the transformation to a pandas DataFrame.
results_array = cur.fetchallnumpy()

# >> type(results_array)

# >> type(results_array)

# collections.OrderedDict

# >> results["sr_returned_date_sk"]

# masked_array(data=[2451794, 2452608, 2452694, ..., 2452388, 2452548,

#                       2452649],

#                 mask=[False, False, False, ..., False, False, False],

#           fill_value=999999,

#                dtype=int64)

# >> type(results["sr_returned_date_sk"])

# numpy.ma.core.MaskedArray

Connection through JDBC  

Prerequisites

To be able to establish the connection between Denodo and Python the JDBC driver must be accessible from the machine where the Python interpreter is installed and a Java interpreter must be installed in the same machine.

Example Code - jaydebeapi

In the following example, we connect to a Denodo server using the jaydebeapi library.

## script name: jaydebeapi-connection.py

## Importing the main library used to connect to Denodo via JDBC
import jaydebeapi as dbdriver

## Importing the gethostname function from socket to
## put the hostname in the useragent variable
from socket import gethostname


# Connection parameters of the Denodo Server that we are connecting to
denodoserver_name =
"ubuntu1804-denodo7"
# This is the standard port for jdbc connections
denodoserver_jdbc_port =
"9999"
denodoserver_database =
"distributed_tpcds"
denodoserver_uid =
"tpcds_usr"
denodoserver_pwd =
"tpcds_usr"
denododriver_path =
"/opt/denodo/7.0-sol-man/tools/client-drivers/jdbc/denodo-vdp-jdbcdriver.jar"

## Create the useragent as the concatenation of
## the client hostname and the python library used
client_hostname = gethostname()
useragent =
"%s-%s" % (dbdriver.__name__,client_hostname)

## Creating a variable with the connection uri. We add here the UserAgent
## so the query can be better identified on the server. To append parameters you
## can use the syntax <param_name>=<param_value> and separate them with '&'.
## The full list of accepted parameters is available here
## https://community.denodo.com/docs/html/browse/7.0/vdp/developer/
##        access_through_jdbc/parameters_of_the_jdbc_connection_url/
##        parameters_of_the_jdbc_connection_url
conn_uri =
"jdbc:vdb://%s:%s/%s?userAgent=%s" % (denodoserver_name,
                                                                                            denodoserver_jdbc_port,
                                                                                            denodoserver_database,
                                                                                            useragent)
                                                                                           
cnxn = dbdriver.connect(
"com.denodo.vdp.jdbc.Driver",
                              conn_uri,
                              driver_args = {
"user": denodoserver_uid,
                                             
"password": denodoserver_pwd},
                              jars = denododriver_path
                              )

## Query to be sent to the Denodo VDP Server
query =
"select * from bv_store_returns"

## Define a cursor and execute the results
cur = cnxn.cursor()
cur.execute(query)

## Finally fetch the results. `results` is a list of tuples,
## If you don't want to load all the records in memory,
## you may want to use cur.fetchone() or cur.fetchmany()
results = cur.fetchall()

# >> len(results)
# 287514
# >> type(results)
# list
# >> type(results[0])
# tuple
# >> results[0]
# (2451794, 40096, 1, 7157, 910283, 6421, 37312, ...)

Getting the data into a pandas dataframe

In this section, we will see how the results imported in python can be transformed to a Pandas dataframe, the de-facto reference data structure to perform data transformation, cleaning and analysis in Python.

## Script name: results-to-pandas-dataframe.py

## You must run one of the scripts pyodbc-*-connection.py in
## the same session so that 'cur' and 'results' objects are defined.
## You may also want to run other *-connection.py script instead
## but the types conversion will fail, as the types depend on the
## import library

import pandas as pd
import decimal
import numpy as np

## Get the column names from the cursor
columns = [c[
0] for c in cur.description]

## Get the data in 'results' into a Pandas dataframe
df_results = pd.DataFrame.from_records(results,columns = columns)

# >> df_results.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 287514 entries, 0 to 287513
# Data columns (total 20 columns):
#  #   Column                 Non-Null Count   Dtype  
# ---  ------                 --------------   -----  
#  0   sr_returned_date_sk    277502 non-null  float64
#  1   sr_return_time_sk      277568 non-null  float64
#  2   sr_item_sk             287514 non-null  int64  
#  ...
#  19  sr_net_loss            277541 non-null  object
# dtypes: float64(9), int64(2), object(9)
# memory usage: 43.9+ MB

## As we can see in the above informative table, types are not translated
## correctly: int are read as floats (because of the  presence of null values)
## and decimal are read as object. As an additional step you may want to do,
## is to fix it by defining a type mapping dictionary and then loop over
## the columns to fix their type, one by one. The types dictionary is to be
## modified if we use other libraries to import the data (ex. jaydebeapi)

types_dict = {
   int:pd.Int64Dtype(),
   decimal.Decimal:np.float
}

types = [types_dict[c[
1]] for c in cur.description]
for c,tp in  zip(df_results.columns,types):
   df_results[c] = df_results[c].astype(tp)

# >> df_results.head()
#    sr_returned_date_sk  sr_return_time_sk  ...  sr_store_credit  sr_net_loss
# 0              2451794              40096  ...            39.60       600.91
# 1              2452608              38280  ...            37.94       490.10
# 2              2452694              30733  ...            84.11        45.17
# 3              2451783              46066  ...           106.62       295.05
# 4              2452016              33347  ...             9.82       106.21
   
# >> df_results.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 287514 entries, 0 to 287513
# Data columns (total 20 columns):
#  #   Column                 Non-Null Count   Dtype  
# ---  ------                 --------------   -----  
#  0   sr_returned_date_sk    277502 non-null  Int64  
#  1   sr_return_time_sk      277568 non-null  Int64  
#  2   sr_item_sk             287514 non-null  Int64  
#  ...
#  11  sr_return_amt          277486 non-null  float64
#  ...
# dtypes: Int64(11), float64(9)
# memory usage: 46.9 MB

Conclusion and performance considerations

In this article we have seen how a data scientist or engineer may start working with Denodo as a data source and getting data into their Python computing environment from Denodo.

In particular, we have shown how we establish the connection to a Denodo server, how to fetch data and how to transform this data into a Pandas dataframe, that is a starting point for further analysis.

From our internal testing, pyodbc and turbodbc show the best performance in data fetching and transformation to Pandas Dataframe time, with turbodbc always slightly faster. Interestingly, when using the numpy optimization turbodbc gets a dramatic performance boost for the transformation to a Pandas dataframe.

Turbodbc, that describes itself as “A turbocharged database access for data scientists”, offers the best performance but is less mature than pyodbc, so that deciding for one or the other may depend on other factors too.

Appendix

Password Encryption

In all the above scripts we are using the password to connect to the Denodo Server in clear. This may be acceptable for testing purposes but it is often forbidden in more security-constrained installations, such as production environments.

We can then make use of an encryption mechanism to avoid this situation and one of the possibilities is to use the pycryptodome package.

Provided that pycryptodome package is installed, the following code encrypts and saves the password for the user tpcds_usr:

from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key = get_random_bytes(
16)
cipher = AES.new(key, AES.MODE_EAX)
ciphertext, tag = cipher.encrypt_and_digest(
b'tpcds_usr')

file_out = open(
"/tmp/pwd.bin", "wb")
[ file_out.write(x)
for x in (cipher.nonce, tag, ciphertext) ]
file_out.close()

## Of course it is not a good idea to store the
## password and the key in the same place, but we are just giving an
## example here
key_out = open(
"/tmp/key.bin", "wb")
key_out.write(key)
key_out.close()

And the following code decrypt the password with the given key. The pwd variable is to be used in the dbdriver.connect function

## Read the key and the encrypted password then decrypt it
from Crypto.Cipher import AES

file_in = open(
"/tmp/pwd.bin", "rb")
nonce, tag, ciphertext = [ file_in.read(x)
for x in (16, 16, -1) ]
file_in.close()

file_key_in = open(
"/tmp/key.bin","rb")
key = file_key_in.read()
file_key_in.close()

cipher = AES.new(key, AES.MODE_EAX, nonce)
pwd = cipher.decrypt_and_verify(ciphertext, tag).decode(
"utf-8")

Python Environment

A test Python environment may be set up with the following steps, that will configure a working environment when deployed in the conditions listed in section Software Versions.

Installation of the required system packages:

sudo apt install g++ python3-dev unixodbc unixodbc-dev libboost-all-dev openjdk-8-jdk virtualenv

Initialization of a Python virtual environment:

mkdir /opt/pyvenv/
virtualenv --python=python3.6 /opt/pyvenv/denodo-to-python

Activation and preparation of the Python virtual environment with the required modules:

cd /opt/pyvenv/denodo-to-python
source ./bin/activate
pip install -r <requirements.txt>

Where requirements.txt is the environment file, that may copied and pasted from here:

JayDeBeApi==1.1.1
JPype1==0.6.3
numpy==1.18.1
pandas==1.0.2
pkg-resources==0.0.0
pybind11==2.4.3
pyodbc==4.0.30
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0
turbodbc==3.3.0

Software Versions

Python

3.6.9 (default, Nov  7 2019, 10:44:02)

[GCC 8.3.0] on linux

Denodo & Denodo JDBC/ODBC drivers

7.0 20200310

OS

Ubuntu 18.04.4 LTS

Questions

Ask a question
You must sign in to ask a question. If you do not have an account, you can register here

Featured content

DENODO TRAINING

Ready for more? Great! We offer a comprehensive set of training courses, taught by our technical instructors in small, private groups for getting a full, in-depth guided training in the usage of the Denodo Platform. Check out our training courses.

Training