Applies to: Denodo 8.0, Denodo 7.0
Last modified on: 10 Dec 2020
Tags: Data Science, Dataframe, JDBC driver, Jaydebeapi, Numpy, ODBC driver, Pandas, Pyodbc, Python, Turbodbc
Python is one of the most widely used programming languages in the data science community, thanks to its gentle learning curve and a very rich analytical stack that spans from data preparation (e.g. pandas, numpy), through machine learning (e.g. scikit-learn), to data visualization (e.g. bokeh, seaborn, Apache Superset). Denodo can earn its place in the data scientist's workflow as a powerful data preparation and governance tool that offers highly optimized and broad access to both internal and external data sources.
For this reason, in this article we discuss the different options available for establishing a connection and importing data from Denodo into Python, and how to transform the results into pandas dataframes, the data structure that is the starting point for data exploration, analysis, and visualization, as well as for training machine learning algorithms.
As we are addressing pandas as the computing environment, we limit this analysis to single-machine setups, with data batches that must fit in memory.
Establishing the connection
There are several Python libraries that can be used to establish the connection between Python and Denodo; the following table gives an overview of the main ones:
Library | Protocol | Project Website | Comment
pyodbc | ODBC | | The most popular and most mature ODBC-based library to connect to relational databases.
turbodbc | ODBC | | The turbodbc developers explicitly target data scientists. This library has built-in numpy and Apache Arrow optimizations.
jaydebeapi | JDBC | | The main JDBC-based library; it makes use of JPype to bridge Python and Java code.
psycopg2 | ODBC/libpq | | Psycopg is a popular adapter for PostgreSQL that can also be used for Denodo.
To all these libraries, Denodo is exposed as a standard relational database, and all of them comply with the Python standard DB-API v2.0.
If you are more familiar with SQLAlchemy, you can also consider using the Denodo Dialect for SQLAlchemy.
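As a minimal sketch, assuming the denodo-sqlalchemy dialect is installed, a SQLAlchemy connection might look like the following; the denodo:// URL scheme, host, port, and credentials are illustrative assumptions:

```python
## Hypothetical sketch of a SQLAlchemy connection with the Denodo dialect;
## the URL scheme and all connection details are illustrative assumptions.

def build_sqlalchemy_url(uid, pwd, host, port, database):
    """Assemble a SQLAlchemy connection URL for the Denodo dialect."""
    return "denodo://%s:%s@%s:%s/%s" % (uid, pwd, host, port, database)

if __name__ == "__main__":
    from sqlalchemy import create_engine, text  ## plus denodo-sqlalchemy

    engine = create_engine(build_sqlalchemy_url(
        "tpcds_usr", "tpcds_usr", "localhost", 9996, "distributed_tpcds"))
    with engine.connect() as cnxn:
        print(cnxn.execute(text("SELECT 1")).fetchall())
```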
To be able to establish the connection between Denodo and Python through ODBC, you must install the ODBC driver on the machine where your Python interpreter is installed.
If your machine runs Windows, follow the instructions here.
If your machine runs Linux or other Unix OS, follow the instructions here.
In the following example, we connect to a Denodo server using the pyodbc library and by specifying a DSN (Data Source Name).
When choosing between a DSN-based and a DSN-less connection, take into account that a DSN-based connection keeps the connection script cleaner, as it hides connection parameters and complexity from the final users. However, DSN definition, maintenance, and modification are typically sysadmin tasks, which means that their use may slow down development iterations in the early phases of a project.
## Script name: pyodbc-dsn-connection.py
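A minimal sketch of such a DSN-based connection is shown below; the DSN name `DenodoODBC`, the credentials, and the sample query are illustrative assumptions:

```python
## Hypothetical sketch of a DSN-based pyodbc connection; the DSN name,
## credentials, and query are illustrative assumptions.

def build_dsn_connection_string(dsn, uid, pwd):
    """Assemble an ODBC connection string referencing a pre-configured DSN."""
    return "DSN=%s;UID=%s;PWD=%s" % (dsn, uid, pwd)

if __name__ == "__main__":
    import pyodbc  ## requires the Denodo ODBC driver and a configured DSN

    cnxn_str = build_dsn_connection_string("DenodoODBC",
                                           "tpcds_usr", "tpcds_usr")
    cnxn = pyodbc.connect(cnxn_str)
    cur = cnxn.cursor()
    cur.execute("SELECT * FROM store_returns")
    results = cur.fetchall()
    print(len(results))
    cnxn.close()
```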
In the following example, we connect to a Denodo server with a DSN-less connection using the pyodbc library.
## Script name: pyodbc-dsnless-connection.py
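A minimal sketch of a DSN-less connection follows; the driver name, host, port (9996 is Denodo's default ODBC port), database, and credentials are illustrative assumptions:

```python
## Hypothetical sketch of a DSN-less pyodbc connection; driver name, host,
## port, database, and credentials are illustrative assumptions.

def build_dsnless_connection_string(driver, server, port, database, uid, pwd):
    """Assemble a full ODBC connection string, with no DSN involved."""
    return ("DRIVER={%s};SERVER=%s;PORT=%s;DATABASE=%s;UID=%s;PWD=%s"
            % (driver, server, port, database, uid, pwd))

if __name__ == "__main__":
    import pyodbc  ## requires the Denodo ODBC driver

    cnxn_str = build_dsnless_connection_string(
        "DenodoODBC Unicode(x64)", "localhost", 9996,
        "distributed_tpcds", "tpcds_usr", "tpcds_usr")
    cnxn = pyodbc.connect(cnxn_str)
    cur = cnxn.cursor()
    cur.execute("SELECT * FROM store_returns")
    results = cur.fetchall()
    cnxn.close()
```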
In the following example, we connect to a Denodo server using the turbodbc library in DSN-less mode. No DSN-based example is offered here, as it would not add anything new with respect to the examples above. Note that the additional feature of fetching the results into numpy arrays requires turbodbc 3.3.0.
## Script name: turbodbc-dsnless-connection.py

from turbodbc import connect

## Connection details below are reconstructed; adjust them to your environment
denodoserver_name = "localhost"
denodoserver_odbc_port = "9996"  ## Default port for ODBC connections
denodoserver_database = "distributed_tpcds"
denodoserver_uid = "tpcds_usr"
denodoserver_pwd = "tpcds_usr"

cnxn = connect(driver="DenodoODBC",
               server=denodoserver_name,
               port=denodoserver_odbc_port,
               database=denodoserver_database,
               uid=denodoserver_uid,
               pwd=denodoserver_pwd)
cur = cnxn.cursor()
cur.execute("SELECT * FROM store_returns")

## An interesting feature of turbodbc is that it can return the data
## already in numpy MaskedArrays, one per column. This will:
## 1. speed up the data fetch process.
## 2. dramatically speed up further calculations on the input data, in
##    particular the transformation to a pandas DataFrame.
## 3. lower by a fairly large factor the memory footprint of the loaded
##    Python object
results_array = cur.fetchallnumpy()
# >> type(results_array)
# collections.OrderedDict
# >> results_array["sr_returned_date_sk"]
# masked_array(data=[2451794, 2452608, 2452694, ..., 2452388, 2452548,
#                    2452649],
#              mask=[False, False, False, ..., False, False, False],
#        fill_value=999999,
#             dtype=int64)
# >> type(results_array["sr_returned_date_sk"])
# numpy.ma.core.MaskedArray
In the following example, we use psycopg2 to fetch data from a Denodo virtual database. The main benefits of using psycopg2 are that you don't need to install the Denodo ODBC driver on the target machine and that you do not rely on an external DSN.
## Script name: psycopg2-connection.py
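A minimal sketch of what such a psycopg2 connection might look like follows; the host, port, database, and credentials are illustrative assumptions:

```python
## Hypothetical sketch of a psycopg2 connection to Denodo; host, port,
## database, and credentials are illustrative assumptions.

def build_psycopg2_kwargs(host, port, database, uid, pwd):
    """Collect the keyword arguments expected by psycopg2.connect."""
    return {"host": host, "port": port, "dbname": database,
            "user": uid, "password": pwd}

if __name__ == "__main__":
    import psycopg2  ## no Denodo ODBC driver needed

    cnxn = psycopg2.connect(**build_psycopg2_kwargs(
        "localhost", 9996, "distributed_tpcds", "tpcds_usr", "tpcds_usr"))
    cur = cnxn.cursor()
    cur.execute("SELECT * FROM store_returns")
    results = cur.fetchall()
    cnxn.close()
```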
To be able to establish the connection between Denodo and Python through JDBC, the JDBC driver must be accessible from the machine where the Python interpreter is installed, and a Java runtime must be installed on the same machine.
In the following example, we connect to a Denodo server using the jaydebeapi library.
## Script name: jaydebeapi-connection.py

import jaydebeapi

## Connection details below are reconstructed; adjust them to your environment
denodoserver_name = "localhost"
denodoserver_jdbc_port = "9999"  # This is the standard port for jdbc connections
denodoserver_database = "distributed_tpcds"
denodoserver_uid = "tpcds_usr"
denodoserver_pwd = "tpcds_usr"
denododriver_path = "/opt/denodo/8.0/tools/client-drivers/jdbc/denodo-vdp-jdbcdriver.jar"

conn_uri = "jdbc:vdb://%s:%s/%s" % (
    denodoserver_name, denodoserver_jdbc_port, denodoserver_database)
cnxn = jaydebeapi.connect("com.denodo.vdp.jdbc.Driver",
                          conn_uri,
                          driver_args={"user": denodoserver_uid,
                                       "password": denodoserver_pwd},
                          jars=denododriver_path)
cur = cnxn.cursor()
In this section, we will see how the results imported in Python can be transformed into a pandas dataframe, the de facto reference data structure for data transformation, cleaning, and analysis in Python. Note that the types_dic portion is subject to change depending on the types present in the result set from the source.
## Script name: results-to-pandas-dataframe.py
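A minimal sketch of the transformation is shown below; the column names and the types_dic mapping are illustrative, and, as noted above, depend on the types in your result set:

```python
## Hypothetical sketch of turning a DB-API result set into a pandas
## dataframe; the column names and types_dic mapping are illustrative.

def rows_to_records(description, rows):
    """Pair the column names from cursor.description with the fetched rows."""
    colnames = [col[0] for col in description]
    return colnames, [list(row) for row in rows]

if __name__ == "__main__":
    import pandas as pd

    ## With a live cursor this would be:
    ## colnames, records = rows_to_records(cur.description, cur.fetchall())
    colnames, records = rows_to_records(
        [("sr_returned_date_sk",), ("sr_return_amt",)],
        [(2451794, 10.5), (2452608, 3.2)])
    types_dic = {"sr_returned_date_sk": "int64", "sr_return_amt": "float64"}
    df = pd.DataFrame(records, columns=colnames).astype(types_dic)
    print(df.dtypes)
```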
You can perform the cursor creation, query execution, fetching, and transformation to a pandas dataframe in a single instruction: pd.read_sql.
To understand the usage of this function, compare this code snippet:

cnxn = dbdriver.connect(connection_string=cnxn_str)
df = pd.read_sql(query, cnxn)

With the approach used above:

cnxn = dbdriver.connect(connection_string=cnxn_str)
cur = cnxn.cursor()
cur.execute(query)
results = cur.fetchall()
df = pd.DataFrame(results, columns=[col[0] for col in cur.description])
While this clearly saves several lines of code, we should keep in mind that some optimization options may be lost by using this instruction directly. For instance, read_sql appears to always use fetchall, whereas we saw that turbodbc's fetchallnumpy can lead to better fetch times and memory usage.
In this article we have seen how a data scientist or engineer may start working with Denodo as a data source and get data from Denodo views into their Python computing environment.
In particular, we have shown how to establish the connection to a Denodo server, how to fetch data, and how to transform this data into a pandas dataframe, the starting point for further analysis.
From our internal testing, pyodbc and turbodbc show the best performance in data fetching and in transformation time to a pandas dataframe, with turbodbc consistently slightly faster. Interestingly, when the numpy optimization is used, turbodbc gets a dramatic performance boost in the transformation to a pandas dataframe, and the total memory needed by the data-fetching script is greatly reduced.
Turbodbc, which describes itself as "turbocharged database access for data scientists", offers the best performance but is less mature than pyodbc, so the decision between the two may depend on other factors as well.
Finally, the use of psycopg2 or SQLAlchemy is attractive when it is preferable to avoid installing the Denodo ODBC driver on the client machine.
In all the above scripts, the password used to connect to the Denodo server is in clear text. This may be acceptable for testing purposes, but it is often forbidden in more security-constrained installations, such as production environments.
We can use an encryption mechanism to avoid this situation; one possibility is the pycryptodome package.
Provided that the pycryptodome package is installed, the following code encrypts and saves the password for the user tpcds_usr:
from Crypto.Cipher import AES
And the following code decrypts the password with the given key. The pwd variable can then be used in the dbdriver.connect function.
## Read the key and the encrypted password, then decrypt it
A test Python environment may be set up with the following steps, which will configure a working environment under the conditions listed in the Software Versions section.
Installation of the required system packages:
sudo apt install g++ python3-dev unixodbc unixodbc-dev libboost-all-dev openjdk-14-jdk virtualenv
Initialization of a Python virtual environment:
mkdir /opt/pyvenv/ |
Activation and preparation of the Python virtual environment with the required modules:
cd /opt/pyvenv/denodo-to-python |
Where requirements.txt is the environment file, which may be copied and pasted from here:
appdirs==1.4.3
For both the operating-system packages (installed via apt) and the Python packages (installed via pip), you will need them all only if you want to access Denodo views with all the methods exposed in this article. More realistically, you will choose one access method and install the minimum set of packages it needs: for example, openjdk is only needed for jaydebeapi, while unixodbc and unixodbc-dev are only needed for the ODBC-based methods.
Python | 3.8.5 [GCC 9.3.0] on linux
Denodo & Denodo JDBC/ODBC drivers | 8.0
OS | Ubuntu 20.04.1 LTS