Introduction
Python is one of the most widely used programming languages in the Data Science community for its flat learning curve and a very rich analytical stack, that spans from data preparation (e.g. pandas, numpy), through machine learning (e.g. scikit-learn) to data visualization (e.g. bokeh, seaborn, Apache Superset). Denodo can get its spot in the Data Scientist’s workflow as a powerful data preparation and governance tool, that offers the most optimized and widest access both to internal and external data sources.
For this reason, in this article we will discuss the different options available for establishing the connection and importing data from Denodo to Python and how we can transform them into pandas dataframes, the data structure that represents the starting point of data exploration, analysis and visualization as well as training of machine learning algorithms.
As we are addressing pandas as a computing environment, we are limiting this analysis to single-machine environments, with data batches that must be loadable in memory.
Connection Methods
There are several Python libraries that we can use to establish the connection between Python and Denodo, we can see an overview of the main ones in the following table:
Library |
Denodo Interface |
Comment |
ODBC/libpq |
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. |
|
ODBC/libpq |
psycopg2 is a popular adapter for PostgreSQL that can be also used for Denodo. |
|
ODBC |
turbodbc developers explicitly target Data Scientists. This library has built-in numpy and Apache Arrow optimizations |
|
ODBC |
The most popular and mature ODBC-based library to connect to relational databases. |
|
JDBC |
The main JDBC-based python library, it makes use of JPype to bridge Python and Java code |
To all these libraries, Denodo is exposed as a standard relational database. All these libraries comply with the Python standard DB-API v2.0.
Recommendations
The easiest way to connect to the Denodo Platform from Python is to use the Denodo Dialect for SQLAlchemy or the psycopg2 driver. We recommend the use of the Denodo Dialect for SQLAlchemy for greater convenience, and due to the higher-level set of functionality it offers and the fact that it makes use of psycopg2 underneath. The SQLAlchemy option also makes it easy to work with pandas dataframes.
However, if you need to establish a connection to Denodo using Kerberos or OAuth2 authentication, this is not supported by psycopg2 so you will need to switch to one of the options that make use of the Denodo ODBC driver. In this case we recommend the use of the turbodbc library. This option might also provide a minor performance improvement in some scenarios and, if you intend to use pandas dataframes, using the fetchallnumpy option may improve performance a bit more.
The use of pyodbc is less recommended because performance is worse in most scenarios than that obtained using the Denodo Dialect for SQLAlchemy, psycopg2 directly, or turbodbc. Finally, jaydebeapi shows some performance issues in this Python - Denodo configuration that make its use not recommended for connecting to Denodo from Python unless the scenario requires it.
Connection Examples
Prerequisites
If you are using the Denodo Dialect for SQLAlchemy or the psycopg2 adapter directly you do not need to install any Denodo drivers. You just need to install the necessary libraries through a package manager and import them from your python code.
If you are using pyodbc or turbodbc you must install the Denodo ODBC driver in the machine where your Python interpreter is installed in order to be able to establish the connection between Denodo and Python through ODBC.
- If your machine runs Windows, follow the instructions here.
- If your machine runs Linux or other Unix OS, follow the instructions here.
In case you use jaydebeapi you will need to place the Denodo JDBC driver in the machine where your Python interpreter is installed in order to be able to establish the connection between Denodo and Python through JDBC.
Example Code: Denodo Dialect for SQLAlchemy
In the following example we use Denodo Dialect for SQLAlchemy to fetch data from a Denodo virtual database. The main benefit of using Denodo Dialect for SQLAlchemy is that you don’t need to install the Denodo ODBC driver in the target machine and you do not rely on an external DSN. Also, if you want to use SQLAlchemy-based tools like Apache Superset you need this dialect.
Note that the Denodo Dialect for SQLAlchemy will transparently install and make use of psycopg2 as a dependency.
## Script name: sqlalchemy-denodo-connection.py engine=dbdriver.create_engine(cnxn_str) result_set=engine.execute(query) |
Example Code: psycopg2
In the following example we use psycopg2 (directly) to fetch data from a Denodo virtual database. As with the SQLAlchemy dialect, a benefit of using psycopg2 directly is that you don’t need to install the Denodo ODBC driver in the target machine and you do not rely on an external DSN. But in this case you will not benefit from the features SQLAlchemy offers.
## Script name: psycopg2-connection.py |
Example Code: DSN-less turbodbc
In the following example, we connect to a Denodo server using the turbodbc library in DSN-less mode. Note that the additional feature of putting the results into a numpy array requires turbodbc 3.3.0.
## Script name: turbodbc-dsnless-connection.py ## Default port for ODBC connections
denodoserver_uid = "tpcds_usr" denodoserver_pwd = "tpcds_usr" ## An interesting feature of turbodbc is that it can return the data ## already in numpy MaskedArray, one per column. This will: ## 1. speed up the data fetch process. ## 2. dramatically speed up further calculations on the input data, in ## particular the transformation to a pandas DataFrame. ## 3. lower by a fairly large factor the memory footprint of the loaded Python ## object results_array = cur.fetchallnumpy() # >> type(results_array) # collections.OrderedDict # >> results["sr_returned_date_sk"] # masked_array(data=[2451794, 2452608, 2452694, ..., 2452388, 2452548, # 2452649], # mask=[False, False, False, ..., False, False, False], # fill_value=999999, # dtype=int64) # >> type(results["sr_returned_date_sk"]) # numpy.ma.core.MaskedArray |
Example Code: pyodbc with DSN
In the following example, we connect to a Denodo server using the pyodbc library and by specifying a DSN (Data Source Name).
When choosing to use a DSN-based or a DSN-less connection, take into account that by using a DSN-based connection we keep the connection script cleaner as we hide connection parameters and complexity to the final users. However DSN definition, maintenance and modification are typically a sysadmin task, meaning that their use may slow down development iterations in early phases of the projects.
## Script name: pyodbc-dsn-connection.py |
Example Code: DSN-less pyodbc
In the following example, we connect to a Denodo server with a DSN-less connection using the pyodbc library.
## Script name: pyodbc-dsnless-connection.py |
Example Code: jaydebeapi
In the following example, we connect to a Denodo server using the jaydebeapi library and the Denodo JDBC driver.
To be able to establish the connection between Denodo and Python the JDBC driver must be accessible from the machine where the Python interpreter is installed and a Java interpreter must be installed in the same machine.
## script name: jaydebeapi-connection.py # This is the standard port for jdbc connections denodoserver_database = "distributed_tpcds" denodoserver_uid = "tpcds_usr" denodoserver_pwd = "tpcds_usr" denododriver_path = "/opt/denodo/9.0/tools/client-drivers/jdbc/denodo-vdp-jdbcdriver.jar" |
ODBC/JDBC with Kerberos Authentication
Prerequisites
The Virtual DataPort Server must be configured to accept Kerberos Authentication. This setup is explained step-by-step in the Kerberos Authentication section of the documentation.
Also if you are using Denodo 8 without any update or Denodo 7, to be able to connect with Kerberos to a database via ODBC you have to enable the option Use Kerberos/OAuth authentication for ODBC connections, please see here for instructions and some considerations about this configuration option. If you are on update 20210209 of Denodo 8 or later, you are not concerned by this.
In the case of JDBC, connection details and configuration are explained in this documentation page.
Assuming here that the data fetching code is to be run in a Linux machine, you will need to have a ticket cache with the credentials used to connect to Virtual DataPort.
You can check whether the ticket cache already exists in your machine.
$ klist |
This output indicates that the cache does not exist yet.
$ klist |
This output indicates that the cache already exists in file /tmp/krb5cc_1000, for the user principal HTTP/john@LOCALENV.COM
The ticket cache can be created with the command:
$ kinit -k -t john.keytab HTTP/john@LOCALENV.COM |
With john.keytab being the keytab file and HTTP/john@LOCALENV.COM the user principal.
Example Code: DSN-less turbodbc with Kerberos
In this code snippet, we are connecting to the Virtual DataPort Server with a Kerberos user principal, without specifying user and password. The only modification to the previous DSN-less connection code consists in passing a string to the connect method and that we are specifying UseKerberos=1. Make sure you review the section of the relevant documentation section on ODBC DSN-less connection with Kerberos.
## Script name: turbodbc-dsnless-connection-kerberos.py |
Example Code: jaydebeapi with Kerberos
The only difference with the code snippet seen before explaining the use of jaydebeapi without Kerberos is that we are now passing the Kerberos related parameters (useKerberos, useTicketCache and ticketCache) instead of user and password in the connection uri.
To know more about access to Denodo via JDBC with Kerberos, please check this documentation page.
## script name: jaydebeapi-connection-kerberos.py
|
Getting the data into a pandas dataframe
In this section, we will see how the results imported in python can be transformed to a Pandas dataframe, the de-facto reference data structure to perform data transformation, cleaning and analysis in Python.
Note that the types_dic portion is subject to change depending on the types present in the result set from the source.
## Script name: results-to-pandas-dataframe.py |
You can do the cursor creation, query execution, fetching and transform to pandas dataframe in a single instruction, that is pd.read_sql.
To understand the usage of this function, compare this code snippet.
cnxn = dbdriver.connect(connection_string=cnxn_str) |
With the approach used above:
cnxn = dbdriver.connect(connection_string=cnxn_str) cur = cnxn.cursor()
|
By default pandas performs a fetchall when using read_sql but offers the possibility of reading in chunks. Here is an example of SQLAlchemy with pandas reading in chunks
## Script name: sqlalchemy-denodo-connection.py ## Importing pandas import pandas as pd engine=dbdriver.create_engine(cnxn_str) connection = engine.connect().execution_options( stream_results=True, max_row_buffer=1000) for df in pd.read_sql(query, connection.connection, chunksize=1000): display(df) connection.close() |
Appendix
Password Encryption
In many of the above scripts we used a plain password to connect to the Denodo Server. This may be acceptable for development and testing purposes but it is often forbidden in more security-sensitive installations such as production environments.
One of the possible workarounds for this is using the pycryptodome package. Provided that pycryptodome package is installed, the following code encrypts and saves the password for the “tpcds_usr” user:
from Crypto.Cipher import AES |
And the following code decrypts the password with the given key. The pwd variable is to be used in the dbdriver.connect function
## Read the key and the encrypted password then decrypt it |
Python Environment
A test Python environment may be set up with the following steps, that will configure a working environment when deployed in the conditions listed in section Software Versions.
Installation of the required system packages:
sudo apt install g++ python3-dev unixodbc unixodbc-dev libboost-all-dev openjdk-14-jdk virtualenv |
Initialization of a Python virtual environment:
mkdir /opt/pyvenv/ |
Activation and preparation of the Python virtual environment with the required modules:
cd /opt/pyvenv/denodo-to-python |
Where requirements.txt is the environment file, that may copied and pasted from here:
asttokens==2.4.1 comm==0.2.1 debugpy==1.8.1 decorator==5.1.1 denodo-sqlalchemy==20230818 exceptiongroup==1.2.0 executing==2.0.1 greenlet==3.0.3 importlib-metadata==7.0.1 ipykernel==6.29.3 ipython==8.22.1 JayDeBeApi==1.2.3 jedi==0.19.1 JPype1==1.5.0 jupyter_client==8.6.0 jupyter_core==5.7.1 matplotlib-inline==0.1.6 nest_asyncio==1.6.0 numpy==1.26.4 packaging==23.2 pandas==2.2.1 parso==0.8.3 pexpect==4.9.0 pickleshare==0.7.5 pip==24.0 platformdirs==4.2.0 prompt-toolkit==3.0.42 psutil==5.9.8 psycopg2-binary==2.9.9 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow==15.0.0 pycryptodome==3.20.0 Pygments==2.17.2 pyodbc==5.1.0 python-dateutil==2.8.2 pytz==2024.1 pyzmq==25.1.2 setuptools==68.2.2 six==1.16.0 SQLAlchemy==1.4.49 stack-data==0.6.2 tornado==6.4 traitlets==5.14.1 turbodbc==4.11.0 typing_extensions==4.10.0 tzdata==2024.1 wcwidth==0.2.13 wheel==0.41.2 zipp==3.17.0 |
For both the operating system packages, installed via apt, and the python packages, installed via pip , of course you will need them all only when you want to access Denodo views with all the methods exposed in this article. More realistically you will choose your access method and install the minimum set of packages that allows its usage: for example openjdk is only needed for jaydebeapi, while unixodbc and unixodbc-dev are only needed for ODBC-based methods.
Tested Software Versions
Python |
3.12.2 |
Denodo & Denodo JDBC/ODBC drivers |
9.0 |
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.