Introduction

This document provides a step-by-step guide for connecting to Denodo Virtual DataPort from Amazon SageMaker using SageMaker notebooks, so users can consume governed, high-performance data directly from the ML development environment. For data scientists, Denodo reduces friction in data preparation by unifying and virtualizing disparate data sources, letting them spend less time on ETL plumbing and more time solving the problems that matter.

Amazon SageMaker is a fully managed cloud service that provides developers and data scientists with a suite of tools for machine learning workflows. For these workflows, Denodo accelerates the AI development and training lifecycle by providing governed access to the entire enterprise data landscape, delivering faster business value from tailored ML and AI experiences.

Creating a Notebook From SageMaker AI

An Amazon SageMaker notebook instance is a specialized Amazon EC2 compute resource configured to run the Jupyter Notebook application. These instances provide a managed environment for data science and machine learning operations, including data preprocessing, model architecture development, and deployment orchestration.

As a first step, we are going to create a notebook instance within SageMaker AI.

  • Navigate to Amazon SageMaker AI > Notebook Instances.
  • Click on Create notebook instance.

  • Name the instance, for example DenodoInstance.
  • Choose the required instance type from the available options. For this example, an ml.t3.medium instance is used.

  • You may leave the options in the Additional configuration section as they are, since they are optional.
  • Next, configure the Permissions and encryption for the notebook instance.
  • Choose the IAM Role that has permissions to access other services such as S3 and EC2. Note that you can create a role yourself or let SageMaker create one for you with the AmazonSageMakerFullAccess IAM policy attached.

  • Depending on your requirements, decide whether to give root access to the users accessing the notebook.
  • Scroll down to configure Network options. SageMaker allows you to configure the instance within a VPC. This is useful when the notebook instance must be accessible only from within the VPC. Note that if the VPC is configured to access the internet, the instance inherits that access. For this example, we choose No VPC, and SageMaker provides internet access directly to the instance.

  • The Git repositories section allows the instance to start Jupyter in the given repository.
  • Click on Create notebook instance to start the instance creation.
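
The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are already configured. The role ARN is a placeholder, and the helper names build_notebook_request and create_notebook are hypothetical. It mirrors the choices made in this example: an ml.t3.medium instance with No VPC and direct internet access.

```python
def build_notebook_request(name, role_arn, instance_type="ml.t3.medium",
                           root_access=True):
    """Assemble keyword arguments for SageMaker's CreateNotebookInstance call."""
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        # "Enabled"/"Disabled" mirrors the root access choice in the console.
        "RootAccess": "Enabled" if root_access else "Disabled",
        # No VPC: SageMaker provides internet access directly to the instance.
        "DirectInternetAccess": "Enabled",
    }


def create_notebook(request):
    """Send the request to SageMaker. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request can be built without boto3

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**request)


request = build_notebook_request(
    "DenodoInstance",
    # Placeholder ARN: replace it with a role that has the permissions you need.
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
)
print(request)
# create_notebook(request)  # uncomment to actually create the instance
```

Keeping the request builder separate from the API call makes it possible to review the settings before any AWS resource is created.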

Access Jupyter / JupyterLab

Amazon SageMaker creates Jupyter Notebook instances from which we can create notebooks and store them. AWS offers several pre-built notebooks for Python libraries for AI/ML workloads. For this example, we will create a simple notebook and install the required Python libraries for establishing a connection to Denodo.

This document lists the required libraries, which we must install from the Terminal of the instance. Libraries can also be installed from within a notebook, but the Terminal option gives users more control and offers persistence.

From the Jupyter page, create a new notebook by clicking File > New > Notebook.

Connect Denodo to SageMaker using Arrow Flight SQL

As of Denodo 9.1, Virtual DataPort (VDP) includes an Arrow Flight SQL interface for applications, such as SageMaker notebooks, that can leverage Apache Arrow. Some key benefits for users over legacy industry-standard drivers are:

  • Columnar Data Transfers
  • More Efficient Memory Usage
  • Parallel Data Transfers

Although direct connections via the ADBC Flight SQL driver are also fully supported, this guide focuses on the Denodo dialect for SQLAlchemy. Using SQLAlchemy over the base driver gives users essential abstractions that simplify VQL syntax and execution.

For more information on how to use SQLAlchemy with the Arrow driver, please refer to our dedicated guides on building Python notebooks with Denodo.

Installing Dependencies

Once a notebook instance has been created, the next step is to install the denodo-sqlalchemy Python library for establishing a connection to Denodo.

To do so, navigate to File > New > Terminal and run the following command:

pip install denodo-sqlalchemy[flightsql]

When installed from the Terminal, the library persists across restarts, whereas libraries installed through other options are lost if the notebook is stopped or restarted. Once installed, you can run pip list to retrieve the list of libraries installed on the instance.

Additionally, for this guide we will use JupySQL to simplify the SQL calls.

pip install jupysql
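
As an alternative to scanning the pip list output, the installation can be checked from a notebook cell. The following is a minimal sketch using only the Python standard library; installed_versions is a hypothetical helper name, and the distribution names are the ones used in the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["denodo-sqlalchemy", "jupysql"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package is reported as not installed, check that the notebook kernel uses the same Python environment as the Terminal where pip was run.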

Connecting to Denodo

Before we can run queries against Denodo from Python, we need to initialize a connection object. We will use the Denodo dialect for SQLAlchemy with the Flight SQL driver. Unlike traditional drivers, ADBC is designed for high-performance columnar data transfer, making it the recommended choice for machine learning and AI workloads in your SageMaker environment.

To build a connection, we first define a connection string that tells the driver how to locate Denodo. The connection string has the following format:

denodo+flightsql://<username>:<password>@<host>:<port[9994]>/<database>

By default, the connection uses port 9994 in Denodo.

import sqlalchemy
import denodo.sqlalchemy as denodo_sqlalchemy

# Replace the placeholders with your own values before running.
uri = "denodo+flightsql://<username>:<password>@<host>:<port[9994]>/<database>"

engine = sqlalchemy.create_engine(uri)
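
Passwords often contain characters such as @, : or / that have a special meaning inside a URI, so they should be percent-encoded before being placed in the connection string. The following is a minimal sketch of one way to assemble the URI from its parts. build_denodo_uri is a hypothetical helper, and the user, password, host and database values are made up for illustration.

```python
from urllib.parse import quote_plus


def build_denodo_uri(username, password, host, database, port=9994):
    """Build a denodo+flightsql URI, percent-encoding the credentials."""
    return (
        f"denodo+flightsql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# All values below are made up for illustration.
uri = build_denodo_uri("ml_user", "p@ss:w/rd", "denodo.example.com", "sales")
print(uri)
# prints denodo+flightsql://ml_user:p%40ss%3Aw%2Frd@denodo.example.com:9994/sales
```

In practice, the password would typically be read from an environment variable or a secrets store rather than typed into the notebook.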

For more information on handling headers, tokens, and configuration properties for the driver, refer to our official documentation.

Executing Queries with the Denodo Dialect for SQLAlchemy

Once the SQLAlchemy engine is initialized, Jupyter magic commands let us move from writing SQL inside Python strings to an interactive SQL experience using JupySQL. This extension allows you to execute raw SQL directly in notebook cells while keeping the high-performance backing of the Arrow Flight SQL driver.

Registering our connection object with JupySQL eliminates boilerplate code and enables a "SQL-first" workflow for data exploration and rapid prototyping. To do this, execute the following commands.

%load_ext sql 

Then pass the engine object to %sql so it knows where to send subsequent queries:

%sql engine

Now you can run queries against Denodo using the %sql magic.
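
When query results are needed in ordinary Python code rather than through the magics, the same engine can be used directly. The following is a minimal sketch, assuming the engine created earlier. preview_rows is a hypothetical helper, and the view name in the commented call is made up. It relies on SQLAlchemy's Connection.exec_driver_sql, which passes the statement string to the driver as is.

```python
def preview_rows(engine, sql, limit=5):
    """Run a statement and return up to `limit` rows as dictionaries."""
    with engine.connect() as conn:
        result = conn.exec_driver_sql(sql)
        columns = list(result.keys())
        return [dict(zip(columns, row)) for row in result.fetchmany(limit)]


# Hypothetical view name; uncomment once `engine` points at your Denodo server.
# for row in preview_rows(engine, "SELECT * FROM my_view"):
#     print(row)
```

Returning plain dictionaries keeps the helper free of extra dependencies; the rows can be handed to a DataFrame library afterwards if needed.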

Summary

Integrating Denodo with SageMaker via Arrow Flight SQL and SQLAlchemy lets data scientists leave behind the complexity of traditional data plumbing and gives them a streamlined, high-performance interface to the entire enterprise. With SQLAlchemy abstractions and JupySQL cell magics, users have the foundation for scalable, data-driven AI/ML workflows that turn raw information into competitive intelligence faster.

References

Using Notebooks for Data Science with Denodo

How to connect to Denodo from Python - a starter for Data Scientists

Denodo in Data Science and Machine Learning Projects

Access using Flight SQL

Disclaimer

The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.

For an official guide to supported features, please refer to the User Manuals. For questions on critical systems or complex environments, we recommend that you contact your Denodo Customer Success Manager.