Introduction

This document shows how to configure Databricks as an MPP and Cache database. Denodo 7.0 Update 20181011 or later is required for this integration.

Configure Databricks

Create a Databricks instance

  • In Azure only: create an Azure Databricks instance using the Premium tier (otherwise there will be no JDBC access).
  • Create a cluster.
  • Create a JDBC notebook to execute queries.
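For the last step, any small Scala cell attached to the cluster is enough to confirm that queries run; a minimal sketch (the queries are illustrative):

// Quick sanity check from a Scala notebook cell attached to the cluster
display(spark.sql("SHOW DATABASES"))
spark.sql("SELECT 1 AS connectivity_test").show()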

Configure your storage in Azure/AWS

Since Databricks runs on AWS/Azure, it uses their storage systems. Therefore, instead of HDFS, the Databricks File System (DBFS) uses S3 in AWS, and Azure Data Lake (ADL) or Azure Blob Storage (WASB) in Azure.

To enable the interaction with those systems, we will need to perform two steps:

  • Configure the storage folder.
  • Configure authentication, in the case of Azure.

Designate a folder for storage and mount it in the Databricks file system

We need a path that matches both on DBFS and in the S3/ADL/WASB location that Denodo will use to upload the content. Since DBFS only allows mounting folders under /mnt, we need to perform the following steps:

  1. In your bucket (S3) or blob container (Azure), create the folder /mnt/denodo_mppcache.
  2. In Databricks, mount the folder /mnt/denodo_mppcache from AWS S3 or Azure ADL/WASB into /mnt/denodo_mppcache in DBFS.
  • For AWS (more details here for S3): create a Scala notebook and mount S3 with

val AccessKey = "<aws-access-key>"
// Encode the Secret Key, as it can contain "/"
val SecretKey = "<aws-secret-key>"
val EncodedSecretKey = SecretKey.replace("/", "%2F")
val AwsBucketName = "<aws-bucket-name>"
val MountName = "<destination-mount-folder-name-in-dbfs>"

dbutils.fs.mount(s"s3a://$AccessKey:$EncodedSecretKey@$AwsBucketName", s"/mnt/$MountName")
display(dbutils.fs.ls(s"/mnt/$MountName"))
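If the notebook is re-run while the folder is already mounted, dbutils.fs.mount will fail; in that case you can unmount first (a minimal sketch reusing the MountName variable above):

// Unmount first if the mount point already exists (e.g. when re-running the notebook)
dbutils.fs.unmount(s"/mnt/$MountName")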

  • For Azure (more details here for WASB): create a Scala notebook and mount WASB with

dbutils.fs.mount(
  source = "wasbs://<blob container>@<storage account>.blob.core.windows.net/mnt/denodo_mppcache",
  mountPoint = "/mnt/denodo_mppcache",
  extraConfigs = Map("fs.azure.account.key.<storage account>.blob.core.windows.net" -> "<your key>"))

display(dbutils.fs.ls("/mnt/denodo_mppcache"))
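To review what is already mounted before (re-)running these cells, you can list the current mount points; a minimal check:

// List existing DBFS mount points and their backing sources
display(dbutils.fs.mounts())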

  • For DBFS: create a directory to store the data, e.g. denodo_mpp

display(dbutils.fs.ls("dbfs:/"))
dbutils.fs.mkdirs("/denodo_mpp/")
display(dbutils.fs.ls("dbfs:/"))
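To confirm the new directory is writable, you can drop a small test file into it; a minimal check (file name and content are illustrative):

// Write a small test file into the new DBFS directory and list it
dbutils.fs.put("/denodo_mpp/write_test.txt", "hello from denodo_mpp", true)
display(dbutils.fs.ls("/denodo_mpp/"))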

Depending on the installation, it may also be necessary to set the following configuration property:

spark.conf.set("Fs.azure.account.key.denodosecache.blob.core.windows.net", "Your key")

Obtain the Authentication Tokens

Azure Authentication Tokens

Since Azure uses OAuth 2.0 for authentication, you will need to provide some details in the core-site.xml file of the Hadoop client. Follow the instructions in this link 
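For reference, on the Databricks side the same OAuth 2.0 service principal can be used to mount an ADLS Gen1 folder directly; a minimal sketch, where the application (client) id, client secret, tenant id and account name are all placeholders for your environment:

// Mount an ADLS Gen1 folder using OAuth 2.0 service principal credentials (placeholders)
val adlsConfigs = Map(
  "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
  "dfs.adls.oauth2.client.id" -> "<application-id>",
  "dfs.adls.oauth2.credential" -> "<client-secret>",
  "dfs.adls.oauth2.refresh.url" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

dbutils.fs.mount(
  source = "adl://<adls-account>.azuredatalakestore.net/mnt/denodo_mppcache",
  mountPoint = "/mnt/denodo_mppcache",
  extraConfigs = adlsConfigs)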

AWS Access Key/Secret Key

To get your Access Key ID and Secret Access Key, follow these steps:

  1. Open the IAM console.
  2. From the navigation menu, click Users.
  3. Select your IAM user name.
  4. Click User Actions, and then click Manage Access Keys.
  5. Click Create Access Key.
  6. Your keys will look something like this:
  • Access key ID example: AKIAIOSFODNN7EXAMPLE
  • Secret access key example: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  7. Click Download Credentials, and store the keys in a secure location.
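Rather than hard-coding these keys in the mount notebook shown earlier, they can be stored in a Databricks secret scope and read with dbutils.secrets; a minimal sketch, where the scope and key names (denodo, aws-access-key, aws-secret-key) are illustrative:

// Read the AWS credentials from a Databricks secret scope instead of hard-coding them
val AccessKey = dbutils.secrets.get(scope = "denodo", key = "aws-access-key")
val SecretKey = dbutils.secrets.get(scope = "denodo", key = "aws-secret-key")
val EncodedSecretKey = SecretKey.replace("/", "%2F")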

Configure the Hadoop client in the Denodo Server

IMPORTANT: Starting with Denodo update 20190903, the Databricks client is supported, which simplifies the connectivity (see the section Configure the Databricks client in the Denodo Server below).

  1. Configure your JAVA_HOME.
  • IMPORTANT: You must use a path with no blank spaces.
  2. Download Hadoop 3.0.0 and unzip it into a path that will be our HADOOP_HOME (for example, <DENODO_HOME>/lib-external/hadoop-client/).
  • IMPORTANT: Use a path with no blank spaces.
  3. If you are on Windows, you will also need the Hadoop Windows binaries. You can download them from here:
  • Use the “Clone or download” link in GitHub to download the whole project, and then use the content of the folder hadoop-3.0.0/bin to overwrite the bin folder obtained in the step above.
  • If you are on Linux, ignore this step.
  4. Configure the environment variable HADOOP_HOME using that path.
  5. Add the following libraries to <HADOOP_HOME>/share/hadoop/common, depending on your system:
  • For AWS:
  1. aws-java-sdk-bundle-1.11.416.jar
  2. hadoop-aws-3.1.1.jar
  3. hadoop-common-3.1.1.jar
  • For Azure:
  1. azure-data-lake-store-sdk-2.3.1.jar
  2. azure-storage-3.1.0.jar
  3. hadoop-azure-3.1.0.jar
  4. hadoop-azure-datalake-3.1.0.jar
  6. Create a core-site.xml file and put it into <HADOOP_HOME>/etc/hadoop with the credentials information:
  • For AWS (other configurations are possible, e.g. using IAM roles):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://YOUR_BUCKET_NAME</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_AWS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_AWS_ACCESS_SECRET</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3A</value>
  </property>
</configuration>

  • For Azure (other authentication configurations are possible):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>wasb://CONTAINER_NAME@STORAGE_ACCOUNT.blob.core.windows.net</value>
  </property>
  <property>
    <name>fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net</name>
    <value>YOUR_STORAGE_ACCOUNT_KEY</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.Wasb</value>
  </property>
</configuration>

  7. Test that it works by executing a command, for example:
  • For AWS on Windows: <HADOOP_HOME>/bin/hadoop fs -put c:/tmp/file.csv s3a://<bucket name>/mnt/denodo_mppcache
  • For Azure on Windows: <HADOOP_HOME>/bin/hadoop fs -put c:/tmp/file.csv wasb://<blob container>@<storage account>.blob.core.windows.net/mnt/denodo_mppcache
  • You can also use the Test Bulk Load capability of your MPP data source, available in the Denodo Web Design Studio.
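After uploading a test file, you can also check from a Databricks notebook that it is visible through the mount created earlier; a minimal check, assuming the /mnt/denodo_mppcache mount point from the Azure example above (for the AWS mount, the file will appear under the nested mnt/denodo_mppcache path inside the mount):

// The uploaded file should appear under the mounted folder
display(dbutils.fs.ls("/mnt/denodo_mppcache"))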

 

Configure the Databricks client in the Denodo Server (Denodo update 20190903 or later)

If you are using Denodo update 20190903 or later, it is possible to use the Databricks CLI client instead of the Hadoop client.

Install

$ pip install databricks-cli

Configure

$ databricks configure --token

Databricks Host (should begin with https://): https://dbc-b1b8888e-0fc5.cloud.databricks.com

Token: dapie5709eb881cb03e84d14d66cebc78888

An example JDBC URI for the Databricks cluster (using the personal access token as the password) looks like this:

jdbc:spark://adb-5634774381433770.10.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/5634774381433770/0111-172143-35eg7n7w;AuthMech=3;UID=token;PWD=dapi732f336742f1171365c79e732d47321d-2

In the Denodo Server

In recent updates of the Denodo Platform server, the Databricks driver is already provided, so you can use it to connect to Databricks and step 1 below can be skipped. In Denodo 9, use either the databricks or the databricks-2 adapter to connect to the Databricks source; both classpaths already contain the drivers.

  1. Download the Databricks JDBC driver linked on this page. Copy the driver to the <DENODO_HOME>/lib-external/jdbc/databricks folder.
  2. In Denodo, create a new JDBC connection and select your Spark version as the adapter.
  • In the classpath, select the downloaded driver.
  • In the URI, use something like jdbc:databricks://<region>.azuredatabricks.net:443/default;transportMode=http;ssl=true;httpPath=<your http path> (in the link above you can find more details on how to find your HTTP Path).
  • For the user, use the word “token”.
  • For the password, use your actual token. You can find more details in the link above.
  3. Test that the connection is working.
  4. Go to the Read & Write tab to configure the bulk data loading options.
  5. In the Hadoop executable location, use the path of your Hadoop 3.x libraries (remember that you need the core-site.xml with your authentication), for example:
  • <DENODO_HOME>/lib-external/HadoopClient/hadoop-3.0.0/bin/hadoop.cmd
  • When using databricks-cli, if it is in the PATH you only need to use: dbfs
  6. In the HDFS URI, use your S3 or WASB route, with the previously created folder at the end, for example:
  • wasb://<blob container>@<storage account>.blob.core.windows.net/mnt/denodo_mppcache
  • For DBFS, use the following URI: dbfs://<path mounted on databricks>
  7. In the server time zone, select the time zone of the server (Etc/GMT+8, for example).
  • To obtain this value you can execute in Databricks: spark.conf.get('spark.sql.session.timeZone')

Now you are ready to use this data source as your cache. To enable the use of Spark as an MPP accelerator, enable the corresponding options in the data source and in the Optimizer settings.

Performance considerations using Databricks as cache

  • Use Databricks Simba JDBC drivers.
  • Use the Hadoop client to load the data to S3.
  • Databricks and the S3 bucket should be in the same availability zone.
  • Enable the compression of Parquet files: SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.HdfsInsertWorker.parquet.compression'='snappy'
  • Change the batch insert size to take advantage of the parallelism when generating and uploading the files (around 10-11 files).
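As a quick sanity check that compression is being applied, you can list the files Denodo writes under the cache folder from a notebook; a minimal sketch (the path assumes the mount used earlier, and Snappy-compressed files normally carry the .snappy.parquet suffix):

// List the Parquet files produced by the cache load and print their names and sizes
dbutils.fs.ls("/mnt/denodo_mppcache")
  .filter(_.name.endsWith(".parquet"))
  .foreach(f => println(s"${f.name}  ${f.size} bytes"))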

Known Limitations

Due to the limitation of Spark external tables, the processes that invalidate the cache are not transactional. Therefore, the following limitations apply:

  • “matching_rows” invalidation does not work; only adding increments or a full refresh (“all_rows”) is supported.
  • Do not query the tables during the invalidation process, as they may throw an error.
  • Do not attempt concurrent executions of queries that load and invalidate the cache in the same command.

How to Read Parquet Data

For AWS

  1. Upload a Parquet file to your S3 bucket.


  2. On a Scala notebook, mount the bucket following the previous section Designate a folder for storage and mount it in the Databricks file system.
  3. Search for the files in DBFS with the display command.


  4. Read and show the Parquet files with the command:

val data = sqlContext.read.parquet(s"dbfs:/mnt/denodo_mppcache/mnt/denodo_mppcache")

display(data)
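Once loaded, the DataFrame can also be registered as a temporary view and queried with Spark SQL; a short follow-up to the example above (the view name is illustrative):

// Register the Parquet data as a temporary view and query it with Spark SQL
data.createOrReplaceTempView("denodo_cache_data")
display(spark.sql("SELECT COUNT(*) AS num_rows FROM denodo_cache_data"))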

Denodo as a JDBC source in Databricks

  1. Load the Denodo JDBC driver from <DENODO_HOME>/tools/client-drivers/jdbc/ following Create a Workspace library.

  2. Create a connection and execute a query in a notebook following these steps (more info can be found at SQL Databases using JDBC):

val jdbcHostname = "myhost"
val jdbcPort = 9999
val jdbcDatabase = "big_data"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:vdb://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "admin")
connectionProperties.put("password", "mypass")
connectionProperties.put("driver", "com.denodo.vdp.jdbc.Driver")

val pushdown_query = "(select * from oracle_customers) oracle_customers"

val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)

Depending on the Databricks version, when following these steps you might get a “No suitable driver” error. In such scenarios, the PostgreSQL driver can be used as a workaround, since it is loaded by default (note that the connection then goes through Denodo's ODBC port, 9996 by default, as in the example below).

val jdbcHostname = "myhost"
val jdbcPort = 9996
val jdbcDatabase = "big_data"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:postgresql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "admin")
connectionProperties.put("password", "mypass")
connectionProperties.put("driver", "org.postgresql.Driver")

val pushdown_query = "(select * from oracle_customer) oracle_customer"

val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
Console.println(pushdown_query)
display(df)

Disclaimer
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.
