Databricks

To configure a Databricks data source to perform bulk data loads, follow the same process described for Spark. This is the approach recommended by Databricks.

Alternatively, you can use the Databricks API to perform bulk data loads. To do so, install the Databricks client on the host where Virtual DataPort runs by following these steps:

  1. Install Python 3.

  2. Install the Python package manager (pip). It is included by default in Python 3.4 or later. If you do not have pip installed, you can download and install it from this page.

  3. Install the Databricks client. To do this, open a command line and execute the following command:

    pip install databricks-cli
    
  4. Set up authentication for the Databricks client. You need to obtain a personal access token to configure the authentication. Then, run the following command in the command line:

    databricks configure --token
    
  5. Execute the following command to check that everything works correctly:

    dbfs ls
    
  6. If Virtual DataPort runs on Linux, go to the next step. If it runs on Windows, check if the environment variable HADOOP_HOME is defined on this computer. To see the list of environment variables, open a command line and execute SET.

    If this environment variable is already defined, copy the content of the directory <DENODO_HOME>\dll\vdp\winutils to %HADOOP_HOME%\bin.

    If the environment variable is undefined, do this:

    1. Create a directory. For example, <DENODO_HOME>\hadoop_win_utils.

    2. Create a directory called bin within the new directory. For example, <DENODO_HOME>\hadoop_win_utils\bin.

    3. Define the environment variable HADOOP_HOME to point to <DENODO_HOME>\hadoop_win_utils.

    4. Copy the content of the directory <DENODO_HOME>\dll\vdp\winutils to %HADOOP_HOME%\bin.

    This is necessary because, during the bulk load process, the libraries that generate the data files invoke %HADOOP_HOME%\bin\winutils.exe when running on Windows.

  7. In the administration tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and provide the following information:

    • Databricks executable location: enter dbfs. This is the path to the Databricks client (dbfs). Because this utility is usually in the system PATH, you do not need to enter the full path to the file.

    • DBFS URI: URI to which Virtual DataPort will upload the data file. For example: dbfs://user/databricks/warehouse

    • Server time zone: select the time zone of the Databricks server.

    • Table format: select the file format used to create the tables in Databricks.
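The manual checks in steps 5 and 6 can be automated. The following is a minimal sketch (a hypothetical helper, not part of Denodo) that verifies the dbfs client is reachable on the PATH and, on Windows, that HADOOP_HOME points to a directory whose bin subdirectory contains winutils.exe:

```python
import os
import platform
import shutil

def check_bulk_load_prereqs(env=None):
    """Return a list of configuration problems found on this host.

    Hypothetical helper mirroring the manual checks described above.
    """
    env = os.environ if env is None else env
    problems = []

    # Step 5: "dbfs ls" only works if the client is on the system PATH.
    if shutil.which("dbfs", path=env.get("PATH", "")) is None:
        problems.append("dbfs client not found on the PATH")

    # Step 6: only relevant when Virtual DataPort runs on Windows.
    if platform.system() == "Windows":
        hadoop_home = env.get("HADOOP_HOME")
        if not hadoop_home:
            problems.append("HADOOP_HOME is not defined")
        elif not os.path.isfile(
                os.path.join(hadoop_home, "bin", "winutils.exe")):
            problems.append("winutils.exe is missing from %HADOOP_HOME%\\bin")

    return problems
```

An empty result means the prerequisites described above are in place; each string in the result names a missing piece.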

Databricks bulk data load configuration

At runtime, when a query involves a bulk load to Databricks, Virtual DataPort does two things:

  1. It uploads the temporary data files to the Databricks file system (DBFS). To do this, it executes the command dbfs cp ... locally, on the host where the Virtual DataPort server runs.

  2. It creates the tables in Databricks associated with the data uploaded in step 1.
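Conceptually, the two runtime steps can be sketched as follows. The table name, file paths, and exact DDL here are illustrative assumptions; the real commands issued by Virtual DataPort are internal to the server:

```python
def dbfs_upload_command(local_file, dbfs_uri, executable="dbfs"):
    # Step 1: copy the temporary data file to DBFS. This runs locally,
    # on the host where the Virtual DataPort server runs ("dbfs cp ...").
    # --overwrite replaces any leftover file from a previous load.
    return [executable, "cp", "--overwrite", local_file, dbfs_uri]

def create_table_ddl(table, dbfs_location, table_format="PARQUET"):
    # Step 2: declare a table in Databricks over the uploaded files.
    # The format corresponds to the "Table format" option of the data source.
    return (f"CREATE TABLE {table} USING {table_format} "
            f"LOCATION '{dbfs_location}'")
```

For example, `dbfs_upload_command("/tmp/data.parquet", "dbfs:/user/databricks/warehouse/t1/")` yields the argument list for the local copy, and `create_table_ddl("t1", "dbfs:/user/databricks/warehouse/t1/")` yields the statement that exposes those files as the table t1.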

Mounting External File Systems on the DBFS

Databricks provides its own file system, which you can use to store the data of your tables. You can also use external object storage such as AWS S3 buckets, Azure Blob Storage or Azure Data Lake. Mounting object storage to DBFS allows you to access objects in the object storage as if they were on the DBFS.

Mounting an AWS S3 Bucket

This section explains how to access an AWS S3 bucket by mounting it on the DBFS. The mount is a pointer to an AWS S3 location.

  1. Get your AWS access and secret keys (section Access Keys (Access Key ID and Secret Access Key)).

  2. Go to your Databricks instance website: https://<my_databricks_instance>.cloud.databricks.com

  3. Create a new Python notebook. To do that, click Create a Blank Notebook.

  4. Copy and paste the following Python code into the notebook:

    ACCESS_KEY = "<aws-access-key>"
    SECRET_KEY = "<aws-secret-key>"
    AWS_BUCKET_NAME = "<aws-s3-bucket-name>"
    ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
    MOUNT_NAME = "s3_bucket"
    
    dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
    display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
    
  5. Set the values of your access key, secret key, and S3 bucket name.

  6. Click on Run All.

After running the notebook, your S3 bucket will be mounted in the following path: dbfs:/mnt/s3_bucket/.
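Once the bucket is mounted, the DBFS URI of the data source (Read & Write tab) can simply point inside the mount. A tiny sketch of building such a URI (the mount name follows the notebook above; the subdirectory name is a hypothetical example):

```python
def mount_uri(mount_name, subdir=""):
    # Build a dbfs:/mnt/... URI for a mounted bucket, optionally
    # pointing at a subdirectory inside the mount.
    base = f"dbfs:/mnt/{mount_name}"
    return f"{base}/{subdir.strip('/')}" if subdir else base
```

For example, mount_uri("s3_bucket", "denodo_bulk") yields dbfs:/mnt/s3_bucket/denodo_bulk, which can be entered as the DBFS URI of the data source.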

Note

The S3 bucket and the Databricks instance have to run in the same AWS region; otherwise, the process of inserting the data into Databricks will be much slower.

Unmounting an AWS S3 bucket

To unmount an AWS S3 bucket, create a new Python notebook with the following content:

MOUNT_NAME = "s3_bucket"
dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)

Note

You can mount other external object storage systems. For more information, visit the following link.