Using the Databricks API and DBFS for Bulk Loads¶
You can use the Databricks API to perform bulk data loads.
Important
This method is easier to set up, but it is significantly slower than the alternative described in the section Databricks when loading more than a few million rows.
To set it up, follow these steps on the host where Virtual DataPort runs.
Install Python 3.
Install the Python package manager (pip). It is included by default in Python 3.4 or later. If you do not have pip installed, you can download and install it from this page.
Execute this command on the command line to install the Databricks client:
pip install databricks-cli
Set up the Databricks client authentication. You need to get a personal access token to configure the authentication. Then, execute the following command using the same user account used to start Virtual DataPort:
databricks configure --token
This command creates a .databrickscfg configuration file in the user's home directory.
Execute the following command to check that everything works correctly:
dbfs ls
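Optionally, you can script this check. The following is a minimal sketch, assuming Python 3.7 or later; run it with the same user account used to start Virtual DataPort to confirm that the configuration file exists and that the dbfs client works under that account:
# Minimal sketch (assumption: Python 3.7 or later). Run it with the same OS account
# that starts Virtual DataPort.
import pathlib
import subprocess

# "databricks configure --token" stores its settings in ~/.databrickscfg.
config_file = pathlib.Path.home() / ".databrickscfg"
print("Configuration file found:", config_file.exists(), "-", config_file)

# Equivalent to running "dbfs ls" on the command line.
result = subprocess.run(["dbfs", "ls"], capture_output=True, text=True)
print(result.stdout if result.returncode == 0 else result.stderr)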
In Design Studio, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and provide this information:
Databricks executable location: enter dbfs. This is the path to the Databricks client (dbfs). Usually this utility is in the system PATH, so you do not need to enter the full path of the file.
DBFS URI: URI to which Virtual DataPort will upload the data file. For example:
dbfs://user/databricks/warehouse
Server time zone: select the time zone of the Databricks server.
Table format: select the file format used to create the tables in Databricks.
At runtime, when a query involves a bulk load to Databricks, Virtual DataPort does two things:
It uploads the temporary data files to the Databricks file system (DBFS). To do this, it executes the command dbfs cp ... locally, on the host where the Virtual DataPort server runs.
It creates the tables in Databricks associated with the data uploaded in step #1.
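To troubleshoot a bulk load, you can reproduce these two steps by hand. The following is a minimal sketch, not Virtual DataPort's actual implementation: the local file name, the DBFS path and the table definition are illustrative assumptions.
# Minimal sketch (illustrative only): the file name and DBFS path are assumptions.
import subprocess

local_file = "/tmp/denodo_bulk_load.parquet"
dbfs_target = "dbfs:/user/databricks/warehouse/denodo_bulk_load.parquet"

# Step 1: copy the temporary data file to DBFS, as Virtual DataPort does with "dbfs cp".
subprocess.run(["dbfs", "cp", "--overwrite", local_file, dbfs_target], check=True)

# Step 2: Virtual DataPort then creates a table in Databricks over the uploaded data,
# with a statement similar to (illustrative only):
#   CREATE TABLE <table_name> USING PARQUET LOCATION 'dbfs:/user/databricks/warehouse/'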
Mounting External File Systems on the DBFS¶
Databricks provides its own file system. You can use it to store the data of your tables. You can also use external object storage such as AWS S3 buckets, Azure Blob Storage or Azure Data Lake. Mounting object storage to DBFS allows you to access objects in the object storage as if they were on the DBFS.
Mounting an AWS S3 Bucket
This section explains how to access an AWS S3 bucket by mounting it on the DBFS. The mount is a pointer to an AWS S3 location.
Get your AWS access and secret keys (section Access Keys (Access Key ID and Secret Access Key)).
Go to your Databricks instance website:
https://<my_databricks_instance>.cloud.databricks.com
Create a new Python notebook. To do that, click Create a Blank Notebook.
Copy and paste the following Python code into the notebook:
ACCESS_KEY = "<aws-access-key>"
SECRET_KEY = "<aws-secret-key>"
AWS_BUCKET_NAME = "<aws-s3-bucket-name>"

# The secret key has to be URL-encoded before it is embedded in the mount URI.
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
MOUNT_NAME = "s3_bucket"

# Mount the bucket on the DBFS and list its contents to verify the mount.
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
Set the values of your access key, secret key and S3 bucket name.
Click Run All.
After running the notebook, your S3 bucket will be mounted at the following path: dbfs:/mnt/s3_bucket/
Note
The S3 bucket and the Databricks instance have to run in the same AWS region; otherwise, the process of inserting the data into Databricks will be much slower.
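Once the bucket is mounted, the DBFS URI configured in the data source for bulk loads can point inside the mount. The following is a minimal sketch, assuming it runs in a Databricks notebook; the folder name denodo_bulk_loads is an illustrative assumption.
# Minimal sketch (assumption: run in a Databricks notebook; the folder name is illustrative).
# Create a folder inside the mounted bucket and verify that it is visible through the mount.
dbutils.fs.mkdirs("/mnt/s3_bucket/denodo_bulk_loads")
display(dbutils.fs.ls("/mnt/s3_bucket"))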
Unmounting an AWS S3 Bucket
To unmount an AWS S3 bucket, create a new Python notebook with the following content:
MOUNT_NAME = "s3_bucket"
dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)
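If you are not sure of the mount point name, the following minimal sketch, assuming it runs in a Databricks notebook, lists the existing mounts before you unmount one:
# Minimal sketch (assumption: run in a Databricks notebook): list the current mounts
# to check the mount point name before unmounting it.
for mount in dbutils.fs.mounts():
    print(mount.mountPoint, "->", mount.source)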
Note
You can mount other external object storage systems. For more information, visit the following link.