To configure a Databricks data source to perform bulk data loads, follow the same process described for Spark. This is the way recommended by Databricks.
Alternatively, you can use the Databricks API to perform bulk data loads. To do so, install the Databricks client on the host where Virtual DataPort runs by following these steps:
Install Python 3.
Install the Python package manager (pip). It is included by default in Python 3.4 and later. If you do not have pip installed, you can download and install it from this page.
Install the Databricks client. To do this, open a command line and execute the following command:
pip install databricks-cli
Set up the Databricks client authentication. You need to get a personal access token to configure the authentication. Then run the following command in the command line:
databricks configure --token
Execute the following command to check that everything works:
Only on Windows:
This feature requires the winutils.exe binary from a Hadoop distribution. Only the winutils files are mandatory; the rest of the Hadoop distribution is not required.
Check that the file <HADOOP_HOME>/bin/winutils.exe exists. Otherwise, the bulk data load will fail.
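The prerequisites above can be checked with a short script before enabling bulk data loads. The following is a minimal sketch and not part of Virtual DataPort: the function name check_bulk_load_prerequisites is hypothetical, and the sketch assumes that the client executable is called dbfs and that, on Windows, HADOOP_HOME is set as an environment variable.

```python
import os
import platform
import shutil


def check_bulk_load_prerequisites(env=None, system=None, which=shutil.which):
    """Return a list of problems found; an empty list means none were detected."""
    env = os.environ if env is None else env
    system = platform.system() if system is None else system
    problems = []

    # The Databricks client provides the "dbfs" executable used for uploads.
    if which("dbfs") is None:
        problems.append("'dbfs' executable not found in the system path")

    # Only on Windows: winutils.exe must exist under <HADOOP_HOME>/bin.
    if system == "Windows":
        # Assumption: HADOOP_HOME is defined as an environment variable.
        hadoop_home = env.get("HADOOP_HOME")
        if not hadoop_home:
            problems.append("HADOOP_HOME is not set")
        elif not os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")):
            problems.append("winutils.exe not found under HADOOP_HOME/bin")

    return problems
```

Calling check_bulk_load_prerequisites() with no arguments inspects the current host. The optional parameters exist only so that the check can be exercised without changing the real environment.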
Then, in the administration tool, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and:
In the Databricks executable location box, enter the path to the Databricks client executable. This executable is included in the system path by default. In this case, just enter dbfs in this field.
In DBFS URI, enter the URI to which Virtual DataPort will upload the data file. For example:
In the Server time zone combo box, select the time zone of the Databricks server.
In the Table format combo box, select the file format used to create the tables in Databricks.
At runtime, when a query involves a bulk load to Databricks, Virtual DataPort does two things:
It uploads the temporary data files to the Databricks file system (DBFS). To do this, it executes the command dbfs cp ... locally, on the host where the Virtual DataPort server runs.
It creates the tables in Databricks associated with the data uploaded in step 1.
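To illustrate the first step, the following sketch builds and runs a dbfs cp command from Python. It is an illustration only: the exact arguments that Virtual DataPort passes are not shown in this section, and the function names, the use of the --overwrite option, and the example paths are assumptions.

```python
import subprocess


def build_upload_command(local_file, dbfs_uri, executable="dbfs"):
    """Build the argument list that copies one local file to a DBFS directory."""
    # Take the file name from the local path (works for "/" and "\\" separators).
    file_name = local_file.replace("\\", "/").rsplit("/", 1)[-1]
    # Join the target directory and the file name with exactly one slash.
    target = dbfs_uri.rstrip("/") + "/" + file_name
    return [executable, "cp", "--overwrite", local_file, target]


def upload(local_file, dbfs_uri, executable="dbfs"):
    """Run the copy on the local host; raises CalledProcessError if the client fails."""
    command = build_upload_command(local_file, dbfs_uri, executable)
    subprocess.run(command, check=True)
    return command
```

For example, build_upload_command("/tmp/data_1.tmp", "dbfs:/tmp/vdp/") produces a command that copies the file to dbfs:/tmp/vdp/data_1.tmp. Both paths are hypothetical.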
Mounting external file systems on the DBFS
Databricks provides its own file system. You can use it to store the data of your tables. If you want, you can also use external object storage such as AWS S3 buckets, Azure Blob Storage, or Azure Data Lake. Mounting object storage to DBFS allows you to access its objects as if they were on the DBFS.
Mounting an AWS S3 bucket
This section explains how to access an AWS S3 bucket by mounting it on the DBFS. The mount is a pointer to an AWS S3 location. Follow these steps to mount an AWS S3 bucket:
Get your AWS access and secret keys (see the section Access Keys (Access Key ID and Secret Access Key)).
Go to your Databricks instance website:
Create a new Python notebook. To do that, click Create a Blank Notebook.
Copy and paste the following Python code into the notebook:
ACCESS_KEY = "<aws-access-key>"
SECRET_KEY = "<aws-secret-key>"
AWS_BUCKET_NAME = "<aws-s3-bucket-name>"
# Encode any "/" in the secret key so that it can be embedded in the s3a:// URI.
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
MOUNT_NAME = "s3_bucket"
# Mount the bucket under /mnt/<MOUNT_NAME>, then list its contents to verify the mount.
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
display(dbutils.fs.ls("/mnt/%s" % MOUNT_NAME))
Set the values of your access key, secret key, and S3 bucket name.
After running the notebook, your S3 bucket will be mounted in the following path:
The S3 bucket and the Databricks instance have to run in the same AWS region; otherwise, the process of inserting the data into Databricks will be much slower.
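Running the mount notebook a second time typically fails if the mount point already exists. One possible way to make the notebook repeatable, sketched below, is to check dbutils.fs.mounts() before mounting. The helper name mount_if_absent is hypothetical; dbutils is the object that Databricks provides inside notebooks, so it is passed in as a parameter here.

```python
def mount_if_absent(dbutils, source, mount_point):
    """Mount `source` at `mount_point` unless something is already mounted there.

    Returns True if a new mount was created and False if it already existed.
    """
    # dbutils.fs.mounts() lists the current mounts; each entry has a mountPoint attribute.
    existing = {mount.mountPoint for mount in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source, mount_point)
    return True
```

In the notebook, the call to dbutils.fs.mount could then be replaced with mount_if_absent(dbutils, "s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME).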
Unmounting an AWS S3 bucket
To unmount an AWS S3 bucket, create a new Python notebook with the following content:
MOUNT_NAME = "s3_bucket"
dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME)
You can also mount other external object storage systems. For more information, see the Databricks documentation.