Hadoop-Compatible Storage¶
There are several use cases that may require access to an object storage. For example:

- Accessing data in Parquet or Delta format using the embedded MPP
- Accessing DF/JSON/XML/Excel data sources
- Configuring an object storage for Bulk Data Load
If the object storage is different from HDFS, S3, or ADLS Gen2 but is compatible with the Hadoop API, you can still access it by selecting the HDFS option and specifying the appropriate Hadoop properties. For example, you can work with Azure Blob File System or Google Cloud Storage. The steps to use these routes are:
Configure the connection according to the Azure Blob File System type:
URI syntax: `wasb://<container>\@<account_name>.blob.core.windows.net/<path>/<file_name>`.
Note
For DF/JSON/XML/Excel data sources using HDFS paths, the “@” character in the URI must be escaped as shown in the example, to avoid confusion with an environment variable. This does not apply if the configuration is for Bulk Data Load or for object storage data in Parquet and Delta format.
Configure the following Hadoop properties. The Hadoop documentation lists the available authentication methods and the properties that configure them. Here is an example with a shared key:
| Name | Value |
| --- | --- |
| fs.azure.account.key.<account_name>.blob.core.windows.net | <Access key> |
| fs.azure.always.use.ssl | false |
Note

SSL usage can be enabled either by setting the property fs.azure.always.use.ssl to true, or by accessing the resource through a route like `wasbs://<container>\@<account_name>.blob.core.windows.net/<path>/<file_name>` (in this alternative, the property should be removed).
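As a sketch, the shared-key properties from the table above could also be supplied as a Hadoop `core-site.xml` fragment. The account name and key below are placeholders, not values from this document:

```xml
<!-- Hypothetical core-site.xml fragment: shared-key access to Azure Blob Storage.
     Replace mystorageaccount and YOUR_ACCESS_KEY with your own values. -->
<configuration>
  <property>
    <name>fs.azure.account.key.mystorageaccount.blob.core.windows.net</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <!-- Set to true, or use a wasbs:// URI instead, to force SSL. -->
    <name>fs.azure.always.use.ssl</name>
    <value>false</value>
  </property>
</configuration>
```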
Configure the connection according to the Google Cloud Storage type:
URI syntax: `gs://<bucket>/<path>/`.

Configure the following Hadoop properties. The Hadoop documentation lists the available authentication methods and the properties that configure them. Here is an example with JSON keyfile service account authentication:
| Name | Value |
| --- | --- |
| google.cloud.auth.service.account.enable | true |
| google.cloud.auth.service.account.json.keyfile | <JSON keyfile path> |
| fs.gs.impl.disable.cache | true |
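Likewise, a sketch of these properties as a Hadoop `core-site.xml` fragment; the keyfile path is a placeholder:

```xml
<!-- Hypothetical core-site.xml fragment: JSON keyfile service account
     authentication for Google Cloud Storage. Adjust the keyfile path. -->
<configuration>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value>/path/to/keyfile.json</value>
  </property>
  <property>
    <!-- Disable the FileSystem cache so property changes take effect. -->
    <name>fs.gs.impl.disable.cache</name>
    <value>true</value>
  </property>
</configuration>
```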
For DF/JSON/XML/Excel data sources using HDFS paths, select None in the authentication section.