BigQuery
Follow these steps to configure a BigQuery data source to perform bulk data loads:
When Virtual DataPort runs on Windows, this feature requires the Microsoft Visual C++ 2013 Redistributable Package. If you are unsure whether it is installed, download and run the installer anyway; it will detect an existing installation.
Log in to Virtual DataPort, edit the JDBC data source and click the Read & Write tab. Select Use bulk data load APIs and provide the following information:
If, in the Connection tab, Authentication is Service-based OAuth Authentication, do this:
In the GS URI text field, enter the URI to which the data source will upload the data file. For example:
gs://bucket-multiregion-us/user/admin/data/
In Private key, select the JSON file that contains your service account key.
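For reference, a service account key file downloaded from the Google Cloud console is a JSON document similar to the following sketch (abbreviated; my-project and my-service-account are placeholder names):

{
  "type": "service_account",
  "project_id": "my-project",
  "private_key_id": "0123456789abcdef...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "my-service-account@my-project.iam.gserviceaccount.com",
  "client_id": "123456789012345678901",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token"
}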
For the other authentication methods, do this:
In the File System drop-down, select the file system that will be used to upload the data files. In this case, the only option available is HDFS.
In HDFS URI, enter the URI to which the data source will upload the data file. For example:
gs://bucket-multiregion-us/user/admin/data/
In the Hadoop properties section, set the Hadoop properties that are normally defined in Hadoop configuration files such as core-site.xml.
Set the following Hadoop properties:
google.cloud.auth.service.account.enable = true
google.cloud.auth.service.account.json.keyfile = <path to the JSON file that contains your service account key>
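In the data source these are entered as name/value pairs; for reference, the same settings expressed in core-site.xml syntax would look roughly like the sketch below (the key file path is a placeholder):

<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/opt/denodo/conf/my-service-account-key.json</value>
</property>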
At runtime, when a query involves a bulk load to BigQuery, Virtual DataPort uploads the data in chunks and in parallel, instead of generating a single file with all the data and transferring it to BigQuery only once it has been completely written to disk.
Let us say that you execute a query to load the cache of a view and that the cache database is BigQuery. The process of loading the cache is the following:
The VDP server connects to BigQuery using the credentials of the JDBC data source and executes a CREATE TABLE statement.
As soon as VDP obtains the first rows of the result, it starts writing them to the data file. This file is written to disk in Parquet format.
Once the Server writes 5,000,000 rows into the file, it closes the file and begins sending it to the location defined in the BigQuery data source. Simultaneously, it writes the next rows into another file.
5,000,000 is the default value of the field Batch insert size (rows) on the Read & Write tab of the data source, and it can be changed. Note that increasing it reduces the parallelism of the upload process.
The VDP server connects to Google Cloud Storage using the credentials from the JSON file referenced in the Hadoop properties.
When the Server finishes writing another data file (i.e., when the file reaches the batch insert size), it begins to transfer it to Google Cloud Storage (GCS), to the location indicated in the GS URI (or HDFS URI) field, even if the files generated previously have not been completely transferred. By default, the Server transfers up to 10 files concurrently per query. When it reaches this limit, the Server keeps writing rows into the data files, but does not transfer more files.
This limit can be changed by executing the following command from the VQL Shell:
SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '<new limit per query>';
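For example, to allow up to 20 concurrent uploads per query (20 is just an illustrative value, not a recommendation):

SET 'com.denodo.vdb.util.tablemanagement.sql.insertion.DFInsertWorker.hdfs.maxParallelUploads' = '20';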
Once all the files are uploaded, the data is loaded into the table created in BigQuery; the cache table then contains the cached data.
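As an illustration of this scenario, a query like the one below would trigger the process described above. This is a minimal sketch: it assumes a view named customer with caching enabled and BigQuery configured as the cache database, and it uses VQL cache hints to preload the cache and wait for the load to finish.

SELECT * FROM customer CONTEXT('cache_preload' = 'true', 'cache_wait_for_load' = 'true');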
Note
If a different catalog/schema needs to be used for the caching or data movement process, note that only one catalog appears in the drop-down list of the Read & Write tab. This is because the project name must be included in the connection URI when creating the data source (in BigQuery, the project plays the role of the catalog), and that project is the only catalog listed in the drop-down.
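For example, a connection URI that includes the project could look like the line below. This is only a sketch that assumes a Simba-based BigQuery JDBC driver with service account authentication; my-project, the service account e-mail and the key path are placeholders, and the exact parameter names depend on the driver version.

jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=my-project;OAuthType=0;OAuthServiceAcctEmail=my-service-account@my-project.iam.gserviceaccount.com;OAuthPvtKeyPath=/opt/denodo/conf/my-service-account-key.json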