The Data Movement optimization provides a way to execute federated queries more efficiently. When a query involves two views and one of them is much larger than the other, Virtual DataPort can transfer the data of the smaller view into the data source of the larger view and execute the operation in the second data source. This may offer great performance improvements.
The performance of the following operations can be improved with this optimization:
Let us consider the following example:
We have a view product in the data source DS1 that contains one million rows. This view has an attribute called category. There are one thousand products in the category “electronics”.
We have a view sale in the data source DS2 that contains one billion rows. The view sale has an attribute product_id that indicates the product sold in a particular sale. Therefore, on average, we have 1000 sales for each product.
The following query obtains the amount of sales of products of the “electronics” department:
SELECT SUM(s.amount) FROM Product p JOIN Sale s ON p.id = s.product_id WHERE p.category = 'electronics'
The usual strategy for this JOIN should be NESTED. In this case, Denodo would automatically use the optimization of the NESTED join that groups several ids of the outer relation (product) when querying the inner relation (sale). More precisely, these would be the execution steps (see Data movement: nested join operation):
Denodo executes in
DS1a query like
SELECT id FROM product WHERE p.category = 'electronics'
This query will return 1000 rows.
Denodo groups the 1000
product_idsobtained in step 1 into five groups, each one having 200 ids. For each group, it executes on DS2 a query like
SELECT s.amount FROM Sale WHERE product_id IN (id1,...,id200)
Step 2 will produce one million rows (200,000 for each group) that are transferred from DS2 to Denodo
Denodo post-processes the one million rows obtained in step 3 to obtain the total amount.
By using the Data Movement optimization, the following alternative is possible (see Data movement: join optimized with Data movement):
Denodo queries the product view with the condition
c.category = 'electronics'. This query will return 1000 rows.
Denodo inserts the 1000 rows obtained in step1 in a table (
TEMP_PROD) in DS2.
Now, Denodo can delegate the entire query to DS2, to obtain the final result, which consists of a single row:
SELECT SUM(s.amount) FROM tempProduct p JOIN Sale s ON p.id = s.product_id WHERE p.category = 'electronics'
The target of a data movement can be one of these:
- A JDBC data source involved in the view and whose database is one of the DBMSs supported by the Cache Engine (the section Cache Module lists them).
- A JDBC data source with the adapter Hive 2.0.0.
- The database used by the Cache Engine:
This condition ensures that Virtual DataPort has the correct mappings over the column types and knows how to create tables correctly and insert data in them.
Besides, the data source configuration must contain all the necessary information to be able to create a table in it. This means that the connection URI has to contain the database name.
To create a view with a Data Movement you need the following privileges:
- Read privilege over all the views to be moved. That is, the ones
specified in the
DATAMOVEMENTPLANproperty of the
- Read and write privileges over the target data source. To have them, the
user that executes the query must be one of these:
- The data source owner
- An Administrator user
- Or, a Local Administrator of the Virtual DataPort database
The Data movement of a view is defined in the Execution plan tab of the “Options” dialog of the view.
In the Defining the data movement of a view, there is a view
full_order_info with two data movements defined. When a user queries
this view, the execution engine will create two tables in the
database of source
ds_jdbc_vertica. Then, it will retrieve the data
of the view
employee and insert it in one of the tables in
ds_jdbc_vertica; and it will the data from
product in the other
table. After this, the execution engine will be able to
delegate the two joins to
Once the query finishes, Virtual DataPort will delete the
To enable the movement of data of a view, select the check box Enable data movement and, in the box below, select the data source that represents the database where the table will be created.
Options of the CONTEXT Clause that Control a Data Movement¶
CONTEXT clause of the queries has three clauses that control data movements:
offand the execution engine is going to perform a data movement for this query, the execution engine will not use the bulk load API of the target database. Instead, it uses INSERT statements to insert the data.
on. If the target data source is configured to use the bulk load API of the data source, the execution engine uses it by default.
trueand the execution engine uses the bulk load API of the database to perform a data movement, once the data movement finishes, it will remove the delimited files generated to perform this operation. If
false, it does not remove these files.
This property only affects the data movements to databases for which the data is written to a temporary file before loading it. For example, Denodo performs data movements to HP Vertica without writing the data to disk. Therefore, this property does not modify the behavior of the data movement. The section Bulk Data Load has a subsection for each database that explain if the execution engine creates a temporary data file or not.
Note that the files generated for data movements contain all the data to be inserted. Therefore, we do not recommend setting this property to
falseif the data movement of a query involves millions of rows and it is executed very often (for example, from a job of Denodo Scheduler). As the files will use a lot of space, the hard drive could be filled with the files generated for each execution of the query.
false, the execution engine uses the bulk load API of the database to perform a data movement, and the data movement fails, the execution engine will not remove the delimited files generated to perform this operation. If the data movement succeeds, the execution engine removes these files.
true. With this value, the execute engine will delete these files regardless of if the statement fails or not, unless you set the property
This option is useful to debug problems when setting up a job in Denodo Scheduler that involves data movement. It will allow you to debug issues in jobs that fail but without having to keep the data files of queries that succeed.