Data Movement¶
The Data Movement optimization provides a way to execute federated queries more efficiently. When a query involves two views and one of them is much larger than the other, Virtual DataPort can transfer the data of the smaller view into the data source of the larger view and execute the operation in the second data source. This may offer great performance improvements.
The performance of the following operations can be improved with this optimization:
Join
Union
Minus
Intersect
Let us consider the following example:
We have a view product in the data source DS1 that contains one million rows. This view has an attribute called category. There are one thousand products in the category “electronics”.
We have a view sale in the data source DS2 that contains one billion rows. The view sale has an attribute product_id that indicates the product sold in a particular sale. Therefore, on average, we have 1000 sales for each product.
The following query obtains the amount of sales of products of the “electronics” department:
SELECT SUM(s.amount)
FROM Product p JOIN Sale s ON p.id = s.product_id
WHERE p.category = 'electronics'
The usual strategy for this JOIN should be NESTED. In this case, Denodo would automatically use the optimization of the NESTED join that groups several ids of the outer relation (product) when querying the inner relation (sale). More precisely, these would be the execution steps (see Data movement: nested join operation):
Denodo executes in
DS1
a query likeSELECT id FROM product WHERE p.category = 'electronics'
This query will return 1000 rows.
Denodo groups the 1000
product_ids
obtained in step 1 into five groups, each one having 200 ids. For each group, it executes on DS2 a query likeSELECT s.amount FROM Sale WHERE product_id IN (id1,...,id200)
Step 2 will produce one million rows (200,000 for each group) that are transferred from DS2 to Denodo
Denodo post-processes the one million rows obtained in step 3 to obtain the total amount.
By using the Data Movement optimization, the following alternative is possible (see Data movement: join optimized with Data movement):
Denodo queries the product view with the condition
c.category = 'electronics'
. This query will return 1000 rows.Denodo inserts the 1000 rows obtained in step1 in a table (
TEMP_PROD
) in DS2.Now, Denodo can delegate the entire query to DS2, to obtain the final result, which consists of a single row:
SELECT SUM(s.amount) FROM tempProduct p JOIN Sale s ON p.id = s.product_id WHERE p.category = 'electronics'
The target of a data movement can be one of these:
A JDBC data source involved in the execution of the view and whose database adapter is supported by the Cache Engine (the section Cache Module lists them).
The database used by the Cache Engine:
vdpcachedatasource
orcustomvdpcachedatasource
.
This condition ensures that Virtual DataPort has the correct mappings over the column types and knows how to create tables correctly and insert data in them.
Besides, the data source configuration must contain all the necessary information to be able to create a table in it. This means that the connection URI has to contain the database name.
To create a view with a Data Movement you need the following privileges:
Read privilege over all the views to be moved. That is, the ones specified in the
DATAMOVEMENTPLAN
property of theCONTEXT
clause.Execute privilege over the target data source.
Forcing the Data Movement of a View¶
The Data movement of a view is defined in the Data movement tab of the “Options” dialog of the view.
In the Defining the data movement of a view, there is a view
full_order_info
with two data movements defined. When a user queries
this view, the execution engine will create two tables in the
database of source ds_jdbc_vertica
. Then, it will retrieve the data
of the view employee
and insert it in one of the tables in
ds_jdbc_vertica
; and it will the data from product
in the other
table. After this, the execution engine will be able to
delegate the two joins to ds_jdbc_vertica
.
Once the query finishes, Virtual DataPort will delete the
tables from ds_jdbc_vertica
.
To enable the movement of data of a view, select the check box Force data movement and, in the box below, select the data source that represents the database where the table will be created.
Options of the CONTEXT Clause that Control a Data Movement¶
The CONTEXT
clause of the queries has three clauses that control data movements:
data_movement_bulk_load
. Ifoff
and the execution engine is going to perform a data movement for this query, the execution engine will not use the bulk load API of the target database. Instead, it uses INSERT statements to insert the data.Default value:
on
. If the target data source is configured to use the bulk load API of the data source, the execution engine uses it by default.data_movement_clean_resources
. Iftrue
, once a data movement finishes, the execution engine will remove the temporary table generated to perform the data movement. If the execution engine uses the bulk load API of the database to perform the data movement, it will remove the delimited files generated to perform this operation. Iffalse
, it will not remove the temporary table nor these files.Default value:
true
.Note that the files generated for data movements contain all the data to be inserted. Therefore, we do not recommend setting this property to
false
if the data movement of a query involves millions of rows and it is executed very often (for example, from a job of Denodo Scheduler). As the files will use a lot of space, the hard drive could be filled with the files generated for each execution of the query.data_movement_clean_resources_on_error
. Iffalse
and a data movement fails, the execution engine will not remove the temporary table generated to perform the data movement. If the execution engine uses the bulk load API of the database to perform the data movement, it will not remove the delimited files generated to perform this operation. If the data movement succeeds, the execution engine removes these files and the temporary table.Default value:
true
. With this value, the execute engine will delete these files and the temporary table regardless of if the statement fails or not, unless you set the propertydata_movement_clean_resources
tofalse
.This option is useful to debug problems when setting up a job in Denodo Scheduler that involves data movement. It will allow you to debug issues in jobs that fail but without having to keep the data files of queries that succeed.
Note
VirtualDataPort utilizes standard tables for Data Movements.
When performing data movements to SQLServer, it is possible to use Global Temporary Tables. To enable this feature, execute the following command:
SET 'com.denodo.vdb.util.tablemanagement.sql.SQLServerTableManager.useGlobalTemporaryTables' = 'true';
There is no need to restart the VDP. The property value is effective just after executing the above command
Customize the Data Movement Table Creation Command¶
Denodo uses built-in SQL sentences to create the data movement tables. It is possible to customize the command used to create these tables. You have to edit the data movement target data source to set a custom table creation template for data movements.
ALTER DATASOURCE JDBC my_data_source
CREATE_TABLE_TEMPLATES (
my_data_movements_template = '<my template definition>'
)
DEFAULT CREATE_TABLE_TEMPLATES (DATA_MOVEMENT = my_data_movements_template);
Warning
Before editing the table creation templates, you should read the conventions to modify the table creation templates using an ALTER DATASOURCE command.
See section Cache Table Creation Templates for more information.