VDP Extraction Section¶
To configure the extraction section for VDP-type jobs a VDP-type data source needs to be selected. Once selected, the query to be performed against the VDP server needs to be specified using a parameterized query statement (Parameterized query field).
A parameterized query is a query expressed in the server’s query language (VQL), which can include variables prefixed with the character @ (the detailed syntax for including variables is explained in the section Syntax of Parameterized Queries). A parameterized query that includes variables represents a group of queries that you want to execute against the server. The different queries will be generated by replacing each variable with a value included in a list of input values. The lists of input values are obtained from a data source (see section Data Sources).
Example: Suppose that the next parameterized query has been
configured to obtain data from a view called CLIENTS
in the Virtual DataPort
server:
SELECT *
FROM CLIENTS
WHERE taxId = '@TaxId'
Suppose also that a data source for accessing a CSV file that contains a list of Tax Ids as follows has been configured:
and that these values have been associated with the variable @TaxId
(the section Configuring the Values to Be Used in a Parameterized Query
explains in detail how to do it).
Then Denodo Scheduler would generate the following queries on Virtual DataPort
to get all the job data:
SELECT * FROM CLIENTS WHERE TaxId = 'B78596011';
SELECT * FROM CLIENTS WHERE TaxId = 'B78596012';
SELECT * FROM CLIENTS WHERE TaxId = 'B78596013';
The section Configuring the Values to Be Used in a Parameterized Query describes the sources that can be used to assign values to variables.
Each source can provide value to one or more variables simultaneously. Where there are different sources, Scheduler generates as many queries from the parameterized statement as possible combinations in the data returned from each of the sources that provide values to variables.
Additionally, it is possible to configure the number of query combinations to run from the same parameterized query and the level of concurrence in the execution:
Maximum number of iterations. Specifies the maximum number of queries to be generated from the parameterized query statement specified. If the value specified is greater than the number of query combinations generated, it is ignored. If it is not specified, then all the combinations are executed.
Maximum number of concurrent iterations. Specifies the maximum number of queries that will be executed in parallel from the queries generated by the parameterized query. The concurrent execution is performed by blocks, i.e. while the execution of the queries of the first block has not ended, execution of the next block of queries is not initiated. If this value is not specified, all queries will be executed sequentially.
It is important to note that Scheduler logs query combinations that have been successfully executed to distinguish them from those that have not been executed yet or have returned some type of error. Therefore, it is possible to associate a handler for retries with the job (see section Handler Section) that repeats the execution of those queries that have returned an error in their last execution and those that have not been executed yet (useful in the event that, for unknown reasons, the server has performed its execution in an irregular way). It will also be possible to force the execution of failed queries from the “Scheduler” perspective using the action Start with state (see section Jobs). In both cases, if the option Refresh parameter values on retry is checked (default configuration), the sources will be queried in each execution (including retries) and their values will be retrieved in blocks, while if it is not checked, their values are read at once, cached and used in future retries or when starting with state. Note that in both cases only combinations of values that failed will be used for the execution of a retry or a new execution with state. As a recommendation, if the values from the sources change rarely 1 or they are too many, keep this option checked in order to reduce memory consumption.
Syntax of Parameterized Queries¶
A parameterized query is an expression depending on variables which generates a character string representing a query as a result. Variables are specified by prefixing them with the symbol ‘@’, followed by the name of the variable, provided that this name is an alphanumeric character string (letters and the characters ‘#’ and ‘_’). Variables with a name that includes any other character can be specified including the name between the symbols ‘@{‘ and ‘}’.
Note
When any of the symbols ‘@’, ‘\’, ‘^’, ‘{‘, ‘}’ appear in the constant parts of the parameterized statement, they must be escaped by the character ‘\’ (i.e. \@, \\, \^, \{, \}).
Configuring the Values to Be Used in a Parameterized Query¶
Parameterized queries can obtain their values from different data sources. Scheduler allows data to be obtained from a CSV file, from a query against a database or from a manually introduced list of values. The configuration needed for each type of source is as follows:
CSV. A CSV data source that has been created previously.
DATABASE (JDBC or VDP). A JDBC or VDP data source needs to be selected and a non-parameterized query to be executed against the database (Query (non-parameterized) field) specified.
LIST. A list of values needs to be specified (Values field) separated by the character specified in the Separator field.
In the case of the LIST type, each tuple only consists of a field and, therefore, can only assign values to a variable. In the case of DATABASE and CSV sources, a tuple can include various fields and, therefore, can assign values to more than one variable.
Example: Suppose that the next parameterized query has been
configured to obtain data from a view called COMPANY
in the Virtual DataPort
server:
SELECT *
FROM COMPANY
WHERE NAME = '@COMPANYNAME' AND INDUSTRY = '@COMPANYINDUSTRY'
Suppose that a data source for accessing a CSV file that contains a list of tuples with two fields each has also been configured. The data in the CSV file is as follows:
COMPANYNAME;COMPANYINDUSTRY
Denodo;Information Technologies
Acme Drinks;Beverages
Then the CSV source fields could be assigned to the variables so that they generate queries as follows to obtain all the job data:
SELECT * FROM COMPANY WHERE NAME = 'Denodo' AND INDUSTRY = 'Information Technologies';
SELECT * FROM COMPANY WHERE NAME = 'Acme Drinks' AND INDUSTRY = 'Beverages';
Once the data sources have been added and configured, it is necessary to define for which query variables does each source return values. This can be done in two different ways:
Implicit association. This type of association is only applicable for those sources that return tuples with field names (DATABASE and CSV files that specify header). In these cases, it is assumed that the variables used in the parameterized query have the same name as some of the fields returned by the data sources.
Explicit association. LIST sources allow defining which Query Parameter binds.
It is important to take into account that several sources cannot return values for the same variable.
Note
In previous versions of Scheduler it was possible to create explicit associations for CSV and DATABASE sources defining mappings between the variables in the query (Query Parameter) and the name of the fields in the source (Source Parameter). They are no longer supported in Scheduler 8.0, but the mappings present in jobs imported from previous versions will keep working and they will be displayed in this section. The user can continue using them or remove them and make the pertinent modifications in the job configuration.
Data Schema Generated by VDP Jobs¶
VDP jobs return the following fields in all its tuples, in addition to its own fields:
_$job_project (text). Name of the project pertaining to the job.
_$job_name (text). Name of the job.
_$job (numerical). Identifier of the job.
_$job_start_time (numerical). Time (in milliseconds) when the job was first executed.
_$job_retry_start_time (date). Time at which the current job execution started.
_$job_retry_count (numerical). Number of the current retry execution.
When the job exports the extracted data to an Elasticsearch data source, the following fields are not exported:
_$job_project.
_$job_retry_start_time.
_$job_retry_count.
- 1
If the “Refresh parameter values on retry” option is checked and the sources change between retries, the retried values could be different from the ones that failed.