VDPDataLoad Extraction Section

To configure the extraction section for VDPDataLoad-type jobs a VDP-type data source needs to be selected. Once selected, several load processes to refresh the data of VDP views may be created, by clicking on Add Load Process.

Additionally, it is possible to configure the number of load processes to run concurrently during the job execution (Concurrency Level).

When clicking on Add Load Process, choose the type of the load process to create:

  • JDBC Table: does an incremental or complete data refresh in a JDBC or Remote table.

  • Summary: does an incremental or complete data refresh in a Summary.

JDBC Table Load Process

The configuration of each load process (Example of the extraction section of a VDPDataLoad job) consists of:

  • The VDP Database. Allows to specify the VDP database used to retrieve views in view combo box.

  • The VDP View to refresh or insert data. JDBC and Remote table views in this combo box are retrieved from the selected database in database combo box. When VDP database combo box is not available, views in this combo box are retrieved from the VDP data source configured for the job (all views from every database for which the user has permissions are listed, not only the ones from the database specified in the data source). Note that the corresponding VDP server must be running (there is a refresh button that allows reloading the views from the VDP data source at any moment).

  • The query Options. Allows to specify properties for the OPTIONS clause of the query sent to VDP, to configure the insertion. Some data sources do not support some parameters. The available insert options are:

    • batch_insert_size: the number of tuples of each batch insert.

    • parquet_row_group_size: the row group size in megabytes. The property is only valid for the data sources for which Virtual DataPort uses parquet files to insert the data: Athena, Hive, Impala, Presto, and Spark.

  • The Incremental selector. This selector only appears if the selected view is a Remote Table. Regular JDBC views always do an incremental load, so the selector will not be available.

    • If enabled, an incremental data load will be performed. The data obtained from the load query will be inserted in the selected view. Previous data in the view will not be deleted. Fields to define the load query will be available.

    • If disabled, all data of JDBC or Remote Table view will be refreshed, using the default load query. That means all current data in the table will be deleted, and data obtained from default load query will be inserted.

Incremental load configuration

Following fields are available only when incremental selector is enabled:

  • The Load Query Selector. Choose “Custom conditions” or “Custom Load Query” options, depending on how you want to create the load query used to extract the data to be inserted in the selected view. This selector only appears if the base view is a Remote Table. Regular JDBC views always do an incremental load with Custom Load Query.

    • If Custom Load Query option is selected, the user will define manually the load query.

      • By default, its value will be the load query used to create the Remote Table, or empty when the selected view is not a Remote Table. The user can modify this default value to obtain the desired data. The result of this query will be the rows inserted in the JDBC or Remote Table.

      • When defining the custom load query, OFFSET, LIMIT and FETCH clauses are not supported, since those clauses are not supported in an INSERT sentence.

    • If Custom Conditions is selected, the load query will be the result of adding to the where clause of the load query used when the table was created, the conditions clause composed from the Field Name and Custom Conditions values selected by the user.

      • The Field Name. If you select a field from the view, then the condition fieldName > @LAST_REFRESH_DATE will be added to the load query. Field Name is optional if the user selects Custom Conditions.

      • The Custom Conditions. It is the where part of the load query. Custom conditions are optional if the user already selected a Field Name, and there are two combo boxes to help the user to create them:

        • One to add fields of the selected view to the condition (mandatory fields are indicated). This combo is updated every time a new view is selected (if the VDP data source is accessible).

        • Another one to add operators to the condition. This combo contains all the possible operators that can be used in a condition.

        You can use the variable \@LAST_REFRESH_DATE to reference the time of the last execution, when doing a incremental load.

  • An option to Enable update on duplicate keys. When enabled, the rows with the same values as the specified fields will be updated, instead of doing an insert. Optional Update on duplicate keys field allows to select those fields. If there are no values in this field, primary key is selected, when available.

Generated Queries

Given the value of these fields, the query sent to the VDP data source will be as follows:

Non incremental

REFRESH view
[ OPTIONS ( <option information> [ , <option information> ]* ) ]
TRACE

where

<option information> ::=
      'batch_insert_size' = <literal>
      | 'parquet_row_group_size' = <literal>

Incremental

INSERT INTO view
{ ON DUPLICATE KEY UPDATE |
  ON DUPLICATE KEY ( <field name> [, <field name>]* ) UPDATE }
  [ OPTIONS ( <option information> [ , <option information> ]* ) ]
<select load query>
[ TRACE ]

where select load query is:

  • when Custom Conditions:

<select load query> ::= select * from (<default load query> <conditions clause>)

depending on default load query having WHERE clause or not, conditions clause will be appended using AND or WHERE

<conditions clause> ::=  WHERE <conditions>
                            | AND <conditions>
<conditions> ::= <fieldName> > @LAST_REFRESH_DATE
                            | <custom_conditions>
                            | <fieldName> > @LAST_REFRESH_DATE AND <custom_conditions>
  • When Custom Load Query:

<select load query> ::= <custom load query>

There is also a field Load Process Name, where the user may specify a user-friendly name for each load process. This name will appear in the list of load processes to better identify them. By default, it is filled in with the name of the view to be preloaded (note that it is not possible to assign the same name to several load processes).

Example of the extraction section of a VDPDataLoad job

Example of the extraction section of a VDPDataLoad job

Summary Load Process

Summary Load Processes are similar to JDBC Load Process. The only difference is that when Incremental selector is enabled, the user only has the option to fill the Custom Load Query. No Custom Conditions or Field name are available.

The configuration of each load process consists of:

  • The VDP Database. Allows to specify the VDP database used to retrieve views in view combo box.

  • The VDP View to refresh or insert data. Summary views in this combo box are retrieved from the selected database in database combo box. When VDP database combo box is not available, views in this combo box are retrieved from the VDP data source configured for the job (all views from every database for which the user has permissions are listed, not only the ones from the database specified in the data source). Note that the corresponding VDP server must be running (there is a refresh button that allows reloading the views from the VDP data source at any moment).

  • The query Options. Allows to specify properties for the OPTIONS clause of the query sent to VDP, to configure the insertion. Some data sources do not support some parameters. The available insert options are:

    • batch_insert_size: the number of tuples of each batch insert.

    • parquet_row_group_size: the row group size in megabytes. The property is only valid for the data sources for which Virtual DataPort uses parquet files to insert the data: Athena, Hive, Impala, Presto, and Spark.

  • The Incremental selector.

    • If enabled, an incremental data load will be performed. The data obtained from the load query will be inserted in the selected view. Previous data in the view will not be deleted. Fields to define the load query will be available.

    • If disabled, all data of Summary view will be refreshed, using the default load query. That means all current data in the table will be deleted, and data obtained from default load query will be inserted.

Incremental load configuration

Following fields are available only when incremental selector is enabled:

  • The Custom Load Query.

    • By default, its value will be the load query used to create the Summary. The user can modify this default value to obtain the desired data. The result of this query will be the rows inserted in the Summary.

    • When defining the custom load query, OFFSET, LIMIT and FETCH clauses are not supported, since those clauses are not supported in an INSERT sentence.

  • An option to Enable update on duplicate keys. When enabled, the rows with the same values as the specified fields will be updated, instead of doing an insert. Optional Update on duplicate keys field allows to select those fields. If there are no values in this field, primary key is selected, when available.

Generated Queries

Given the value of these fields, the query sent to the VDP data source will be as follows:

Non incremental

REFRESH summary_view
[ OPTIONS ( <option information> [ , <option information> ]* ) ]
TRACE

where

<option information> ::=
      'batch_insert_size' = <literal>
    | 'parquet_row_group_size' = <literal>

Incremental

INSERT INTO summary_view
{ ON DUPLICATE KEY UPDATE |
  ON DUPLICATE KEY ( <field name> [, <field name>]* ) UPDATE }
  [ OPTIONS ( <option information> [ , <option information> ]* ) ]
<custom load query>
[ TRACE ]

There is also a field Load Process Name, where the user may specify a user-friendly name for each load process. This name will appear in the list of load processes to better identify them. By default, it is filled in with the name of the view to be preloaded (note that it is not possible to assign the same name to several load processes).

General Considerations

The meaning of the values of the “Extracted” and “Processed” columns of the table of jobs is explained below:

  • The “Extracted” column shows the number of tuples extracted from the queried VDP views using the defined load query.

  • The “Inserted” column shows the number of tuples inserted in the Remote/JDBC Table or Summary.

Both columns show the aggregation of the values of each individual load process. If there are some warnings or errors, you can finally see the individual report for that load process. Both columns, “Extracted” and “Inserted”, also show the number of errors produced during each process, i.e., the number of extraction errors and the number of insertion errors, respectively (as in the rest of job types, if there are no errors, then the number of errors is not shown).

Note

When the connection to VDP is lost, the user can save the job as draft in order to not lose the work.