Exporters Section

In Denodo Scheduler, different export actions can be defined on the obtained tuples (except in VDPCache and VDPDataLoad jobs, which do not have an exporters section).

This section has a global option Export as transaction, which only affects JDBC exporters. When enabled, the export to all JDBC exporters (if any) is treated as a single transaction, so all target tables will contain the same data once the job is finished. If an error occurs in any of them, the export to all JDBC exporters (other exporter types are not affected) is rolled back, and each one returns to the state it was in at the beginning of the execution. The new data is committed to the JDBC exporters only when no error occurs.
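Conceptually, the behavior resembles the following simplified sketch. This is an illustration only, not Scheduler's actual implementation; the JdbcExporter interface and method names are hypothetical stand-ins:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for a configured JDBC exporter.
    interface JdbcExporter {
        Connection openConnection() throws SQLException;
        void export(Connection conn) throws SQLException; // writes the job's tuples
    }

    class TransactionalExportSketch {
        static void exportAsTransaction(List<JdbcExporter> exporters) throws SQLException {
            List<Connection> conns = new ArrayList<>();
            try {
                for (JdbcExporter exporter : exporters) {
                    Connection conn = exporter.openConnection();
                    conn.setAutoCommit(false);   // defer commit: all targets form one unit
                    conns.add(conn);
                    exporter.export(conn);       // may throw on any failure
                }
                for (Connection conn : conns) {
                    conn.commit();               // commit only when every target succeeded
                }
            } catch (SQLException e) {
                for (Connection conn : conns) {
                    try { conn.rollback(); } catch (SQLException ignored) { }
                }
                throw e;                         // every JDBC target reverts to its initial state
            } finally {
                for (Connection conn : conns) {
                    try { conn.close(); } catch (SQLException ignored) { }
                }
            }
        }
    }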

Scheduler provides the following exporters:

  • Scheduler Index. Stores the extracted tuples in an index of the Scheduler indexing server. To configure this exporter, the following parameters need to be specified:

    • Data source. For selecting the Scheduler index data source associated with the index to which the tuples/documents will be exported.

    • Index name. Name of the Scheduler Index server index in which the tuples/documents will be stored.

    • Clear index. If checked, before proceeding to store new documents in the specified index, all documents associated with the identifier of the current job will be deleted.

    Note

    In Denodo 6.0 and earlier, these exporters were called ARN-Index.

  • Elasticsearch. Stores the extracted documents in an index of the Elasticsearch indexing server. To configure this exporter, the following parameters need to be specified:

    • Data source. For selecting the Elasticsearch data source associated with the index to which the tuples/documents will be exported.

    • Index name. Name of the index where the tuples/documents will be stored.

    • Clear index. If checked, before proceeding to store new documents in the specified index, all documents associated with the identifier of the current job will be deleted.

    • Recreate index. If checked, the selected index and all its documents are deleted, and the index is then created again with the configuration specified in the following parameter.

    • Index Configuration. For specifying the parameters used to create the index. These parameters only take effect when the index is created; if the index already exists and ‘Recreate Index’ is not checked, they have no effect.

      • Shards. Number of primary shards the index is composed of. A shard is a single Lucene instance managed automatically by Elasticsearch. By default, an index has 5 primary shards, distributed amongst all nodes in the cluster.

      • Replicas. Number of copies of the primary shards in the cluster. By default, there is no replication.

      • Analyzer. The analyzer to be used should be selected in accordance with the language expected for the documents. Stopwords (those words that are very common in the language) will be removed.

    • Bulk processor configuration. In order to process high volumes of data, the exporter groups the requests (documents to be indexed or deleted) into batches and sends all the requests of each batch to the Elasticsearch server at once. Several parameters configure when a batch is sent (see the sketch after this list):

      • Document chunk size. Number of documents that make up a chunk of results. When this number of documents (to be indexed or deleted) is reached, the exporter sends them to the server, even if the other thresholds have not been reached (by default 10000).

      • Chunk size (MB). Size of the documents that make up a chunk of results. When the size of the documents to be indexed or deleted reaches this limit, the exporter sends them to the server, even if the other thresholds have not been reached (by default 50 MB).

      • Flush interval. Maximum time (in seconds) that the exporter will wait before sending the pending requests to the server. When this timeout is reached, it sends the pending requests to the server, even if the other thresholds have not been reached (by default 60 seconds).

      • Concurrency level. Sets the number of concurrent requests allowed to be executed by the exporter, that is, the number of batches that can be sent to the server concurrently (by default 5).
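    The batching behavior described above maps closely onto the BulkProcessor API of the Elasticsearch Java client. The following sketch shows how the four parameters relate to that API, using the default values mentioned above. This is an illustration, not Scheduler's internal code: 'client' stands for an already-initialized Elasticsearch client, and package names and builder signatures vary between client versions.

        import org.elasticsearch.action.bulk.BulkProcessor;
        import org.elasticsearch.action.bulk.BulkRequest;
        import org.elasticsearch.action.bulk.BulkResponse;
        import org.elasticsearch.client.Client;
        import org.elasticsearch.common.unit.ByteSizeUnit;
        import org.elasticsearch.common.unit.ByteSizeValue;
        import org.elasticsearch.common.unit.TimeValue;

        public class BulkSketch {
            static BulkProcessor buildProcessor(Client client) {
                // Listener invoked before and after each batch is sent.
                BulkProcessor.Listener listener = new BulkProcessor.Listener() {
                    @Override public void beforeBulk(long id, BulkRequest request) { }
                    @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) { }
                    @Override public void afterBulk(long id, BulkRequest request, Throwable failure) { }
                };
                return BulkProcessor.builder(client, listener)
                        .setBulkActions(10000)                               // Document chunk size
                        .setBulkSize(new ByteSizeValue(50, ByteSizeUnit.MB)) // Chunk size (MB)
                        .setFlushInterval(TimeValue.timeValueSeconds(60))    // Flush interval
                        .setConcurrentRequests(5)                            // Concurrency level
                        .build();
            }
        }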

  • CSV. Stores the tuples obtained in a plain text file delimited by the character specified in the Separator parameter. The following options are available:

    • Output file. The path and name of the generated file. By default, it is generated in the directory <DENODO_HOME>/work/scheduler/data/csv with the name <PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_Group#<GID>_CSVExporter#<EID>_<JOB_START_TIME>.csv, JOB_START_TIME being the moment of execution of the job expressed in milliseconds since January 1, 1970.

    • Include header. If checked, a first line is included in the file with the name of the fields of each exported tuple.

    • If the Overwrite to old file option is chosen, the file name is generated without information about the moment of execution (without the JOB_START_TIME), so that multiple executions of the job overwrite the same file instead of generating new files. If the Create new file option is chosen, each execution creates a new file (by adding the JOB_START_TIME to the file name). Finally, if the Append to old file option is chosen, the exported data is appended to the existing file.

    • Encoding. Use this drop-down to select the encoding of the text file (with support for generating a file with a BOM (Byte Order Mark) for the following encodings: UTF-8, UTF16-LE, UTF16-BE, UTF-32LE, UTF-32BE and GB-18030).

    • i18n. Use this drop-down to format dates and numbers according to the selected locale.

    • Export job identifier, job name, project name and execution time fields. Check this check box to include the project name, job name, job identifier and job start time among the fields of the exported tuples.

    • Quote Options. By default, this exporter follows RFC 4180, so it encloses values between double quotes when required (a sketch illustrating this policy appears after the note below). It is possible to change this policy to encapsulate all fields in double quotes (Add quotation marks for all fields) or to never add them (Never add quotation marks).

    Note

    You can use the CSV exporter to generate files that can be directly opened with Microsoft Excel. To do that, you just need to specify ‘;’ as the separator.
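    As an illustration of the default quoting policy, the following minimal sketch (not Scheduler's actual code) shows when a value is enclosed in double quotes under RFC 4180:

        // A field is quoted when it contains the separator, a double quote or a
        // line break; embedded double quotes are escaped by doubling them.
        public class Rfc4180Quoting {
            static String quote(String field, char separator) {
                boolean needsQuotes = field.indexOf(separator) >= 0
                        || field.indexOf('"') >= 0
                        || field.indexOf('\n') >= 0
                        || field.indexOf('\r') >= 0;
                return needsQuotes ? '"' + field.replace("\"", "\"\"") + '"' : field;
            }

            public static void main(String[] args) {
                System.out.println(quote("plain", ';'));      // plain
                System.out.println(quote("a;b", ';'));        // "a;b"
                System.out.println(quote("say \"hi\"", ';')); // "say ""hi"""
            }
        }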

  • JDBC. Stores the tuples in a relational database table. To configure it, a JDBC-type Data source needs to be specified that accesses the relational database containing the table. The optional parameter Catalog allows specifying the existing catalog where the schema is stored (this should only be set when the database to which the information is to be exported supports catalogs, such as SQL Server). Another optional parameter is Schema, which allows you to specify the existing schema where the table is stored. The Table name parameter (case sensitive) specifies an existing table where the tuples will be inserted. By default, the destination table schema must have a field with a compatible data type for each job field. You can also include the project name, job name, job identifier and job execution time among the fields of the exported tuples by checking the Export job identifier, job name, project name and execution time fields check box. If you want the content of the table to be deleted prior to inserting the exported tuples, you can check the Delete table content option (in this case, the content of the table is deleted at the beginning of the job execution, not at the beginning of the exportation phase). By default, the JDBC exporter performs insertions, so if another tuple with the same values for the primary key already exists, it will result in an error. By checking the Update tuple if entry already exists in DB option, the JDBC exporter will update the tuple instead of inserting it (compound primary keys are allowed). If this option is not checked, each chunk of documents received by the exporter (whose size is determined by the Batch size parameter) is inserted into the database as a batch, using JDBC batch inserts (see the sketch after the error management options below). You can also configure Error Management. There are three options:

    • Never rollback. A rollback will never be performed, regardless of any errors in the exportation. Every chunk of documents is exported in its own transaction, and the deletion of the previous content of the table (if that option is used) is also executed in a separate transaction. At the end of each chunk, successfully inserted/updated tuples are permanently stored in the database.

    • Rollback if the exportation of all documents fails. A rollback will be performed when all documents fail to be exported. The same transaction is used to export all the documents and to delete the previous content of the table (if that option is used).

    • Rollback if any exportation error occurs. Same as above, but a rollback is executed if an error occurs while deleting the previous content of the table (if that option is used), or the exportation of any document fails.

    Note

    If Export as transaction is enabled, the Error Management options of each individual JDBC exporter are overridden.
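    For reference, the batch insertion mentioned above corresponds to standard JDBC batching, as in this sketch (the table and column names are hypothetical; Scheduler builds the real statement from the job's fields):

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.SQLException;
        import java.util.List;

        public class BatchInsertSketch {
            // One INSERT is prepared once; each tuple of the chunk is queued with
            // addBatch() and the whole chunk is sent in a single executeBatch() call.
            static void insertChunk(Connection conn, List<Object[]> chunk) throws SQLException {
                String sql = "INSERT INTO target_table (field1, field2) VALUES (?, ?)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    for (Object[] tuple : chunk) {
                        ps.setObject(1, tuple[0]);
                        ps.setObject(2, tuple[1]);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                }
            }
        }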

  • Excel. Stores the tuples obtained in an Excel 2007 OOXML (.xlsx) file. The user can specify in the Output file field the path and the name of the generated file. By default, it is generated in the directory <DENODO_HOME>/work/scheduler/data/excel with the name <PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_Group#<GID>_ExcelExporter#<EID>_<JOB_START_TIME>.xlsx, JOB_START_TIME being the moment of execution of the job expressed in milliseconds since January 1, 1970. The user can specify the name of the Excel sheet in the Sheet name field; by default, the sheet name will be Sheet1. Checking the Include header check box includes a first line in the file with the name of the fields of each exported tuple. If the Overwrite to old file option is chosen, the file name is generated without information about the moment of execution (without the JOB_START_TIME), so that multiple executions of the job overwrite the same file instead of generating new files. If the Create new file option is chosen, each execution creates a new file (by adding the JOB_START_TIME to the file name). Finally, if the Append to old file option is chosen, the exported data is appended to the existing file. Use the i18n drop-down to format dates and numbers according to the selected locale.

    • Export numbers as. This selector allows configuring the Excel format that should be used to export numeric values. The user can select one of the following options:

      • General

      • Number. If you choose this option, you should specify in the Number of decimals field how many decimals will be displayed in the cell for decimal values.

      • Text

      Note

      If you export numbers with General or Number formats, Excel will only store the 15 most significant figures. If you want to preserve all figures, you should export them as Text.

    • Date pattern. The pattern to be applied to values of type date. If the field is left empty, there are two options:

      • If the user selects an i18n that defines a value for the field datesubtypepattern, datesubtypepattern will be used.

      • If no i18n is selected, the default i18n obtained from the VDP data source will be used.

    • Time pattern. The pattern to be applied to values of type time. If the field is left empty, there are two options:

      • If the user selects an i18n that defines a value for the field timesubtypepattern, timesubtypepattern will be used.

      • If no i18n is selected, the default i18n obtained from the VDP data source will be used.

    • Timestamp pattern. The pattern to be applied to values of type timestamp. If the field is left empty, there are two options:

      • If the user selects an i18n that defines a value for the field datepattern, datepattern will be used.

      • If no i18n is selected, the default i18n obtained from the VDP data source will be used.

    Note

    The syntax used in date, time and timestamp patterns is the same as the Java date patterns detailed in the section Date and Time Pattern Strings.

    Finally, if you want to include the project name, job name, job identifier and job start time among the fields of the exported tuples, check the last check box (Export job identifier, job name, project name and execution time fields).
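    For example, the same pattern syntax accepted by Java's SimpleDateFormat can be used in the Date, Time and Timestamp pattern fields; this minimal sketch uses yyyy-MM-dd HH:mm:ss as an illustrative timestamp pattern:

        import java.text.SimpleDateFormat;
        import java.util.Date;

        public class PatternExample {
            public static void main(String[] args) {
                // "yyyy-MM-dd HH:mm:ss" is just an example pattern.
                SimpleDateFormat timestampFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                System.out.println(timestampFormat.format(new Date())); // e.g. 2024-01-31 17:45:09
            }
        }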

  • Custom Exporters. Users can also create their own custom exporters. See section Exporters.

The list of available exporters for a job depends on the type of job:

Available exporters for each type of job

  Exporter/Job      VDP   VDPIndexer   VDPCache   VDPDataLoad
  Scheduler Index   ✓     ✓            N/A        N/A
  Elasticsearch     ✓     ✓            N/A        N/A
  CSV               ✓     ✓            N/A        N/A
  JDBC              ✓     ✓            N/A        N/A
  Excel             ✓     ✓            N/A        N/A
  Custom            ✓     ✓            N/A        N/A

Note

The Scheduler Index exporter cannot index binary fields. The CSV and Excel exporters export binary fields as text encoded in Base64.

Note

Compound data types are exported as text, formatted in XML.

Note

If using a JDBC exporter and the exportation of a tuple fails, the process will continue with the remaining tuples of the block, unless the exception applies to the whole set of tuples, in which case they will all be discarded.

The report of the job will show the document identifier (tuple number) that failed and the reason.

Note

About CSV and Excel exporters:

  • The resultant files can be encrypted (using the MD5/DES or SHA-512/AES-256 cipher algorithm for CSV, and the SHA-512/AES-256 cipher for Excel) or compressed (optionally, you can configure the exporter to generate the ZIP file protected by a password).

  • Besides the standard variables (projectName, jobName, jobID and jobStartTime), it is possible to specify any variable (following the syntax @{field:<DOCUMENT_FIELD_NAME>}) in the Output file parameter, where DOCUMENT_FIELD_NAME represents a field in the exported document. This way, depending on the values of these variables, exported documents can be routed to different files (the file name depends on the field values of each exported document) or to different sheets in the same file. It is also possible to format date variables by specifying input and output date patterns (the input pattern is optional; if not specified, the date is assumed to be in milliseconds). The syntax is @{jobStartTime,outputDateFormat:<output_format>} for the standard variable jobStartTime and @{field:<DOCUMENT_FIELD_NAME>,inputDateFormat:<input_format>,outputDateFormat:<output_format>} for any other date field of the exported document, where, for instance, inputDateFormat could be yyyyMMdd and outputDateFormat yyyy-MM-dd. See the example after this list.

  • By default, if no tuples are exported, the output files are deleted (unless they already exist from previous executions). If you want to keep the file even though it is empty, check the Create file even if no tuples are exported field.
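  For instance, assuming a hypothetical document field named region, an Output file value such as

      <DENODO_HOME>/work/scheduler/data/csv/sales_@{field:region}_@{jobStartTime,outputDateFormat:yyyy-MM-dd}.csv

  would write the documents whose region field contains EMEA, exported by a job started on January 31, 2024, to the file sales_EMEA_2024-01-31.csv.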

Note

When exporting data to ODBC targets, the JDBC exporter used must have a data source configured with the “JDBC-ODBC Bridge” driver. If the target is an Excel sheet, the data source must be configured with the “Excel” adapter and the Table name field must be filled in with “[Sheet1$]”. Both drivers are distributed with Denodo Scheduler (see table JDBC Drivers). When working with Excel sheets it is important to take into account some limitations (see https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3).
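For illustration, writing through the bridge behaves like ordinary JDBC, with the sheet referenced as the table name. This is a hypothetical sketch: the DSN name (excel_dsn) and the column names are examples, and the exact connection URL depends on how the data source is configured in Scheduler.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ExcelSheetInsert {
        public static void main(String[] args) throws SQLException {
            // "jdbc:odbc:excel_dsn" assumes an ODBC DSN pointing at the workbook;
            // the Excel sheet is addressed as the table [Sheet1$].
            try (Connection conn = DriverManager.getConnection("jdbc:odbc:excel_dsn");
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO [Sheet1$] (col1, col2) VALUES (?, ?)")) {
                ps.setString(1, "value1");
                ps.setString(2, "value2");
                ps.executeUpdate();
            }
        }
    }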
