Postprocessing Section (Exporters)¶
In Denodo Scheduler, different postprocessing actions can be defined on the tuples obtained. Each postprocessing action is composed of one filter sequence and one exporter of results. The filter sequence has to be created in advance (see section Filter Sequences). It allows applying different criteria to filter/transform the tuples before they are sent to the different exporters.
The following are the exporters provided by Denodo Scheduler, along with a short description of each (users can also create their own custom exporters; see section Exporters).
ARN-Index. Stores the extracted tuples in an index of the Aracne indexing server. To configure this exporter the following parameters need to be specified:
Data source. For selecting the ARN-Index data source associated with the index to which the tuples/documents will be exported.
Index name. Name of the ARN-Index server index in which the tuples/documents will be stored. The index has to exist before executing the job (see Administration of Indexes to get more information about how to create an index in the Denodo Aracne Administration Tool).
Note
When using this type of exporter, take into account that the index metadata has to exist before executing the job. When it is used within a VDPIndexer-type job, the index has to be configured with the globalsearch schema (there is one pre-created index globalsearch that meets this requirement).
Clear index. If checked, before proceeding to store new documents in the specified index, all documents associated with the identifier of the current job will be deleted.
See section Exporting to an index to get more information about how to configure this exporter when using the Global Search crawler.
CSV. Stores the tuples obtained in a plain text file delimited by the character specified in the Separator parameter. The user can specify in the Output file field the path and the name of the generated file. By default, it is generated in the directory
DENODO_HOME/work/scheduler/data/csv
with the name <PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_Group#<GID>_CSVExporter#<EID>_<JOB_START_TIME>.csv
, JOB_START_TIME being the moment of execution of the job expressed in milliseconds since January 1, 1970. Checking the Include header check box includes a first line in the file with the names of the fields of each exported tuple. If the Overwrite to old file radio button is ticked, the file name will be generated without the information of the moment of execution (without the JOB_START_TIME), so that multiple executions of the job will overwrite the same file instead of generating new files. Otherwise, if the Create new file radio button is checked, each execution will create a new file (by adding the JOB_START_TIME to the file name). Finally, if the Append to old file radio button is checked, the exported data will be appended to the existing file. Use the Encoding drop-down to select the encoding of the text file (with support for generating a file with BOM - Byte Order Mark - for the following encodings: UTF-8, UTF16-LE, UTF16-BE, UTF-32LE, UTF-32BE and GB-18030). Use the i18n drop-down to format dates and numbers according to the selected i18n configuration. It is also possible to specify associations between the field names in the tuples obtained by the job (parameter Document field) and the actual names of the fields in the CSV file (CSV column; this parameter is case sensitive). If associations have been specified, the Export only mapped attributes check box allows exporting only those fields. If you want to include the project name, job name, job identifier and job start time among the fields of the exported tuples, check Export job identifier, job name, project name and execution time fields. By selecting Create file even if no tuples are exported, the exporter will be allowed to create an empty file if there are no results to be exported.
Note
You can use the CSV exporter to generate files that can be directly opened with Microsoft Excel. In order to do that, you just need to specify ‘;’ as separator.
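For instance, assuming a hypothetical project sales with a job products (job identifier 5) that exports the fields id, name and price with ';' as separator and the Include header option checked, the generated file and its first lines could look like this (all names and values are merely illustrative):

sales_products_5_Group#0_CSVExporter#0_1672531200000.csv

id;name;price
1;Keyboard;19.90
2;Monitor;129.00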
Elasticsearch. Stores the extracted documents in an index of the Elasticsearch indexing server. To configure this exporter the following parameters need to be specified:
Data source. For selecting the Elasticsearch data source associated with the index to which the tuples/documents will be exported.
Index name. Name of the index where the tuples/documents will be stored.
Clear index. If checked, before proceeding to store new documents in the specified index, all documents associated with the identifier of the current job will be deleted.
Recreate index. If checked, it deletes the selected index and all its documents, and then creates the index again with the configuration specified in the following parameter.
Index Configuration. For specifying the parameters used to create the index. These parameters only take effect when the index is created. If the index already exists and ‘Recreate index’ is not checked, they have no effect.
Shards. Number of primary shards the index is composed of. A shard is a single Lucene instance managed automatically by Elasticsearch. By default, an index has 5 primary shards, distributed amongst all nodes in the cluster.
Replicas. Number of copies of the primary shards in the cluster. By default, there is no replication.
Analyzer. The analyzer to be used should be selected in accordance with the language expected for the documents. Stopwords (those words that are very common in the language) will be removed.
Bulk processor configuration. In order to process high volumes of data, the exporter groups the requests (documents to be indexed or deleted) in batches and sends all the requests of each batch to the Elasticsearch server at once. There are several parameters that configure when a batch is sent (a sketch of these conditions is shown after this list):
Document chunk size. Number of documents that make up a chunk of results. When this number of documents (to be indexed or deleted) is reached, the exporter will send them to the server, even if the other limits have not been reached (by default 10000).
Chunk size (MB). Size of the documents that make up a chunk of results. When the size of the documents to be indexed or deleted reaches this limit, the exporter sends them to the server, even if the other limits have not been reached (by default 50 MB).
Flush interval. Maximum time (in seconds) that the exporter will wait before sending the pending requests to the server. When this timeout is reached, it sends the pending requests to the server, even if the other limits have not been reached (by default 60 seconds).
Concurrency level. Sets the number of concurrent requests allowed to be executed by the exporter, that is, the number of batches that can be sent to the server concurrently (by default 5).
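The following is a minimal sketch in Java of the flush conditions described above (it is not Denodo code; the class and method names are made up for illustration). A pending batch would be sent to the Elasticsearch server as soon as any of the three limits is reached, and up to five batches may be in flight at the same time:

// Illustrative sketch of the bulk processor flush conditions (not actual Denodo code).
// The constants correspond to the default values of the parameters described above.
public class BulkFlushPolicySketch {

    private static final long MAX_DOCS = 10_000;              // Document chunk size
    private static final long MAX_BYTES = 50L * 1024 * 1024;  // Chunk size (MB)
    private static final long FLUSH_INTERVAL_MS = 60_000;     // Flush interval (60 seconds)
    private static final int CONCURRENT_BATCHES = 5;          // Concurrency level

    public static boolean shouldFlush(long pendingDocs, long pendingBytes, long millisSinceLastFlush) {
        // Send the pending requests as soon as any one of the limits is exceeded.
        return pendingDocs >= MAX_DOCS
            || pendingBytes >= MAX_BYTES
            || millisSinceLastFlush >= FLUSH_INTERVAL_MS;
    }
}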
File-Repository. Although it can be used with any type of job, this exporter is especially oriented to ARN jobs. It allows creating a local repository with the documents obtained by the crawler. The exporter stores the content of the tuple field specified in Content field on disk. The files will be stored in the directory
DENODO_HOME/work/scheduler/data/repository/<PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_<JOB_START_TIME>
. The relative path of each file inside that directory is specified by the value of the tuple field indicated in the Path field parameter. In ARN jobs, the binarydata field of the tuple contains the document obtained in the crawling, and the path field contains the path extracted for it from the website. Therefore, if we respectively specify “binarydata” and “path” as values for the Content field and Path field parameters, this exporter will create a local repository replicating the structure of the crawled website. If the Clean check box is ticked, the repository contents will be deleted before each execution of the job.
JDBC. Stores the tuples in a relational database table. To configure it, a JDBC-type Data source needs to be specified that accesses the relational database containing the table. The optional parameter Schema allows specifying the existing schema where the table is stored. The Table name parameter (case sensitive) specifies an existing table where the tuples will be inserted. By default, the destination table schema must have a field with a compatible data type for each job field. It is also possible to specify associations between the field names in the tuples obtained by the job (parameter Document field) and the actual names of the fields in the database table (Table column; this parameter is case sensitive). If associations have been specified, the Export only mapped attributes check box allows exporting only those fields. You can also include the project name, job name, job identifier and job execution time among the fields of the exported tuples by checking the Export job identifier, job name, project name and execution time fields check box. If you want the content of the table to be deleted prior to inserting the exported tuples, you can check the Delete table content option (in this case, the content of the table is deleted at the beginning of the job execution, not at the beginning of the exportation phase). By default, the JDBC exporter performs insertions, so if another tuple with the same values for the primary key already exists, it will result in an error. By checking the Update tuple if entry already exists in DB option, the JDBC exporter will update the tuple instead of inserting it (compound primary keys are allowed). Besides, if this option is not checked, each chunk of documents received by the exporter (whose size is determined by the parameter Batch size) is inserted in the database as a batch, using JDBC batch inserts (a sketch of this mechanism is shown after the error management options below). You can also configure Error Management. There are three options:
Never rollback. A rollback will never be performed regardless of any errors in the exportation. Every chunk of documents is exported in its own transaction, and the deletion of the previous content of the table (if that option is used) is also executed in a separate transaction. At the end of each chunk, the successfully inserted/updated tuples will be permanently stored in the database.
Rollback if the exportation of all documents fails. A rollback will be performed when all documents fail to be exported. The same transaction is used to export all the documents and to delete the previous content of the table (if that option is used).
Rollback if any exportation error occurs. Same as above, but a rollback is executed if an error occurs while deleting the previous content of the table (if that option is used), or the exportation of any document fails.
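As a reference, the JDBC batch insert mechanism mentioned above corresponds to the standard JDBC pattern sketched below (this is not the exporter's actual code; the connection URL, table and column names are hypothetical):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection; the real exporter obtains it from the configured JDBC data source.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/example", "user", "password")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                     conn.prepareStatement("INSERT INTO exported_tuples (id, name) VALUES (?, ?)")) {
                for (int i = 0; i < 100; i++) {   // one chunk of documents (size given by Batch size)
                    ps.setInt(1, i);
                    ps.setString(2, "value-" + i);
                    ps.addBatch();                // add the row to the pending batch
                }
                ps.executeBatch();                // insert the whole chunk in a single batch
            }
            conn.commit();                        // each chunk committed in its own transaction ("Never rollback")
        }
    }
}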
SQL. The SQL exporter functionality is similar to that of the JDBC exporter, but instead of inserting tuples into a relational database it generates a text file with the corresponding SQL INSERT statements. This way, the generated file can be used to load the data into a database at a later time. Unlike the JDBC exporter, it does not require a Data source to be specified, but it is possible to specify a database adapter (Database adapter) to take into account the differences in the SQL syntax used by different databases (currently a special adapter for Oracle and another for SQL Server are supported). The user can specify in the Output file field the path and the name of the generated file. By default, it will be generated in the
DENODO_HOME/work/scheduler/data/sql
directory with the name following the convention <PROJECT_NAME>_<JOB_NAME>_<JOB_ID>_Group#<GID>_SQLExporter#<EID>_<JOB_START_TIME>.sql
, JOB_START_TIME being the moment the job is executed expressed in milliseconds since January 1, 1970. If the Overwrite to old file radio button is ticked, the file name will be generated without information of the moment of execution (without the JOB_START_TIME), so that multiple executions of the job will overwrite the same file instead of generating new files. Otherwise, if the Create new file radio button is checked, each execution will create a new file (by adding the JOB_START_TIME to the file name). Finally, if the Append to old file radio button is checked, the exported data will be appended to the existing file. Use the Encoding drop-down to select the encoding of the text file (with support for generating a file with BOM - Byte Order Mark - for the following encodings: UTF-8, UTF16-LE, UTF16-BE, UTF-32LE, UTF-32BE and GB-18030). By selecting Create file even if no tuples are exported, the exporter will be allowed to create an empty file if there are no results to be exported.
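As an illustration, the content of the generated file could look like the following statements (the table and column names are merely hypothetical; the actual statements depend on the exported fields and the selected database adapter):

INSERT INTO exported_tuples (id, name, price) VALUES (1, 'Keyboard', 19.90);
INSERT INTO exported_tuples (id, name, price) VALUES (2, 'Monitor', 129.00);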
Note
The ARN-Index exporter does not support indexing of binary fields. In the case of CSV and SQL exporters, binary fields are exported as text encoded in base64.
Note
Compound data types are exported as text, formatted in XML.
Note
If using a JDBC exporter and the exportation of a tuple fails, the process will continue with the remaining tuples of the block, unless the exception is applicable to the whole set of tuples. In that case, the whole set will be discarded.
The report of the job will show the document identifier (tuple number) that failed and the reason.
Note
In the CSV and SQL exporters, the resultant files can be encrypted (using the MD5-DES cipher algorithm) or compressed (ZIP file).
Note
In the CSV and SQL exporters, besides the standard variables (projectName, jobName, jobID and jobStartTime), it is possible to specify any variable (following the syntax @{field:<DOCUMENT_FIELD_NAME>}) in the Output file parameter, where DOCUMENT_FIELD_NAME represents a field in the exported document. This way, according to the value of these variables, the exported documents will go to one file or another (the file name will depend on the values of the fields of the exported document). It is also possible to format date variables by specifying input and output date patterns (the input pattern is optional; if not specified, the date is assumed to be in milliseconds). The syntax is @{jobStartTime,outputDateFormat:<output_format>} for the standard variable jobStartTime and @{field:<DOCUMENT_FIELD_NAME>,inputDateFormat:<input_format>,outputDateFormat:<output_format>} for any other date field of the exported document, where, for instance, inputDateFormat could be yyyyMMdd and outputDateFormat “yyyy-MM-dd”.
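For instance, assuming a hypothetical document field named country, an Output file value such as the one below would send the documents of each country to a different file, with the job start date formatted in the file name (the sales_ prefix is merely illustrative):

DENODO_HOME/work/scheduler/data/csv/sales_@{field:country}_@{jobStartTime,outputDateFormat:yyyy-MM-dd}.csv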
Note
When exporting data to ODBC targets, the JDBC exporter used must have a data source configured with the “JDBC-ODBC Bridge” driver. If the target is an Excel sheet, the data source must be configured with the “Excel” adapter and the Table name field must be filled in with “[Sheet1$]”. Both drivers are distributed with Denodo Scheduler (see table JDBC Drivers). When working with Excel sheets it is important to take into account some limitations (see https://support.office.com/en-us/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3).
Note
Consecutive exporters using the same filter sequence are grouped together. This is the recommended way of adding exporters to a job, in order to execute the filter sequence once for the whole group. If two exporters share the same filter sequence but they are not consecutive (other exporters with a different filter sequence have been created between them), then they will not be in the same group.