General Architecture

Denodo Scheduler is a tool for time-based scheduling of automatic data extraction jobs against different data sources. In particular, it allows different extraction jobs to be defined through its Web administration tool, stores this configuration persistently, and schedules the execution of these jobs against the corresponding data servers as desired.

Denodo Scheduler allows extraction jobs to be defined against various modules of the Denodo Platform. It also allows data to be extracted from relational databases via JDBC.

Denodo Scheduler can apply different filtering algorithms to the extracted data and export the results in different formats and to different repositories.

At the core of the system are the extraction jobs that can be defined for the different components of the Denodo Platform.

  • Denodo Aracne. Two types of jobs can be defined for this module: crawling and index maintenance, which are performed on the Denodo Aracne crawling and indexing servers, respectively.

    The crawling jobs (ARN) allow data to be collected from unstructured sources. The following subtypes of jobs are available:

    • WebBot and IECrawler crawl the Web's hypertext structure: starting from a group of initial URLs, they recursively retrieve all the pages accessible from that group. They can also connect to an FTP server and obtain the information contained in all the files and subdirectories of the directory specified as the initial URL. Documents in multiple languages are supported.
    • WebBot is also capable of exploring a file system (including shared folders), taking a directory as the initial URL and extracting the data contained in all its files and subdirectories.
    • Global Search Crawler is able to crawl the Virtual DataPort RESTful web service (see the section RESTful Architecture of the Virtual DataPort Administration Guide). This allows indexing and searching the information contained in a Virtual DataPort database.
    • POP3/POP3S/IMAP/IMAPS Crawler. Allows retrieval of data from e-mails stored on servers accessible via the POP3, POP3S, IMAP or IMAPS protocols, including support for attached files.
    • Salesforce.com Crawler. Allows the retrieval of data contained in Salesforce.com entities.
    • CustomCrawler allows data to be extracted from a data source through a Java implementation provided by the Denodo Aracne administrator, enabling a crawler to be built ad hoc for a specific source.

    Index maintenance jobs (ARN-INDEX) allow the automatic maintenance of existing indexes by deleting documents that are old, obsolete, no longer accessible, etc.

  • Denodo ITPilot (ITP). Executes queries on wrappers from Denodo ITPilot to obtain structured data from Web sources.

  • Denodo Virtual DataPort (VDP). Executes queries on wrappers and views defined in Virtual DataPort to obtain data resulting from the integration of data that can come from dispersed and heterogeneous sources.

  • It is also possible to define a JDBC-type job that explores the tables specified in a database and retrieves the data they contain.

On a general level, for every job it is possible to configure its time-based scheduling (when and how often it should be executed), various types of filters for post-processing the data retrieved by the system, and the way in which the results obtained by the job will be exported. The available exporters are:

  • Dumping the final results in a database.
  • Indexing the final results in the Aracne indexing server.
  • Dumping the final results in a CSV-type file (it can also be used to generate MS-Excel compliant files).
  • Dumping the final results in a SQL-type file.

It also allows the programmer to create new exporters for ad-hoc needs.
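The exporter concept can be pictured as a simple interface that receives the job's result tuples and writes them to a destination. The sketch below is illustrative only: the `Exporter` and `CsvExporter` names are hypothetical, and Denodo's real custom exporters are written against its own Java extension API, not this Python interface.

```python
# Illustrative sketch of the exporter concept: each exporter receives the
# job's result tuples and writes them somewhere. The interface shown is
# hypothetical; it does not reflect Denodo's actual Java extension API.
import csv
import io
from abc import ABC, abstractmethod

class Exporter(ABC):
    @abstractmethod
    def export(self, rows: list[dict]) -> None: ...

class CsvExporter(Exporter):
    """Dumps results as CSV, one column per tuple field."""
    def __init__(self, stream):
        self.stream = stream

    def export(self, rows: list[dict]) -> None:
        # Assumes a non-empty result set; field names come from the first row.
        writer = csv.DictWriter(self.stream, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

buffer = io.StringIO()
CsvExporter(buffer).export([
    {"taxid": "B11111111", "name": "ACME"},
    {"taxid": "B22222222", "name": "Initech"},
])
print(buffer.getvalue())
```

A database or index exporter would implement the same `export` method against a different destination, which is why new exporters can be plugged in for ad-hoc needs.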

In the figure Denodo Scheduler Architecture, the server's basic architecture is shown. In addition to the jobs and filters, Scheduler lets users define the data sources used by the extraction jobs and by the exporters. Denodo Scheduler allows data sources to be defined for the different components of the Denodo Platform (ARN, VDP, and ITP), for relational databases, and for delimited files.

In the case of ITP-, VDP-, and JDBC-type jobs, it is possible to specify a query parameterized by a series of variables, along with the possible values for these variables, so that several queries are executed against the corresponding server.
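As a rough illustration of this expansion, the sketch below turns one parameterized query template into several concrete queries, one per combination of variable values. The `@VARIABLE` placeholder syntax and the assumption of one query per combination are my own for illustration, not Scheduler's actual query syntax.

```python
# Sketch: expand a parameterized query over all values of its variables.
# The @VARIABLE template syntax is illustrative only, not Scheduler's own.
from itertools import product

def expand_query(template: str, variables: dict[str, list[str]]) -> list[str]:
    """Return one concrete query per combination of variable values."""
    names = list(variables)
    queries = []
    for combo in product(*(variables[n] for n in names)):
        query = template
        for name, value in zip(names, combo):
            query = query.replace("@" + name, value)
        queries.append(query)
    return queries

template = "SELECT * FROM customer WHERE taxid = '@TAXID'"
queries = expand_query(template, {"TAXID": ["B11111111", "B22222222"]})
for q in queries:
    print(q)
```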

Denodo Scheduler Architecture

The following briefly describes two typical examples of the use of Denodo Scheduler.

Example 1: extracting structured data from the Web with ITPilot

Suppose you want to periodically extract information about customers accessible via a corporate Web site. The Web site offers a query form in which a customer's Tax ID is specified, and it returns the relevant information about that customer. The list of all the Tax IDs to be queried is available in an internal database accessible via JDBC, and the extracted data must be dumped into another internal database, also accessible via JDBC. The steps to carry out this job with the Denodo Platform are as follows:

  1. Create a new ITPilot wrapper that automates the operation of obtaining a customer's data from the corporate Web site. The wrapper will receive a customer's Tax ID as a mandatory parameter, automatically execute the query on the Web site, and extract the desired results.
  2. Add a new JDBC-type data source to Scheduler to access the database that contains the Tax IDs of the required customers (see section JDBC Data Sources to find out how to add JDBC data sources).
  3. Add another new JDBC data source to Scheduler to access the database into which the extracted data will be dumped.
  4. Create an ITP-type job in Scheduler (see section Configuring New Jobs). The ITP job will query the wrapper, supplying the different values for the Tax ID attribute. These values are obtained by running a query on the JDBC data source defined in step 2. When the job is executed, the ITPilot wrapper is invoked once for each Tax ID sought.
  5. Create a JDBC-type exporter for the ITP job (see section Postprocessing Section (Exporters)). This exporter will use the JDBC data source defined in step 3.
  6. Finally, configure the frequency with which you want to execute the job in Scheduler (see section Time-based Job Scheduling Section).
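The data flow of the steps above can be sketched end to end. The sketch below uses in-memory SQLite databases in place of the JDBC sources and a stub function in place of the ITPilot wrapper, so every table, column, and function name here is an assumption for illustration, not part of the Denodo Platform.

```python
# End-to-end sketch of Example 1: read Tax IDs from a source database,
# invoke a (stubbed) wrapper once per Tax ID, and dump the results into
# a target database. All names are illustrative; the real wrapper is an
# ITPilot component, not a Python function.
import sqlite3

def wrapper_stub(taxid: str) -> dict:
    """Stands in for the ITPilot wrapper that queries the Web form."""
    return {"taxid": taxid, "name": f"Customer {taxid}"}

source = sqlite3.connect(":memory:")   # step 2: DB holding the Tax IDs
source.execute("CREATE TABLE customer_ids (taxid TEXT)")
source.executemany("INSERT INTO customer_ids VALUES (?)",
                   [("B11111111",), ("B22222222",)])

target = sqlite3.connect(":memory:")   # step 3: DB receiving the results
target.execute("CREATE TABLE customers (taxid TEXT, name TEXT)")

# Step 4: one wrapper invocation per Tax ID; step 5: export each record.
for (taxid,) in source.execute("SELECT taxid FROM customer_ids"):
    record = wrapper_stub(taxid)
    target.execute("INSERT INTO customers VALUES (?, ?)",
                   (record["taxid"], record["name"]))

count = target.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)
```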

Example 2: crawling, filtering and indexing of unstructured data with Denodo Aracne

Suppose you want to periodically explore a particular Web site to download all the documents relevant to a specific topic. The new documents found must be dumped into an index that a search engine will then use to perform complex Boolean, keyword-based searches. The steps required to carry out this job with the Denodo Platform are the following:

  1. Create a WebBot- or IECrawler-type ARN job (see section Configuring New Jobs). This job will crawl the desired Web site, downloading all the documents found.
  2. Create a sequence of filters for post-processing the documents obtained by crawling. For example, you can use the Boolean content filter (see section Boolean Content Filter) to retain only those documents containing certain keywords relevant to the desired topic, the uniqueness filter (see section URL Unicity and Standardization Filters) to discard duplicate documents, and the content filter (see section Content Extraction Filter (HTML, PDF, Word, Excel, PowerPoint, XML, EML, and Text)) to index only the textual content of the documents (discarding the HTML markup and the JavaScript code of each page).
  3. Create an Aracne index-type exporter for the job (see section Postprocessing Section (Exporters)). Thus, the documents will be indexed so their content can be searched.
  4. Finally, configure the frequency with which you want to execute the job in Scheduler (see section Time-based Job Scheduling Section).
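The filter sequence in step 2 can be pictured as a pipeline applied to the crawled documents. The functions and document structure below are simplified stand-ins of my own, not the actual Aracne filter implementations.

```python
# Illustrative filter pipeline for Example 2: a Boolean content filter
# keeps only documents mentioning required keywords, and a uniqueness
# filter drops URL duplicates. Simplified stand-ins for the real
# Aracne filters.
def boolean_content_filter(docs, keywords):
    """Keep documents whose content mentions at least one keyword."""
    return [d for d in docs if any(k in d["content"].lower() for k in keywords)]

def unicity_filter(docs):
    """Drop documents whose URL has already been seen."""
    seen, unique = set(), []
    for d in docs:
        if d["url"] not in seen:
            seen.add(d["url"])
            unique.append(d)
    return unique

crawled = [
    {"url": "http://example.com/a", "content": "Report on solar energy"},
    {"url": "http://example.com/a", "content": "Report on solar energy"},
    {"url": "http://example.com/b", "content": "Unrelated press release"},
]

filtered = unicity_filter(boolean_content_filter(crawled, ["solar"]))
print(len(filtered))
```

Only the documents that survive the whole chain would then be handed to the index-type exporter of step 3.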