Generation Tool Global Preferences

The dialog opened by selecting Tools->Preferences from the main menu of the tool provides a set of parameters that globally affect process generation. The dialog has the following tabs:

  1. General

    1. Default Wrapper Locale: sets the default locale assigned to a wrapper when it is created. The Init component of the wrapper uses this locale to type the values of the wrapper input parameters, and it is also the locale assigned to the base views when wrappers are deployed to the Virtual DataPort server (see Deploying Wrappers to the Wrapper Server). After the wrapper is created, this value can be modified through the “Options” button in the button area (at the lower right corner of the working area). Please see section Locale for more information.
    2. Default locale for Extractor components: sets the default locale assigned automatically to Extractor components when they are created. This value can be overridden for a specific wrapper (useful when all or most of its Extractor components require a different locale) through the “Options” button in the button area (please see section Locale for more information). The locale of an individual Extractor component can be modified, after the component is created, through its wizard (see Locale Configuration of the Extractor Component).
    3. Auto-generated Scanner Configuration: as explained in section Tagsets and Scanners, ITPilot requires a scanner to extract data from web pages. ITPilot provides a graphical scanner generation tool, but it can also generate a scanner automatically from the page characteristics and the examples the user provides to generate a specification. Scanners are generated from what are known as “lexer types” or “skeletons”. In the Preferences dialog, the user can choose the lexer type used as a base for auto-generated scanners. Most applications will not need to modify the default lexer type. Please see section Tagsets and Scanners Included in the Distribution for a description of the available lexer types and guidance on when to use each one.
    4. Temp directory: specifies a directory for storing temporary data, for example the HTML documents generated by the PDF, Word and Excel converters. The SaveFile component also saves files to this directory if no particular directory is specified in the component configuration.
  2. Browser pool. The generation tool uses a browser management system, or “pool”, to test the generated wrapper. The configurable features are the following:

    1. Configuration.

      1. Default type of browser. It defines the type of browser to be used (see section Comparison Between MSIE and Denodo Browser for more information about when to use a particular type of browser):

        • MSIE Browser: Microsoft Internet Explorer browser.
        • Denodo Browser: Denodo HTTP client able to manage JavaScript.

        The type of browser will determine what settings are available to configure.

    2. Browser controls.

      1. Maximum download time. Maximum time (in milliseconds) that a browser will wait to download a page.
      2. Object timeout. Maximum time (in milliseconds) that a browser can be in use outside of the pool to serve a request of a web process. Once that time has elapsed, the browser is destroyed. If the value of this parameter is less than 0, the browser can stay outside of the pool indefinitely.
      3. Show graphical interface (only for MSIE Browser). If selected, the browsers will show their graphical interface. To optimize the efficiency of the system, an application in a production environment should not show the browser graphical interface. However, it can be useful to change the value of this option for debugging purposes.
      4. Silent (only for MSIE Browser). If selected, the browser will automatically close all the JavaScript dialogs that appear during the execution of a navigation sequence.
    3. Download controls. This set of parameters specifies what kinds of content may be downloaded by the browsers of the pool. The content types whose download can be configured are:

      1. MSIE Browser: images, videos, background sounds, script programs, Java applets and ActiveX components.
      2. Denodo Browser: script programs.
    4. Cache controls. This group of parameters specifies whether or not the browser pool should use the local cache and, in the case of MSIE Browser, the proxy cache. In the case of Denodo Browser, the maximum number of JavaScript files kept in the cache can be configured by means of the “Maximum number of cached JavaScript files” parameter.

    5. Proxy: this set of parameters configures browsers that must access the Internet through a proxy server. Specifically, the following parameters can be configured:

      1. Login: proxy user.
      2. Password: password for the proxy user.
      3. Domain (Windows 2000): Windows domain.
    6. Pool Size and reutilization policy: this area allows users to configure the pool size and the browser reuse strategy.

      1. Max. Pool Size: maximum number of browsers in the pool.
      2. Min. Pool Size: minimum number of browsers. The system will not reuse browsers already existing in the pool, unless the current number is equal to or greater than the value of this parameter.
      3. Max. Browser TTL: maximum Time to Live of a persistent browser. If a persistent browser is active for longer than the specified time, it will be removed and a new one will be created with the same page loaded as the former browser. This is useful because, due to known problems in some versions of Microsoft Internet Explorer, performance may degrade when a browser of this type has been open for too long. This option only applies to persistent browsers, not regular ones (see Create Persistent Browser for details about persistent browsers).
      4. Reusable Browsers: indicates whether the browsers of the pool can be reused to serve more than one request. Enabling browser reusability increases the efficiency of most applications; however, it may not be suitable in cases where serving a previous request changes the browser response to subsequent requests (for example, through the use of cookies).
  3. Document conversion: the installation process of the Denodo Platform allows configuring the converters from PDF and Word/Excel to HTML. It is also possible to configure these same parameters from the preferences dialog:
    1. PDF to HTML Converter: selects the PDF-to-HTML converter to be used by default. The types of converters currently available are:
      1. Acrobat Text: uses the plain text conversion tool of the Adobe Acrobat Professional software, from which ITPilot generates an HTML file (the Adobe product must be installed).
      2. Acrobat HTML: uses the HTML conversion tool of the Adobe Acrobat Professional software (the Adobe product must be installed).
      3. PDFBox 0.7.3 and PDFBox 1.x: PDFBox (Apache PDFBox - A Java PDF Library) is used to generate the HTML page. Version 1.x of PDFBox (currently version 1.6) is available since ITPilot 4.7; version 0.7.3 was included in previous releases of ITPilot and is still available.
    2. Conversion server port: port on which the PDF conversion server listens. Default: 8448.
    3. Acrobat Prof. Plugin directory: location of the Adobe Acrobat Professional plugins directory. The plugins required to use Acrobat Pro’s conversion capabilities are stored in this directory.
    4. OpenOffice Directory: location of the OpenOffice installation, used in the Microsoft Word and Microsoft Excel to HTML conversion.
    5. Remove the temporal files generated by the PDF, Word and Excel converters: if selected, the temporary HTML files generated in document conversions are deleted when the wrapper ends its execution. If it is not selected, the files will remain in the temporary directory.
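
Returning to the Browser pool tab, the interaction between Max. Pool Size, Min. Pool Size and Reusable Browsers can be easier to follow with a small model. The Python sketch below is only an illustration of the rules as described above, not the ITPilot implementation: the names BrowserPool, Browser, acquire and release are hypothetical, and raising an error when the pool is exhausted is an assumption (the text does not say what the tool does in that case).

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Browser:
    """Stand-in for a pooled browser; only an id is tracked."""
    id: int


@dataclass
class BrowserPool:
    """Toy model of the pool size and reuse parameters (hypothetical names)."""
    max_size: int                 # Max. Pool Size
    min_size: int                 # Min. Pool Size
    reusable: bool = True         # Reusable Browsers
    idle: list = field(default_factory=list)    # browsers waiting in the pool
    in_use: dict = field(default_factory=dict)  # id -> browser serving a request
    _ids: itertools.count = field(
        default_factory=lambda: itertools.count(1), repr=False
    )

    def total(self) -> int:
        """Number of browsers that currently exist, idle or in use."""
        return len(self.idle) + len(self.in_use)

    def acquire(self) -> Browser:
        # An idle browser is reused only when reuse is enabled and the pool
        # already holds at least min_size browsers.
        if self.reusable and self.idle and self.total() >= self.min_size:
            browser = self.idle.pop(0)
        elif self.total() < self.max_size:
            browser = Browser(next(self._ids))
        else:
            # Assumption: the real tool may wait or queue instead of failing.
            raise RuntimeError("pool exhausted: max_size browsers already exist")
        self.in_use[browser.id] = browser
        return browser

    def release(self, browser: Browser) -> None:
        del self.in_use[browser.id]
        # A non-reusable browser is discarded instead of returning to the pool.
        if self.reusable:
            self.idle.append(browser)


# Usage: with min_size=2, the second request creates a new browser even
# though browser 1 is idle; only the third request reuses an existing one.
pool = BrowserPool(max_size=3, min_size=2)
first = pool.acquire()
pool.release(first)
second = pool.acquire()
pool.release(second)
third = pool.acquire()
print(first.id, second.id, third.id)
```

In this model, setting reusable to False means every request gets a fresh browser, which mirrors the case where cookies or other state left by a previous request must not affect later ones.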