Scanners
To perform the lexical analysis of the documents it processes, the DEXTL interpreter uses a component called a scanner. Each DEXTL program must specify the scanner to use during the extraction process with the #include directive, which indicates the file containing the scanner definition:
#include "scanners/name_scanner"
The choice of scanner matters because a DEXTL program can only use a tagset that is included in the scanner the program specifies.
ITPilot includes the following built-in scanners:
StandardFullLexer4_6. Used by default by the process that generates DEXTL programs from examples. It uses the default tagset all4_6 and the lexer type that keeps blank spaces between tags (see section Lexer Types).
AutogeneratedLexer4_6_x. A set of pre-generated scanners covering the most commonly used configurations. They use subsets of the tagsets mentioned in section Tagsets and the lexer type that keeps blank spaces between tags (see section Lexer Types).
For backwards compatibility, all the scanners of earlier versions of ITPilot are also included. Some of these scanners are deprecated and should not be used in new projects: StandardHTMLLexer, StandardHTMLLexerJS, StandardFormLexer, StandardFormLexerJS and StandardLexerJS.
To use tagsets created by the user, create a new scanner that includes those tagsets (see section Tagsets).
Lexer Types
Scanners are generated from “lexer types” or “skeletons”. ITPilot provides several lexer types. Any of them works for most applications, but in some situations a specific one is preferable. This section describes the available options and provides some examples. The options are:
Replace tags inside extracted values by spaces (html_nojs_spaces). When a DEXTL specification using a scanner generated from this lexer type extracts a data element that contains an HTML tag, the tag is replaced by a blank space.
Remove tags inside extracted values (html_nojs). When a DEXTL specification using a scanner generated from this lexer type extracts a data element that contains an HTML tag, the tag is removed.
Do not remove script code inside values (html_js, deprecated). When a DEXTL specification using a scanner generated from this lexer type extracts a data element that contains JavaScript code, the code is not removed. This option is deprecated and should not be used in new projects.
The following example illustrates the difference between the first two options. Suppose that a specification has been created for an electronic bookstore. Among other fields, the specification obtains the list of authors of every book. In the HTML code of the source, the authors are separated only by a <BR> tag. For instance:
Jones, Peter <BR>
Smith, John
If the option Replace tags inside extracted values by spaces is used, the extraction process replaces the <BR> tag with a blank space, and the retrieved value is ‘Jones, Peter Smith, John’. If the option Remove tags inside extracted values is used instead, the retrieved value is ‘Jones, PeterSmith, John’.
The following example shows when the option Remove tags inside extracted values is useful. Suppose that an electronic bookstore allows searching by the first letters of any word contained in a book’s title, and that the search results show the search term in bold. For instance, the result of a search for ‘enterpr’ might contain HTML fragments such as the following:
Advice for <B>Enterpr</B>ise leaders
<B>Enterpr</B>ise Information Systems
In this case, a specification based on the lexer type Remove tags inside extracted values would return the values ‘Advice for Enterprise leaders’ and ‘Enterprise Information Systems’, while a specification based on the lexer type Replace tags inside extracted values by spaces would return the values ‘Advice for Enterpr ise leaders’ and ‘Enterpr ise Information Systems’ (notice the blank space inside each word).
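The difference between the first two lexer types can be reproduced outside ITPilot with a short Python sketch. This is only an illustration of the effect described above, not the way ITPilot scanners are implemented. The function names, the simplified tag pattern and the collapsing of repeated blank spaces are assumptions made for the example, and the authors fragment is written without whitespace around the <BR> tag.

```python
import re

# Simplified pattern for an HTML tag such as <BR>, <B> or </B>.
# Assumption for illustration only; a real scanner uses its tagsets instead.
TAG = re.compile(r"<[^>]+>")

def _normalize(text: str) -> str:
    # Assumption: runs of whitespace collapse to a single blank space.
    return re.sub(r"\s+", " ", text).strip()

def replace_tags_with_spaces(fragment: str) -> str:
    """Imitates the described effect of html_nojs_spaces: tag -> blank space."""
    return _normalize(TAG.sub(" ", fragment))

def remove_tags(fragment: str) -> str:
    """Imitates the described effect of html_nojs: tag is dropped."""
    return _normalize(TAG.sub("", fragment))

authors = "Jones, Peter<BR>Smith, John"
title = "Advice for <B>Enterpr</B>ise leaders"

print(replace_tags_with_spaces(authors))  # Jones, Peter Smith, John
print(remove_tags(authors))               # Jones, PeterSmith, John
print(replace_tags_with_spaces(title))    # Advice for Enterpr ise leaders
print(remove_tags(title))                 # Advice for Enterprise leaders
```

In this sketch, replacing tags by spaces suits values where tags act as separators (the authors list), while removing tags suits values where tags only add formatting inside a word (the highlighted search term).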