Each level in the structure of the data to extract with an Extractor
component (see section Extractor) has an associated tagset. A scanner
is just a set of tagsets that can be used inside of the same Extractor
component: if we want to use tagset A and tagset B in the same
Extractor, we need to have a scanner with at least both tagsets.
A tag can be seen as a regular expression that matches an HTML element
appearing in the page which data is to be extracted. As its name
indicates, a tagset is a group of tags. When ITPilot applies a tagset on
a page, it ignores all HTML elements that do not match any tag in the
set. Besides, it considers tags to act as separators between data
elements so, therefore, it assumes that they cannot appear inside the
value of a field to extract.
Example: suppose that an electronic bookstore shows the authors of
any book by separating them with a carriage return, using a <BR> tag:
John Smith, <BR> Peter Jones
In this situation, we want to extract the name of all authors in a
single field of our Extractor structure (‘John Smith, Peter Jones’); to
accomplish it, the tagset used cannot include any tag that matches the
On the other hand, adding more tags to a set is useful in the following
We need to make the extraction process more specific. When an
Extractor extracts more data than desired, adding tags to the set
might be the solution.
We wish to extract the attribute value of an HTML element that does
not match any other tag in our set. The Extractor component only
allows extracting attributes from HTML elements that match tags of
the set that is being used (see section Configuration of the Extractor Component).