Filter Sequences

Once data has been extracted from the sources, the obtained tuples can be filtered and/or modified by applying a filter sequence to them.

A filter sequence consists of individual filters, where the output of each filter becomes the input of the next filter in the sequence. The input of a filter sequence is the set of tuples/documents obtained by the extractors. Its output is the set of tuples/documents that satisfy all the filters, possibly modified or extended with additional data generated by the filters in the chain.
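The chaining behavior can be pictured with a minimal Python sketch. This is not Scheduler's API: `apply_sequence`, `has_title`, and `add_source` are hypothetical names chosen for illustration, and tuples are modeled as plain dictionaries. Each filter is assumed to return either a (possibly modified) tuple or `None` to discard it.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical model (not Scheduler's API): a tuple is a dict, and a filter
# returns the possibly modified tuple, or None to discard it.
Filter = Callable[[dict], Optional[dict]]


def apply_sequence(tuples: Iterable[dict], filters: List[Filter]) -> List[dict]:
    """Pass every tuple through the filters in order."""
    results = []
    for item in tuples:
        current: Optional[dict] = dict(item)  # copy so the input is untouched
        for f in filters:
            current = f(current)  # output of one filter feeds the next
            if current is None:   # rejected: later filters never see it
                break
        if current is not None:
            results.append(current)
    return results


def has_title(t: dict) -> Optional[dict]:
    """Illustrative filtering step: keep only tuples with a non-empty title."""
    return t if t.get("title") else None


def add_source(t: dict) -> Optional[dict]:
    """Illustrative modifying step: extend the tuple with an extra field."""
    return {**t, "source": "crawler"}


extracted = [{"title": "Report"}, {"title": ""}]
filtered = apply_sequence(extracted, [has_title, add_source])
```

In this sketch only the first tuple reaches the end of the sequence, and it leaves with the extra `source` field added by the second filter.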

Filter sequences are managed in the “Filter Sequences” perspective (see Filter sequences screen), which lists the existing filter sequences. If the user has previously selected an active project in the “Projects” perspective, the list shows only the filter sequences created for that project. Otherwise, it shows all existing filter sequences, which can then be filtered by project. Once a filter sequence has been created, it can be changed or deleted by clicking on its name in the list.

Filter sequences screen

There is also a button to create a new filter sequence. If the user has previously selected an active project, the new filter sequence is created in that project; otherwise, a dialog asks the user to select a project before creating the new filter sequence.

In the filter sequence editing screen, click the New filter button to add a new filter (as shown in Filter sequences editing screen). A list of the filter types that can be created is then shown. To reorder the filters of a sequence, click the up and down arrows in the filters list.

Filter sequences editing screen

Only users with Admin permission over a project can manage (create, edit, and remove) its filter sequences. A user with global Admin permission can manage filter sequences in every project.

The platform provides a series of pre-defined filters and also allows new filters to be added to the system (see section Filters). To create a filter sequence, the user specifies the filters it comprises, their execution order, and the parameters for each filter.

The filters included are:

  • Boolean. Boolean Content Filter. Filters tuples according to whether the content of some of their fields satisfies a specific boolean expression composed of various keywords.
  • Content-extractor. HTML, PDF, Word, Excel, PowerPoint, XML, EML, and Text Content Extraction Filter. Extracts useful texts contained in documents in the respective formats by rejecting formatting marks.
  • New-field. New Field Filter. Adds a new field to the tuple, allowing its name and value to be specified.
  • Summary-generator. Summary Generation Filter. Automatically generates a summary of the content of a document.
  • Title-generator. Title Generation Filter. Automatically generates a title for the contents of a document.
  • Unicity. Unicity Filter. Removes duplicate tuples, that is, tuples that have the same value in a specified field.
  • Uri-normalizer. URI Normalization Filter. Transforms URIs into a normalized format for comparison.
  • Useful-content-extractor. Useful Content Extraction Filter. Uses several heuristics to automatically extract the useful content of a document, removing browser menus, images, and other adornments common in Web documents. This filter uses the Content-extractor filter (Content Extraction Filter) internally; therefore, the Content Extraction Filter need not be included if the Useful Content Extraction Filter is used.
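As a rough illustration of what some of these filters do, the following Python sketch models the New-field, Unicity, and Uri-normalizer filters on dictionary-based tuples. The function names and signatures are hypothetical and do not come from Scheduler. Two behaviors are assumptions made only for this sketch: the unicity function keeps the first tuple seen for each value, and the normalization lowercases the scheme and host and drops the fragment. The rules the real filters apply are described in their own subsections.

```python
from urllib.parse import urlsplit, urlunsplit


def new_field(t: dict, name: str, value: str) -> dict:
    """Return a copy of the tuple extended with one extra field."""
    return {**t, name: value}


def unicity(tuples: list, field: str) -> list:
    """Drop tuples whose value in `field` was already seen.

    Assumption for this sketch: the first occurrence is the one kept.
    """
    seen = set()
    kept = []
    for t in tuples:
        key = t.get(field)
        if key in seen:
            continue
        seen.add(key)
        kept.append(t)
    return kept


def normalize_uri(uri: str) -> str:
    """Illustrative normalization: lowercase scheme and host, drop fragment."""
    parts = urlsplit(uri)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


docs = [
    {"url": "http://a/1", "n": 1},
    {"url": "http://a/2", "n": 2},
    {"url": "http://a/1", "n": 3},
]
unique_docs = unicity(docs, "url")
tagged = new_field(docs[0], "lang", "en")
normalized = normalize_uri("HTTP://Example.com/a#top")
```

Here `unique_docs` keeps the tuples with `n` equal to 1 and 2, `tagged` carries the additional `lang` field, and `normalized` is `http://example.com/a`.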

For Aracne-type jobs, Scheduler distributes a pre-created filter sequence (default_arn). This sequence contains the following filters:

  • Unicity Filter
  • URI Normalization Filter
  • Useful Content Extraction Filter
  • Title Generation Filter
  • Summary Generation Filter

In addition, for Aracne-type jobs that use the Denodo Global Search crawler, Scheduler distributes a pre-created filter sequence (globalsearch_fs). This sequence contains the following filters:

  • Unicity Filter
  • Useful Content Extraction Filter

For a more detailed explanation of the characteristics of each filter, see the following subsections.