Boolean Content Filter

This filter acts on the fields specified in the Input field parameter. You can also choose whether to discard documents in which none of those fields exist.

The Expression parameter specifies a set of regular expressions. The content of at least one of the specified fields must match at least one of these expressions; otherwise, the filter rejects the tuple. If none of the fields specified in the Input field parameter exist, the outcome depends on the check box Discard document if fields do not exist: if it is checked, the tuple is rejected; if it is not checked, the tuple is kept. The following subsection details the syntax used to specify the expressions.
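
The acceptance rule described above can be summarized in a short Python sketch. It is illustrative only and is not the filter's actual implementation: the function name passes_filter, its parameters, and the representation of each expression as a function from a field value to a boolean are assumptions made for this example.

```python
from typing import Callable, Mapping, Sequence

# Hypothetical stand-in for one filter expression: it receives the value
# of a field and says whether that value matches.
Predicate = Callable[[str], bool]


def passes_filter(
    doc: Mapping[str, str],
    input_fields: Sequence[str],
    expressions: Sequence[Predicate],
    discard_if_missing: bool,
) -> bool:
    """Return True if the tuple is kept, False if it is rejected."""
    present = [name for name in input_fields if name in doc]
    if not present:
        # None of the input fields exist: the check box decides.
        return not discard_if_missing
    # At least one existing field must match at least one expression.
    return any(expr(doc[name]) for name in present for expr in expressions)
```

With a single expression that looks for the word internet, a document whose title contains that word is kept; a document that has none of the input fields is kept only when discard_if_missing is False.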

Although this filter can be applied to the tuples returned by any type of job, it is especially targeted at ARN jobs. For that reason, the fields “title” and “content”, which are always present in the documents obtained by Aracne, appear by default in the Input field parameter.

Syntax for the Expressions in the Content Filters

The expressions can be:

  • Simple, formed from just one keyword.
  • Compound, formed from more than one keyword combined using operators.

Keywords

Keywords are the terms to be searched for in the value of a field. They must be enclosed in double quotation marks.

The search for keywords in the value of a field is case-insensitive. For example, the keywords Management and management behave identically. In this document, all example keywords are written in lower case and without accents.

Keywords, like expressions, can be:

  • Simple, formed from one term. The search is positive if the term appears in the value of the field.

    "internet"

    "telecommunications"

  • Compound, formed from more than one term. The search is positive only if the terms appear in the value of a field in the correct order.

    "electronic commerce"

    "risk prevention in the workplace"

In compound keywords such as “electronic commerce”, put only one space between the terms. Each space in the keyword matches one or more spaces in the document being filtered.

Wildcards can be used to represent variable or optional parts in the keywords:

  • Asterisk (*) represents a group of zero or more characters without spaces, punctuation marks, hyphens, etc.
  • Question mark (?) represents a single character that may or may not appear. In this case, any character is valid, including spaces, punctuation marks, hyphens, etc.
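
As an illustration, the keyword rules described so far (case-insensitive search, a space matching one or more spaces, and the two wildcards) can be approximated by translating a keyword into a Python regular expression. This is a sketch under stated assumptions, not the filter's real matching engine: the function names are hypothetical, the asterisk is mapped to \w* (letters, digits and underscore), and keywords are assumed to match whole words only.

```python
import re


def keyword_to_regex(keyword: str) -> re.Pattern:
    """Translate a filter keyword (without its quotation marks) to a regex."""
    parts = []
    for ch in keyword:
        if ch == "*":
            parts.append(r"\w*")   # zero or more characters, no spaces or punctuation
        elif ch == "?":
            parts.append(r".?")    # one optional character of any kind
        elif ch == " ":
            parts.append(r"\s+")   # one space stands for one or more spaces
        else:
            parts.append(re.escape(ch))
    # Assumption: a keyword must start and end at a word boundary.
    pattern = r"(?<!\w)" + "".join(parts) + r"(?!\w)"
    return re.compile(pattern, re.IGNORECASE | re.DOTALL)


def keyword_matches(keyword: str, value: str) -> bool:
    """Return True if the keyword is found anywhere in the field value."""
    return keyword_to_regex(keyword).search(value) is not None
```

Under these assumptions, keyword_matches("co?generation", "a co-generation plant") returns True, while keyword_matches("elect*fy", "electrification") returns False.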

With the help of wildcards, broader keywords can be constructed that cover different variants of a term.

For example, variations in the ending of a term can be covered by placing the asterisk wildcard at the end of it.

"grant*" would give a positive result if the terms grant, grants, etc. appear in the value of the field.

A single keyword can also cover several terms that share the same root or the same ending.

"*silicon" would give a positive result if the terms silicon, ferrosilicon, etc. appear in the value of the field.

"*silic*" would give a positive result if the terms silicon, ferrosilicon, silicate, etc. appear in the value of the field.

The asterisk wildcard can also go in the middle of a term:

"elect*fy" would give a positive result if the terms electrify, electronify, etc. appear in the value of the field, but terms such as electricity or electrification would be left out.

Although the asterisk wildcard can stand for an optional group of characters, it never matches special characters such as punctuation marks. The question mark wildcard covers that case.

"co?generation" would give a positive result if the terms cogeneration, co-generation, co generation, etc. appear in the value of the field.

Wildcards can also be applied to compound keywords.

"co?generation plant*"

"industrial waste*"

Operators

Operators can be classified into:

  • Unary operators are placed before a simple or compound expression and modify its meaning. A compound expression must be enclosed in brackets.
  • Binary operators combine two expressions to form a compound expression. The operands can be simple or compound expressions; compound operands must be enclosed in brackets.

The available operators are:

  • Negation operator (!) - unary operator that inverts the meaning of the expression that follows it.

    !"grant" would give a positive result if the word grant does NOT appear in the value of the field.

    !"*silicon" would give a positive result if neither the word silicon nor any word ending in silicon appears in the value of the field.

  • Operator AND (&&) - binary operator that requires both combined expressions to be satisfied for the overall result to be positive.

    "commerce" && "internet" would give a positive result if the words commerce and internet both appear anywhere in the document, even if they are not contiguous.

  • Operator OR (||) - binary operator that requires at least one of the two combined expressions to be satisfied for the overall result to be positive.

    "commerce" || "internet" would give a positive result if the word commerce, the word internet, or both appear in the value of the field.

More complex expressions can be formed by enclosing compound expressions in brackets and combining them in turn with the above operators:

"commerce" && ("electronic" || "internet") would give a positive result if the word commerce and either the word electronic or the word internet appear in the value of the field.

"commerce" && ("electronic" && (!"B2C")) would give a positive result if the word commerce and the word electronic both appear in the value of the field, but the word B2C does not.
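
To make the way operators and brackets combine concrete, the following Python sketch evaluates an expression against the value of a single field. It is a minimal illustration and not the filter's actual parser. The names tokenize, evaluate and simple_match are hypothetical, the default matcher is a plain case-insensitive substring test that ignores wildcards, and the precedence used when brackets are omitted (! first, then &&, then ||) is an assumption, since the text above only describes bracketed combinations.

```python
import re

# One token: an operator or bracket, or a keyword in double quotation marks.
TOKEN = re.compile(r'\s*(?:(&&|\|\||!|\(|\))|"([^"]*)")')


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(("op", m.group(1)) if m.group(1) else ("kw", m.group(2)))
        pos = m.end()
    return tokens


def simple_match(keyword, value):
    # Placeholder matcher: case-insensitive substring test, no wildcards.
    return keyword.lower() in value.lower()


def evaluate(expr, value, match=simple_match):
    """Evaluate a boolean keyword expression against one field value."""
    tokens, pos = tokenize(expr), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        result = parse_and()
        while peek() == ("op", "||"):
            take()
            rhs = parse_and()
            result = result or rhs
        return result

    def parse_and():
        result = parse_unary()
        while peek() == ("op", "&&"):
            take()
            rhs = parse_unary()
            result = result and rhs
        return result

    def parse_unary():
        kind, text = take()
        if kind == "kw":
            return match(text, value)
        if text == "!":
            return not parse_unary()
        if text == "(":
            result = parse_or()
            if take() != ("op", ")"):
                raise ValueError("missing closing bracket")
            return result
        raise ValueError(f"unexpected token {text!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError("unexpected trailing input")
    return result
```

The match parameter lets a different keyword matcher be plugged in, for example one that handles the wildcards described in the Keywords section.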