Boolean Content Filter
This filter acts on the fields specified in the Input field parameter. It is also possible to choose whether documents are discarded when the fields to which the filter is applied do not exist.
The Expression parameter allows specifying a set of regular expressions. The content of at least one of the specified fields must match at least one of the regular expressions; otherwise, the filter rejects the tuple. If none of the fields specified in the Input field parameter exist, the check box Discard document if fields do not exist decides the outcome: if it is checked, the tuple is rejected; if it is not checked, the tuple is not rejected. The following subsection details the syntax used to specify regular expressions.
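The acceptance rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the function name accept_tuple, the dictionary representation of a tuple, and the use of Python's re module in place of the filter's own expression syntax are all hypothetical.

```python
import re


def accept_tuple(doc, input_fields, patterns, discard_if_missing):
    """Return True if the tuple passes the filter, False if it is rejected.

    doc: dict mapping field names to string values (hypothetical representation).
    input_fields: names listed in the Input field parameter.
    patterns: compiled regexes standing in for the Expression parameter.
    discard_if_missing: state of "Discard document if fields do not exist".
    """
    present = [doc[name] for name in input_fields if name in doc]
    if not present:
        # None of the configured fields exist: the check box decides.
        return not discard_if_missing
    # At least one existing field must match at least one expression.
    return any(p.search(value) for value in present for p in patterns)


patterns = [re.compile(r"internet", re.IGNORECASE)]
fields = ["title", "content"]
print(accept_tuple({"title": "Internet commerce"}, fields, patterns, True))  # True
print(accept_tuple({"title": "Steel prices"}, fields, patterns, True))       # False
print(accept_tuple({"author": "n/a"}, fields, patterns, False))              # True
print(accept_tuple({"author": "n/a"}, fields, patterns, True))               # False
```

The last two calls show the only case in which the check box matters: a tuple that contains none of the configured fields.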
Although this filter is applicable to the tuples returned by any type of job, it is especially targeted at ARN jobs. That is why the fields “title” and “content” appear by default (they always appear in the documents obtained by Aracne) as Input Fields.
Syntax for the Expressions in the Content Filters
The expressions can be:
Simple, formed from just one keyword.
Compound, formed from more than one keyword combined using operators.
Keywords
Keywords are the terms to be searched for in the value of a field. They must be enclosed in double quotation marks.
The search for keywords in the value of a field does not distinguish between lower and upper case. Thus, for example, the keywords Management and management behave identically. In this document, all example keywords are written in lower case and without accents.
Keywords, like expressions, can be:
Simple, formed from one term. The search is positive if the term appears in the value of the field.
"internet"
"telecommunications"
Compound, formed from more than one term. The search is positive only if the terms appear in the value of a field in the correct order.
"electronic commerce"
"risk prevention in the workplace"
In compound keywords such as "electronic commerce", only one space should be put between the terms. Each space in the keyword matches one or more spaces in the document to be filtered.
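The two rules stated so far, case-insensitive search and one keyword space standing for one or more spaces in the document, can be sketched with Python's re module. The function name compound_keyword_to_regex is hypothetical, and treating the keyword as a plain substring search is a simplification made for this sketch, not a statement about the filter's internals.

```python
import re


def compound_keyword_to_regex(keyword):
    """Compile a wildcard-free keyword such as "electronic commerce"."""
    terms = keyword.split(" ")
    # Each single space in the keyword matches one or more spaces in the text.
    body = " +".join(re.escape(term) for term in terms)
    # IGNORECASE makes Management and management behave identically.
    return re.compile(body, re.IGNORECASE)


rx = compound_keyword_to_regex("electronic commerce")
print(bool(rx.search("Electronic   Commerce trends")))  # True
print(bool(rx.search("commerce, electronic")))          # False
```

The second call fails because the terms of a compound keyword must appear in the stated order.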
Wildcards can be used to represent variable or optional parts in the keywords:
Asterisk (*) represents a group of zero or more characters, excluding spaces, punctuation marks, hyphens, etc.
Question mark (?) represents a single character that may or may not appear. In this case, any character is valid, including spaces, punctuation marks, hyphens, etc.
With the help of wildcards, bigger keywords can be constructed that cover different variants of a term.
Thus, for example, variations in the ending of a term can be handled by placing the asterisk wildcard at the end of it.
"grant*"
would give a positive result if the terms grant, grants, etc. appear in the value of the field.
Various terms can also be covered in one keyword, where these share the same root or the same ending.
"*silicon"
would give a positive result if the terms silicon, ferrosilicon, etc. appear in the value of the field.
"*silic*"
would give a positive result if the terms silicon, ferrosilicon, silicate, etc. appear in the value of the field.
The asterisk wildcard can also go in the middle of a term:
"elect*fy"
would give a positive result if the terms electrify, electronify, etc. appear in the value of the field, but terms such as electricity or electrification would be left out.
Although the asterisk wildcard reflects options in the sense that it represents a group of characters that may or may not appear, it does not represent special characters like punctuation marks. This problem can be avoided with the question mark wildcard.
"co?generation"
would give a positive result if the terms cogeneration, co-generation, co generation, etc. appear in the value of the field.
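One way to picture the wildcard rules is as a translation into an ordinary regular expression. The sketch below is an approximation under assumptions: the names keyword_to_regex and matches are hypothetical, the asterisk is approximated by \w* (letters, digits and underscore), and keywords are assumed to match whole terms, which is inferred from the examples above rather than stated by the documentation.

```python
import re


def keyword_to_regex(keyword):
    """Translate a keyword with * and ? wildcards into a compiled regex."""
    parts = []
    for ch in keyword:
        if ch == "*":
            parts.append(r"\w*")   # zero or more characters, no spaces or punctuation
        elif ch == "?":
            parts.append(r".?")    # one optional character of any kind
        elif ch == " ":
            parts.append(r" +")    # one keyword space matches one or more spaces
        else:
            parts.append(re.escape(ch))
    # Assumption: a keyword must line up with whole terms in the text.
    pattern = r"(?<!\w)" + "".join(parts) + r"(?!\w)"
    return re.compile(pattern, re.IGNORECASE | re.DOTALL)


def matches(keyword, text):
    return keyword_to_regex(keyword).search(text) is not None


print(matches("grant*", "Two grants were awarded"))       # True
print(matches("*silicon", "ferrosilicon alloys"))         # True
print(matches("elect*fy", "plans to electrify the line")) # True
print(matches("elect*fy", "electricity supply"))          # False
print(matches("co?generation", "a co-generation unit"))   # True
```

The question mark is translated to an optional "any character", which is why it covers the hyphen and the space that the asterisk cannot.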
Wildcards can also be applied to compound keywords.
"co?generation plant*"
"industrial waste*"
Operators
Operators can be classified into:
Unary operators are placed before a simple or compound expression and modify its meaning. A compound expression must be enclosed in brackets.
Binary operators combine two expressions to form a compound expression. The expressions to be combined can be simple or compound; compound expressions must be enclosed in brackets.
The available operators are:
Negation Operator (!) - unary operator that inverts the meaning of the expression it precedes.
!"grant"
would give a positive result if the word grant does NOT appear in the value of the field.
!"*silicon"
would give a positive result if neither the word silicon nor any word ending in silicon appears in the value of the field.
Operator AND (&&) - binary operator that requires both of the combined expressions to be satisfied for the overall result to be positive.
"commerce" && "internet"
would give a positive result if the words commerce and internet both appear anywhere in the document, even if they are not contiguous.
Operator OR (||) - binary operator that requires at least one of the two combined expressions to be satisfied for the overall result to be positive.
"commerce" || "internet"
would give a positive result if the word commerce, the word internet, or both appear in the value of the field.
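The meaning of the three operators can be illustrated with small predicate combinators. This is only a sketch: the helper names keyword, negate, both and either are hypothetical, and keyword matching is reduced to a case-insensitive substring test that ignores wildcards.

```python
def keyword(term):
    """Case-insensitive substring test; wildcards are ignored in this sketch."""
    needle = term.lower()
    return lambda text: needle in text.lower()


def negate(expr):
    # Unary !: positive when the wrapped expression is not satisfied.
    return lambda text: not expr(text)


def both(left, right):
    # Binary &&: positive only when both expressions are satisfied.
    return lambda text: left(text) and right(text)


def either(left, right):
    # Binary ||: positive when at least one expression is satisfied.
    return lambda text: left(text) or right(text)


text = "Internet services and retail commerce"
print(both(keyword("commerce"), keyword("internet"))(text))      # True
print(either(keyword("commerce"), keyword("telecom"))(text))     # True
print(negate(keyword("grant"))(text))                            # True
```

Because each helper returns another predicate, compound expressions are built by nesting calls, which mirrors the use of brackets in the filter syntax.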
More complex expressions can be formed by putting compound expressions between brackets and combining them in turn with the above operators:
"commerce" && ("electronic" || "internet")
would give a positive result if the word commerce and either the word electronic or the word internet are contained in the value of the field.
"commerce" && ("electronic" && (!"B2C"))
would give a positive result if the words commerce and electronic appear in the value of the field, but the word B2C does not.
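Bracketed expressions like the ones above can be evaluated with a small recursive-descent parser. The sketch below is illustrative only: the names tokenize and evaluate are hypothetical, keywords are matched as case-insensitive substrings without wildcard support, and the precedence used for unbracketed input (! before && before ||) is an assumption, since the documentation asks for brackets around compound expressions.

```python
import re

# A token is either a quoted keyword or one of the operators and brackets.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(&&|\|\||!|\(|\)))')


def tokenize(expr):
    expr = expr.rstrip()
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        if m.group(1) is not None:
            tokens.append(("KW", m.group(1)))
        else:
            tokens.append(("OP", m.group(2)))
        pos = m.end()
    return tokens


def evaluate(expr, text):
    tokens, pos, lowered = tokenize(expr), 0, text.lower()

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == ("OP", "||"):
            take()
            right = parse_and()      # always parse the right side fully
            value = value or right
        return value

    def parse_and():
        value = parse_unary()
        while peek() == ("OP", "&&"):
            take()
            right = parse_unary()
            value = value and right
        return value

    def parse_unary():
        tok = take()
        if tok == ("OP", "!"):
            return not parse_unary()
        if tok == ("OP", "("):
            value = parse_or()
            if take() != ("OP", ")"):
                raise ValueError("missing closing bracket")
            return value
        if tok[0] == "KW":
            return tok[1].lower() in lowered
        raise ValueError(f"unexpected token {tok[1]!r}")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(evaluate('"commerce" && ("electronic" || "internet")', "Internet commerce"))       # True
print(evaluate('"commerce" && ("electronic" && (!"B2C"))', "electronic commerce, B2C"))  # False
```

The right-hand operand is parsed before the boolean values are combined, so the parser always consumes the whole expression even when the left-hand side already determines the result.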