Title Generation Filter

This filter reads the value of the field named in the Input field parameter (which must be a text field) and stores the result in the field named in the Output field parameter. It applies various heuristics to generate a title automatically for the content of the input field. Its behavior depends on the type of content held in that field:

  • For HTML documents, the title is the content of the HTML title tag, if the page has one. Otherwise, an alternative title is generated automatically.
  • For RSS items, the title is the value of the item's “title” field.
  • For EML documents, the title is the “subject” of the e-mail message.
  • For all other documents, the title is generated automatically by applying various heuristics.
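
The selection rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the filter's actual implementation: the names generate_title and fallback_title and the doc_type argument are hypothetical, and the fallback (first non-empty line, truncated) is a placeholder for heuristics that this page does not describe.

```python
import re
import xml.etree.ElementTree as ET
from email import message_from_string
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text of the first <title> element, if any."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fallback_title(text, max_len=80):
    # Placeholder heuristic (an assumption): first non-empty line, truncated.
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line[:max_len]
    return ""


def generate_title(content, doc_type):
    # doc_type is a hypothetical argument; how the real filter detects
    # the document type is not stated in this page.
    if doc_type == "html":
        parser = _TitleParser()
        parser.feed(content)
        if parser.title and parser.title.strip():
            return parser.title.strip()
        # No usable <title>: strip tags and fall back to the heuristic.
        return fallback_title(re.sub(r"<[^>]+>", " ", content))
    if doc_type == "rss_item":
        title = ET.fromstring(content).findtext("title")
        return title.strip() if title else fallback_title(content)
    if doc_type == "eml":
        subject = message_from_string(content)["subject"]
        return subject.strip() if subject else fallback_title(content)
    return fallback_title(content)
```

For example, generate_title("Subject: Meeting notes\n\nBody", "eml") takes the title from the message subject, while an HTML page without a title tag falls through to the placeholder heuristic.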

Although this filter can be applied to the tuples returned by any type of job, it is aimed mainly at ARN jobs. For this reason, the Input field parameter defaults to “content”, the field holding the document content obtained by the crawler, and the Output field parameter defaults to “title”.
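
The role of the two parameters can be pictured with a short Python sketch. It assumes, purely for illustration, that tuples are represented as dictionaries; apply_title_filter and title_fn are hypothetical names and do not belong to the product.

```python
def apply_title_filter(tuples, title_fn, input_field="content", output_field="title"):
    """Return copies of the tuples with output_field set from input_field.

    title_fn is any function that maps text to a title (an assumption
    standing in for the filter's own heuristics).
    """
    result = []
    for t in tuples:
        out = dict(t)  # copy, so the caller's tuple is left unchanged
        value = out.get(input_field)
        if isinstance(value, str):  # the input field has to be textual
            out[output_field] = title_fn(value)
        result.append(out)
    return result


docs = [{"url": "http://example.com", "content": "Hello\nworld"}]
titled = apply_title_filter(docs, lambda text: text.splitlines()[0])
```

With the default parameter values, each tuple's “content” field is read and a new “title” field is added; other fields pass through untouched.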