Content Extraction Filter (HTML, PDF, Word, Excel, PowerPoint, XML, EML, and Text)

This filter analyzes the content of the field specified in the Input field parameter, removes any markup associated with the format in which the document is encoded, and stores the resulting text in the field indicated in the Output field parameter. For example, for a Web document, the filter removes HTML tags, JavaScript code, and so on.
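As an illustration of what removing HTML markup and JavaScript code involves, the following is a minimal Python sketch built on the standard library. It is not the filter's actual implementation; the names TextOnly and extract_text are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Hypothetical helper: return the text of an HTML string without markup."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = "<html><head><script>var x = 1;</script></head><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(extract_text(page))
```

In terms of the filter, the string passed to extract_text plays the role of the Input field and the returned string plays the role of the Output field.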

The input field for the filter can be either binary or text. For a binary field, the Charset field parameter can specify the name of the field that contains the document’s charset. If the Always auto-detect check box is selected, the content extractor tries to auto-detect the encoding of the document contained in the input field. The MIME type field parameter can specify the name of the field that contains the document’s MIME type. If this parameter is not specified, the filter tries to auto-detect the appropriate MIME type.
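The interaction between the Charset field parameter and the Always auto-detect check box can be pictured with the Python sketch below. The function decode_document and its detection strategy (strict UTF-8 first, then Latin-1) are assumptions made for illustration; the detection method the filter really uses is not described here.

```python
def decode_document(data, charset=None, always_auto_detect=False):
    """Hypothetical helper: decode binary content into text.

    charset stands for the value read from the field named in the
    Charset field parameter; always_auto_detect mirrors the check box.
    """
    if charset and not always_auto_detect:
        try:
            return data.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall through to detection.
            pass
    # Assumed detection strategy, for illustration only: try UTF-8
    # (dropping a BOM if present), then fall back to Latin-1, which
    # accepts any byte sequence.
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```

For example, decode_document(raw, charset="iso-8859-1") trusts the declared charset, whereas decode_document(raw, charset="iso-8859-1", always_auto_detect=True) ignores it and guesses from the bytes.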

This filter also has two optional parameters, Begin delimiter and End delimiter, which specify the delimiters placed at the beginning and at the end of each paragraph of the returned text.

Note

If the Summary Generator Filter follows this one in a filter chain, the End delimiter parameter must be set to “\n”.
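The effect of the Begin delimiter and End delimiter parameters can be sketched in Python as follows. This is only an approximation: the function delimit_paragraphs is hypothetical, and it assumes that paragraphs are separated by blank lines, which may differ from how the filter identifies paragraphs.

```python
import re


def delimit_paragraphs(text, begin="", end="\n"):
    """Hypothetical helper: wrap each paragraph in begin/end delimiters.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return "".join(f"{begin}{p}{end}" for p in paragraphs)


print(delimit_paragraphs("First paragraph.\n\nSecond paragraph."))
```

The default end="\n" in this sketch mirrors the setting required when the Summary Generator Filter comes next in the chain.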

In the case of EML documents (typically obtained using Denodo Aracne), the extracted text also includes the text of any files attached to the e-mail.

Depending on the type of content extractor, this filter can add further fields to the processed document:

  • The EML documents extractor also adds the following fields:
    • subject (text). A text with the subject of the e-mail message.
    • from (collection of texts). List of source e-mail addresses for the message.
    • recipient (collection of texts). List of destination e-mail addresses for the message.
    • replyto (collection of texts). List of e-mail addresses to which replies to the message should be sent.
    • receiveddate (date). Date the message was received.
    • sentdate (date). Date the message was sent.
  • The PDF documents extractor also adds the following fields:
    • title (text). Document title.
    • subject (text). Subject of the document.
    • author (text). Name of the document’s author.
    • creator (text). Name of the person who created the document.
    • creationDate (date). Document’s creation date.
    • keywords (text). Text contained in the document’s keywords.
    • modificationDate (date). Document’s modification date.
    • producer (text). Application that generated the document.
  • The Microsoft Office documents extractor also adds the following fields:
    • title (text). Document title.
    • subject (text). Subject of the document.
    • author (text). Name of the document’s author.
    • creationDate (date). Document’s creation date.
    • keywords (text). Text contained in the document’s keywords.
    • modificationDate (date). Document’s modification date.
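To make the EML fields listed above more concrete, the following Python sketch derives similar values from a raw message with the standard email package. It is not the extractor's implementation. The function eml_fields is hypothetical, and two mappings are assumptions: recipient is built from the To and Cc headers, and receiveddate is taken from the timestamp after the last ";" of the topmost Received header.

```python
from email import message_from_string, policy
from email.utils import getaddresses, parsedate_to_datetime


def eml_fields(raw):
    """Hypothetical helper: map e-mail headers to the fields listed above."""
    msg = message_from_string(raw, policy=policy.default)

    def addresses(*headers):
        values = [str(v) for h in headers for v in msg.get_all(h, [])]
        return [addr for _name, addr in getaddresses(values) if addr]

    sent = msg["Date"]
    received = msg.get_all("Received", [])
    received_date = None
    if received:
        # Assumption: the reception timestamp follows the last ";".
        stamp = str(received[0]).rsplit(";", 1)[-1].strip()
        received_date = parsedate_to_datetime(stamp)

    return {
        "subject": str(msg["Subject"] or ""),
        "from": addresses("From"),
        "recipient": addresses("To", "Cc"),  # assumption: To plus Cc
        "replyto": addresses("Reply-To"),
        "receiveddate": received_date,
        "sentdate": parsedate_to_datetime(str(sent)) if sent else None,
    }


raw = (
    "From: Alice <alice@example.com>\n"
    "To: bob@example.com, Carol <carol@example.com>\n"
    "Reply-To: team@example.com\n"
    "Subject: Hello\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "\n"
    "Body\n"
)
print(eml_fields(raw))
```

The address fields come back as lists and the date fields as datetime objects, matching the "collection of texts" and "date" types given in the list.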

Although this filter can be applied to the tuples returned by any type of job, it is designed mainly for ARN jobs. For this reason, the parameters have the following default values: the Input field is “binarydata”, which contains the document obtained by the crawler in binary format; the MIME type field is “mimetype”; the Charset field is “charset”; and the Output field is “content”. All these fields appear in the documents returned by Aracne.