Denodo Aracne Wrappers

Virtual DataPort supports the creation of wrappers on indexes of unstructured data created using Denodo Aracne.

To create a wrapper of this type, the name of the data source - DATASOURCENAME - must be indicated along with the name of the Aracne index handler - HANDLERNAME - used to create the wrapper.

As with the other wrappers, it is possible to specify the schema of the data returned by the wrapper (OUTPUTSCHEMA). In this case, the schema must contain a series of fixed attributes that are always returned by Aracne index handlers. Only the name of these fixed attributes may be modified. Furthermore, the schema may also include specific attributes corresponding to other additional fields exported by the Aracne handler.

Below is a description of the fixed attributes:

  • TASK. Name of the Aracne task that obtained and indexed this document. This is of string type.
  • PUBDATE. Document publication date. This only appears in RSS-type documents. This is of string type.
  • TITLE. Title generated by Aracne for the document. This is of string type.
  • ANCHORTEXT. For documents obtained by Aracne using a Web crawling process, it contains the text associated with the link used to reach this document. This is of string type.
  • SUMMARY. Summary generated by Aracne for the document. This is of string type.
  • URL. In the case of documents obtained by a web crawling process, this contains the original document URL. In RSS documents, this corresponds to the link field value of the RSS item. In the case of documents obtained from a local file system, this contains the path to it. In the case of documents obtained from an e-mail server, it contains the name of the e-mail server and the name of the account to which the e-mail belongs. This is of string type.
  • IDENTIFIER. Standardized URL. This is of string type.
  • CONTENT. “Useful” contents of the document generated by Aracne. This is of string type.
  • DESCRIPTION. This only appears in RSS-type documents. In this case, it takes the value of the DESCRIPTION element from the RSS document. This is of string type.
  • MODIFIED. Date on which the document in the index was last modified.
  • SEARCHABLECONTENT. Field added by Virtual DataPort that concatenates the contents of the main textual fields of the document (title, summary, contents, anchor text, etc.) and the specific text fields that the index may contain. This is the field on which searches are normally made.
  • LEVEL. Crawling depth level at which the document was obtained. This is of string type.
  • TYPE. Content type: HTML, PDF, RSS, etc. This is of string type.
  • TITLEXML. Title of the document in XML with information on the view structure of the contents (paragraphs). This field is used to visually represent the title and not for searches. This is of string type.
  • SUMMARYXML. Summary of the document with information (encoded in XML) about how the text was visually distributed in paragraphs. This field is used to visually represent the summary and not for searches. This is of string type.
  • PATH. Path where the Aracne server saved a local copy of the document. This is of string type.
  • SCORE. Indication of the relative relevance of the document for the query. The results of a search are normally returned in decreasing order by SCORE. This is of float type.
  • MAXDOCS. Attribute added by Virtual DataPort to restrict the maximum number of results returned by a search. This is of integer type.
  • CATEGORIES. This only appears in RSS-type documents that contain a CATEGORIES element. In this case, it takes the value of this element from the RSS document. This is of string type.

Denodo Aracne can also automatically generate the most relevant words of a document or a field according to the TF-IDF (Term Frequency-Inverse Document Frequency) relevance measure. These terms can be included in additional fields of the Virtual DataPort wrapper schema. The use of the FILTERMAINTERMS clause is related to this function. See section Adding Fields with the Most Relevant Terms.

The wrapper creation statement also accepts the OR REPLACE modifier. When it is specified and a wrapper with the same name already exists, the definition of that wrapper is replaced by the new one. The creation syntax is shown in Syntax of the CREATE WRAPPER ARN statement (Aracne).

Syntax of the CREATE WRAPPER ARN statement (Aracne)
CREATE [ OR REPLACE ] WRAPPER ARN <name:identifier>
    [ FOLDER = <literal> ]
    DATASOURCENAME = <name:identifier>
    HANDLERNAME = <literal>
    [ OUTPUTSCHEMA ( <field> [, <field> ]* ) ]
    [ FILTERMAINTERMLIST ( <literal> [, <literal> ]* ) ]

<field> ::=
      <simple field>
    | <name:identifier> [ = <mapping:literal> ] : ARRAY OF ( <register field> )
      <main terms constraint>

<simple field> ::=
    <name:identifier> [ = <mapping:literal> ] : <type:literal>
    [ ( { OBL | OPT } ) ] [ ( DEFAULTVALUE <literal> ) ] [ EXTERN ]

<register field> ::=
    <name:identifier> [ = <mapping:literal> ] :
    REGISTER OF ( [ <simple field> [, <simple field> ]* ] )

<main terms constraint> ::=
    MAINTERMS ( <field name:identifier>, <num of mainterms:integer>
    [, ( <literal> [, <literal> ]* ) ] )

The following figure shows an example of the creation of an Aracne wrapper.

For the Aracne wrapper to work correctly, the only change you can make to the fixed attributes is to rename them.

In the example (Example of creating a Denodo Aracne wrapper), the name of the TITLE field is changed to DOCNAME and a field is added to contain the most relevant terms of the document (see section Adding Fields with the Most Relevant Terms).

Example of creating a Denodo Aracne wrapper
CREATE WRAPPER ARN aracneview3
    FOLDER = '/data sources/arn'
    DATASOURCENAME = aracnesearch
    HANDLERNAME = 'default'
    OUTPUTSCHEMA (
    TASK : 'java.lang.String' (OPT),
    PUBDATE : 'java.lang.String' (OPT),
    DOCNAME = 'TITLE' : 'java.lang.String' (OPT),
    ANCHORTEXT : 'java.lang.String' (OPT),
    SUMMARY : 'java.lang.String' (OPT),
    IDENTIFIER : 'java.lang.String' (OPT),
    URL : 'java.lang.String' (OPT),
    CONTENT : 'java.lang.String' (OPT),
    DESCRIPTION : 'java.lang.String' (OPT),
    MODIFIED : 'java.lang.String' (OPT),
    SEARCHABLECONTENT : 'java.lang.String' (OPT) EXTERN,
    LEVEL : 'java.lang.String' (OPT),
    TYPE : 'java.lang.String' (OPT),
    TITLEXML : 'java.lang.String' (OPT),
    SUMMARYXML : 'java.lang.String' (OPT),
    PATH : 'java.lang.String' (OPT),
    SCORE : 'java.lang.Float',
    MAXDOCS : 'java.lang.Integer' (OPT) EXTERN,
    SEARCHABLECONTENT_MAIN_TERM = 'SEARCHABLECONTENT_MAIN_TERM': ARRAY OF (
        SEARCHABLECONTENT_MAIN_TERM_REG: REGISTER OF (
            SEARCHABLECONTENT_SCORE : 'java.lang.Integer',
            SEARCHABLECONTENT_TERM : 'java.lang.String'
        )
    ) MAINTERMS (SEARCHABLECONTENT, 10, ('usualterm1', 'usualterm2'))
);

Syntax of the ALTER WRAPPER ARN statement (Aracne) shows the syntax of the command that modifies an Aracne wrapper.

Syntax of the ALTER WRAPPER ARN statement (Aracne)
ALTER WRAPPER ARN <name:identifier>
    DATASOURCENAME = <name:identifier>
    HANDLERNAME = <literal>
    [ OUTPUTSCHEMA ( <field> [, <field> ]* ) ]
    [ FILTERMAINTERMLIST ( <literal> [, <literal> ]* ) ]

<field> ::=
      <name:identifier> = <mapping:literal> [ VALUE <literal> ] : <type:literal>
      [ ( { OBL | OPT } ) ] [ ( DEFAULTVALUE <literal> ) ] [ EXTERN ]
    | <name:identifier> = <mapping:literal> : ARRAY OF ( <register field> )
      [ <main terms constraint> ]*
    | <name:register field>

<register field> ::=
    <name:identifier> = <mapping:literal> :
    REGISTER OF ( [ <field> [, <field> ]* ] )

<main terms constraint> ::=
    MAINTERMS ( <name:identifier>, <num_of_mainterms:integer>
    [, { ( <literal> [, <literal> ]* ) } ] )
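
As an illustration of this syntax, the following sketch changes the index handler used by the wrapper aracneview3 created in Example of creating a Denodo Aracne wrapper. The handler name 'news' is hypothetical, and the sketch assumes that OUTPUTSCHEMA can be left out when the schema does not change, as the square brackets in the syntax suggest.

Example of modifying a Denodo Aracne wrapper (sketch)
# Assumption: 'news' is a hypothetical Aracne index handler name.
# Assumption: OUTPUTSCHEMA is omitted because the syntax marks it as optional.
ALTER WRAPPER ARN aracneview3
    DATASOURCENAME = aracnesearch
    HANDLERNAME = 'news';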

Adding Fields with the Most Relevant Terms

Denodo Aracne can automatically generate the most relevant words of a document or a field according to the TF-IDF (Term Frequency-Inverse Document Frequency) relevance measure. These terms can be accessed via additional fields in the Virtual DataPort wrapper, as described in this section.

For example, in Example of creating a Denodo Aracne wrapper, a new attribute named SEARCHABLECONTENT_MAIN_TERM is added to contain the most relevant terms of the SEARCHABLECONTENT index field. The new attribute must be of type array of records (see section Management of Compound Values). Each record must contain two fields:

  • The relevant term. In this example, this takes the name of the index field, adding the suffix _TERM (SEARCHABLECONTENT_TERM).
  • Its position in the list of the most relevant terms. In this example, this takes the name of the index field, adding the suffix _SCORE (SEARCHABLECONTENT_SCORE). This is of integer type. The most relevant term takes position 1.

The modifier MAINTERMS must also be used to specify the contents of the new field. To do so, the following parameters can be specified:

  • Name (Mandatory). Name of the field involved. In this example, SEARCHABLECONTENT.
  • Number of main terms (Mandatory). Maximum number of relevant terms to be included for each document.
  • Filter main terms words (Optional). List of “usual words” (separated by commas) that must not appear among the most relevant terms for this field. If Aracne generates a term that appears in this list, the term is removed from the list of relevant terms for the field. Specify only the usual words that are specific to the application. The usual words of the language, such as articles and pronouns (commonly known as “stopwords”), are already removed by Denodo Aracne.
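
The same pattern could be applied to another index field. The fragment below is a sketch of an OUTPUTSCHEMA entry that would collect up to five main terms of the CONTENT field, without a field-specific list of usual words. The names CONTENT_MAIN_TERM, CONTENT_MAIN_TERM_REG, CONTENT_SCORE and CONTENT_TERM are assumptions that follow the naming used in Example of creating a Denodo Aracne wrapper. They are not taken from the product.

Example of a main terms field for the CONTENT field (sketch)
    CONTENT_MAIN_TERM = 'CONTENT_MAIN_TERM' : ARRAY OF (
        CONTENT_MAIN_TERM_REG : REGISTER OF (
            CONTENT_SCORE : 'java.lang.Integer',
            CONTENT_TERM : 'java.lang.String'
        )
    ) MAINTERMS (CONTENT, 5)

Here the third MAINTERMS parameter is left out, so only the stopwords removed by Denodo Aracne, and any list defined for the whole wrapper, would be filtered.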

Furthermore, the Aracne wrapper creation syntax includes the FILTERMAINTERMS clause (see Syntax of the CREATE WRAPPER ARN statement (Aracne)). This clause specifies a list of usual words that is common to all fields in the base view. Once again, there is no need to specify the usual words of the language, such as articles and pronouns (commonly known as “stopwords”), as Denodo Aracne already removes them.
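
A wrapper-level list of usual words might be declared as in the following sketch. The keyword is written FILTERMAINTERMLIST, as it appears in Syntax of the CREATE WRAPPER ARN statement (Aracne). The wrapper name aracneview_filtered and the two words in the list are hypothetical, and the sketch assumes that OUTPUTSCHEMA can be omitted, as the square brackets in the syntax suggest.

Example of a wrapper with a common list of usual words (sketch)
# Assumption: aracneview_filtered, 'usualterm1' and 'usualterm2' are hypothetical names.
# Assumption: OUTPUTSCHEMA is omitted because the syntax marks it as optional.
CREATE OR REPLACE WRAPPER ARN aracneview_filtered
    FOLDER = '/data sources/arn'
    DATASOURCENAME = aracnesearch
    HANDLERNAME = 'default'
    FILTERMAINTERMLIST ( 'usualterm1', 'usualterm2' );

Because of the OR REPLACE modifier, running this statement again would replace the existing definition of the wrapper instead of failing.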