URL Unicity and Standardization Filters

The unicity and URL standardization filters act on fields that contain a URL that can be considered the primary key for a tuple.

The URL standardization filter converts the value of the field named in the Input field parameter to a standardized format, which makes URLs easier to compare, and stores the result in the field named in the Output field parameter. The standardization consists of the following steps:

  • Both the protocol and the URL host are converted to lowercase.
  • The port specification “:80” is deleted, if it exists.
  • The reference (“anchor”) to a section of an HTML page is deleted, if it exists.
  • The character sequence “/../” is deleted.
  • The session identifiers PHPSESSID and jsessionid, commonly used by Web sites built with PHP and JavaServer Pages respectively, are deleted.
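
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the filter's actual implementation: the function name standardize_url is hypothetical, “/../” is assumed to collapse to a single “/”, and session identifiers are assumed to appear either as a “;jsessionid=...” path suffix or as query parameters.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Parameter names treated as session identifiers (compared case-insensitively).
SESSION_PARAMS = {"phpsessid", "jsessionid"}


def standardize_url(url: str) -> str:
    """Hypothetical sketch of the standardization steps listed above."""
    parts = urlsplit(url)

    # Protocol and host in lowercase (urlsplit's hostname is already lowercased).
    scheme = parts.scheme.lower()
    host = parts.hostname or ""

    # Drop an explicit ":80"; keep any other port. User info is ignored here.
    port = parts.port
    netloc = host if port in (None, 80) else f"{host}:{port}"

    # Remove a ";jsessionid=..." path suffix, as appended by Java servers.
    path = re.sub(r";jsessionid=[^;/?#]*", "", parts.path, flags=re.IGNORECASE)

    # Assumption: "/../" is collapsed to "/" rather than resolved.
    while "/../" in path:
        path = path.replace("/../", "/")

    # Remove session identifiers passed as query parameters.
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0].lower() not in SESSION_PARAMS
    ]

    # An empty fragment drops the "#anchor" reference.
    return urlunsplit((scheme, netloc, path, "&".join(kept), ""))


print(standardize_url(
    "HTTP://News.Acme.COM:80/a/../b/page.jsp;jsessionid=ABC123?PHPSESSID=xyz&id=7#top"
))
# prints: http://news.acme.com/a/b/page.jsp?id=7
```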

Although the URL standardization filter can be applied to the tuples returned by any type of job, it is aimed primarily at ARN jobs. For this reason, both Input field and Output field default to the field “url”, which holds the URL of the document obtained by crawling.

The unicity filter rejects tuples with repeated URLs. The name of the field containing the URL is specified in the Input field parameter, and the name of the field that stores the filter output is specified in the Output field parameter. The check box Discard document if field does not exist controls what happens when the input field is missing from a document: if the box is not checked, the tuple is not rejected; if it is checked, the tuple is rejected.
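
The behavior described above can be illustrated with a minimal Python sketch. It assumes that tuples are modeled as dictionaries and that the set of URLs already seen is kept in memory; the function name unicity_filter and its arguments are hypothetical and only mirror the parameter names used in this section.

```python
from typing import Dict, Iterable, Iterator


def unicity_filter(
    tuples: Iterable[Dict[str, str]],
    input_field: str = "url",
    output_field: str = "identifier",
    discard_if_missing: bool = False,
) -> Iterator[Dict[str, str]]:
    """Hypothetical sketch: yield only tuples whose URL has not been seen yet."""
    seen = set()
    for doc in tuples:
        if input_field not in doc:
            # Mirrors "Discard document if field does not exist".
            if not discard_if_missing:
                yield doc
            continue
        url = doc[input_field]
        if url in seen:
            continue  # repeated URL: the tuple is rejected
        seen.add(url)
        # Assumption: the filter output stored in the output field is the URL itself.
        yield {**doc, output_field: url}


docs = [
    {"url": "http://news.acme.com/1"},
    {"url": "http://news.acme.com/1"},  # duplicate, rejected
    {"title": "no url field"},          # kept unless discard_if_missing is True
    {"url": "http://news.acme.com/2"},
]
print(len(list(unicity_filter(docs))))                           # prints: 3
print(len(list(unicity_filter(docs, discard_if_missing=True))))  # prints: 2
```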

Optionally, the unicity filter can be configured using the following parameters:

  • Parameter to be removed: Specifies a URL parameter that is irrelevant and should be ignored. Two URLs that are identical except for the value of these parameters are regarded as the same for purposes of unicity.
  • Key parameter: Specifies a URL parameter that acts as an identifier. Two URLs that take the same value for these parameters are considered the same for purposes of unicity, regardless of the values of the other parameters.
  • Scope: The value of this parameter is added to the URLs of all processed documents and is taken into account in unicity checks (two identical identifiers with different Scope values are considered different). Its most common use is to prevent documents that have the same URL but were extracted by different jobs from overwriting each other (by default, the identifier field value, which is often the URL in this type of job, is used as the primary key in the ARN-Index scheme). To avoid this problem, assign the job name as the Scope value (or any other value that is not used as the Scope of any other job).

Although the unicity filter can be applied to the tuples returned by any type of job, it is aimed primarily at ARN jobs. For this reason, Input field defaults to the field “url”, which holds the URL of the document obtained by crawling, and Output field defaults to the field “identifier”.

Example: Suppose you want to get news from news.acme.com, and you know that the page for each individual news item has a URL of the form:

http://news.acme.com/servlet/ContentServer?inifile=futuretense.ini&cid=1145997360218&arglink=nolink

http://news.acme.com/servlet/ContentServer?inifile=futuretense.ini&cid=1145997361017&arglink=nolink

where the parameter cid acts as an identifier for each news item and the rest of the URL parameters do not affect the documents of interest for the Acme job. The unicity filter would be created with the following values:

  • Key parameter: cid
  • Parameter to be removed: inifile
  • Parameter to be removed: arglink
  • Scope: acme

The Scope parameter is configured with the name of the job to restrict the unicity checks to documents downloaded by this job.
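
As a rough illustration of how this configuration affects unicity, the following Python sketch derives a comparison key from a URL. It is not the filter's actual implementation: the function name unicity_key is hypothetical, and the way the Scope value is combined with the URL (here, as a prefix separated by “|”) is an assumption made only for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def unicity_key(url, key_params=(), removed_params=(), scope=""):
    """Hypothetical sketch: build the key used to compare two URLs."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)

    # "Parameter to be removed": these parameters never affect the key.
    params = [(k, v) for k, v in params if k not in removed_params]

    # "Key parameter": when given, only these parameters affect the key.
    if key_params:
        params = [(k, v) for k, v in params if k in key_params]

    params.sort()  # make the key independent of parameter order
    base = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), "")
    )

    # Assumption: the scope is prepended; the real filter may combine it differently.
    return f"{scope}|{base}" if scope else base


config = dict(key_params={"cid"}, removed_params={"inifile", "arglink"}, scope="acme")
base_url = "http://news.acme.com/servlet/ContentServer"

first = unicity_key(base_url + "?inifile=futuretense.ini&cid=1145997360218&arglink=nolink", **config)
same = unicity_key(base_url + "?inifile=other.ini&cid=1145997360218&arglink=link", **config)
other = unicity_key(base_url + "?inifile=futuretense.ini&cid=1145997361017&arglink=nolink", **config)

print(first)           # prints: acme|http://news.acme.com/servlet/ContentServer?cid=1145997360218
print(first == same)   # prints: True  (only cid matters)
print(first == other)  # prints: False (different cid)
```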

These filters should be included in every filter sequence created for Denodo Aracne-type jobs.