Aracne-type Job Extraction Section

Aracne-type jobs require a previously created Aracne-type data source to be specified that identifies the server that will process the corresponding crawling job. It is also necessary to specify the type of crawler to use. Denodo Aracne allows the following crawlers to be used:

  • WebBot. Crawls documents from Web sites, FTP servers, or file systems using the WebBot crawling module. In its configuration you need to indicate the initial URLs, the link and rewriting filters that must be applied, the level of exploration for each Web site, FTP server, or directory of a file system, and whether the robot exclusion standard must be respected (see section WebBot for more information).
  • IECrawler. Crawls documents from Web sites or FTP servers using the IECrawler module. In its configuration you need to indicate the initial URLs, the link and rewriting filters that must be applied, the level of exploration for the Web site or FTP server, and whether the robot exclusion standard must be respected (see section IECrawler for more information).
  • Global Search. Crawls the Virtual DataPort RESTful web service, by getting data from its views and following their associations. It works on one Virtual DataPort database at a time, and can be configured to take some of the database’s views as crawling seeds (otherwise, it starts crawling the RESTful web service’s home page for the selected database).
  • Mail. Gets e-mail messages (including attachments) from servers accessed using POP3 and/or IMAP protocols. The mail server to be connected to and the e-mail accounts to be indexed need to be specified (see section POP3/IMAP E-mail Server Crawling Configuration for more information).
  • Salesforce. Performs queries against entities of the online CRM Salesforce. Access is achieved using the Web Service API for the service (see section Crawling Configuration of Entities in SalesForce.com Accounts for more information).

Users can also create their own crawler to get data from a specific source type (see section Aracne Custom Crawlers). Custom crawlers are added by means of the Scheduler extensions system based on plugins (see section Plugins and JDBC Adapters). Once added, a custom crawler appears automatically in the crawler selector for an ARN job, so that the user can select it.

Web Crawling and File Systems Configuration

WebBot is a crawling module capable of getting data from Web sites, FTP servers, and file systems. It visits the URLs which are provided as a starting point (in the case of file systems, these URLs will use the file protocol), it stores the retrieved documents, and extracts the links (files or subdirectories, in the case of FTP servers and file systems) that these contain to add them to the list of links which the crawler will visit. This process is repeated until all the URLs have been accessed or until the depth level defined to stop the crawling process has been reached.
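The crawling loop described above can be sketched as a breadth-first traversal with a depth limit. This is only an illustration, not WebBot's implementation: the `LINKS` map is a hypothetical stand-in for fetching documents and extracting their links, and the sketch assumes that a depth level of 0 means that only the initial URLs are visited.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched documents.
LINKS = {
    "http://news.acme.com/": ["http://news.acme.com/a", "http://news.acme.com/b"],
    "http://news.acme.com/a": ["http://news.acme.com/c"],
    "http://news.acme.com/b": [],
    "http://news.acme.com/c": [],
}

def crawl(initial_urls, max_depth):
    """Visit URLs breadth-first, stopping max_depth links away from the start."""
    visited = []
    seen = set(initial_urls)
    pending = deque((url, 0) for url in initial_urls)
    while pending:
        url, depth = pending.popleft()
        visited.append(url)  # a real crawler would store the document here
        if depth == max_depth:
            continue  # depth limit reached: links of this document are not followed
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                pending.append((link, depth + 1))
    return visited
```

Calling `crawl(["http://news.acme.com/"], 1)` visits the start page and the two pages it links to, but not the page that is two links away.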

WebBot allows regular expression filters to be defined (see section Using URL filters) that make the system only process those links that match some of the filters, rejecting all the others.

WebBot also allows link rewriting filters to be defined (see section URL rewritings). These filters, which are described in detail in subsequent sections, rewrite the URLs that match a given regular expression before adding them to the list of URLs that remain to be browsed.

IECrawler is a crawling module that uses a set of Internet browsers as “robots” similar to those used by humans to surf the Web, but changed and extended to allow the execution of automatic crawling processes.

The main added value of this approach is that it is capable of following links and downloading documents from any type of Web site, even if it includes JavaScript, complex redirections, session identifiers, dynamic HTML, etc. This is possible because its navigation module automatically emulates the navigation events that a human user would produce when browsing a Web site. The current implementation of IECrawler is based on Microsoft Internet Explorer technology.

WebBot

The first parameter that can be configured in this type of job is Robots exclusion. If this check box is checked, the job respects the limitations set by the robot exclusion standard. This standard allows Web site administrators to tell crawlers which parts of the Web site should not be accessed. The protocol is advisory rather than mandatory, and it relies on the cooperation of all Web robots. For this reason, it is advisable to keep this property activated, that is, set to “Yes” (the default).
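To illustrate what respecting the robot exclusion standard means in practice, the following sketch uses Python's standard `urllib.robotparser` module. It does not show how WebBot evaluates the standard internally; the robots.txt content and the user agent name `ExampleBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content published by a Web site administrator.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(url, user_agent="ExampleBot"):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

A cooperative crawler would skip any URL for which `is_allowed` returns False, such as pages under `/private/` in this example.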

In this type of job, a list of Web sites or file systems to crawl can be configured. The link Add Site allows adding a new Web site or file system to be crawled. For each new site the following parameters can be specified:

  • Exploration Level: Indicates the maximum depth level for stopping the crawling process of a Web site, FTP server, or file system directory. It is also possible to specify whether the default configuration should be used.
  • Minimum and maximum number of workers: This indicates the initial number and the maximum number of crawlers to be run in parallel on the site while the job is being run.
  • URL: It allows indicating the initial URL for the crawling.
    • If the URL refers to a file system directory, a URL with the file protocol should be used (e.g. file:///C:/tmp).
    • If the URL indicated uses the FTP protocol, it must follow the format ftp://server/directory or ftp://user:password\@server/directory (the symbol ‘@’ must be preceded by the escape character ‘\’). Note that, when using the first URL format, the authentication data can be provided using the parameters Login and Password. If the authentication data is not indicated (neither in the URL nor using the Login and Password parameters), the connection to the server will be made using anonymous FTP.
  • Login: The user name for HTTP or FTP authentication, used to access the required URL when the HTTP or FTP protocol is used.
  • Password: The password for HTTP or FTP authentication, used to access the required URL when the HTTP or FTP protocol is used.
  • Download initial URLs: Indicates if the crawler should store the pages provided as initial URLs in the repository.
  • Link and/or rewriting filters can be added (see sections Using URL filters and URL rewritings).
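The two FTP URL formats described in the list above can be produced with a small helper. This is a sketch only: the function name and the host name used in the usage note are hypothetical, and credentials containing special characters are not handled.

```python
def ftp_start_url(server, directory, user=None, password=None):
    """Build an initial FTP URL, escaping '@' with a backslash as described above."""
    path = directory.strip("/")
    if user is None:
        # No credentials in the URL: Login/Password parameters or anonymous FTP apply.
        return f"ftp://{server}/{path}"
    return f"ftp://{user}:{password}\\@{server}/{path}"
```

For example, `ftp_start_url("ftp.acme.com", "docs", "bob", "secret")` returns the text `ftp://bob:secret\@ftp.acme.com/docs`.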

IECrawler

Configuration and use of IECrawler-type jobs are described below. IECrawler differs from WebBot in that the Web exploration processes work at a higher abstraction level. More specifically, IECrawler uses a group of Internet browsers for executing crawling processes. These are similar to those used by humans when Web browsing, but modified and extended to allow the execution of automatic crawling processes.

The main differences with regard to configuration are the details of the Web site to be explored. In this case, IECrawler only allows one Web site or FTP server to be configured. The following parameters can be configured:

  • Exploration level: indicates the maximum depth level for stopping the crawling process of a Web site or FTP server.

  • Maximum number of browsers: indicates the maximum number of crawlers (browsers) to be run in parallel on the site.

  • URL: initial URL for crawling. Indicates a URL to navigate to and its type. Multiple URLs can be added with the “Add URL” button. The types of URLs allowed are GET, POST, and NSEQL navigation sequence.

    If the URL indicated uses the FTP protocol, it must follow the format ftp://user:password\@server/directory (the symbol ‘@’ must be preceded by the escape character ‘\’). If the authentication data is not indicated, the connection to the server will be made using anonymous FTP.

  • Link and/or rewriting filters can be added (see sections Using URL filters and URL rewritings).

Using URL filters

In WebBot- and IECrawler-type jobs there is a link filter configuration section. This type of filter determines which links the crawling process traverses, depending on whether or not they satisfy certain regular expressions. Inclusion filters specify regular expressions that the URLs or the texts of newly discovered links must match. If a link discovered during the crawling process does not match the specified expressions, the link is not traversed and its associated document is not downloaded (and, therefore, the links in that document are not traversed either).

Exclusion filters can also be specified. In this case, the links that match the regular expression associated with the filter will be rejected.

Note that the filters are applied in the order in which they are defined: evaluation stops for a link the first time it matches the regular expression of one of the filters.

To add filters to the system, use the “Add new Link Filter” button. For each filter the following parameters should be specified:

  • Pattern expression: The regular expression that defines the filter for the link. The supported syntax is described in section Regular Expressions for Filters.
  • Included: Indicates whether the links matching the regular expression should be included or rejected in the list of pages that the crawler should visit.
  • Apply pattern expression on anchor. If checked, the regular expression is applied to the link text instead of the URL.

Example: Suppose you only want to get the news pages from the Web site http://news.acme.com. We know that these documents contain the word “news” in their path and that the domain should not be considered. Thus, after pressing the “Add new Link Filter” button the following data should be entered:

  • Pattern expression: (.)*news(.)*
  • Included: Yes

Certain functions can be specified in the regular expression of the link filters. Aracne includes the function DateFormat for handling dates (see section Dateformat Function Syntax for a description of its syntax); new functions can also be added (see the Aracne Administration Guide).

The order of the link filters is very important, because they are processed in order. If only exclusion filters are specified, all URLs are discarded. If you want to specify only exclusion filters, you have to add, at the end of the link filter list, an inclusion filter such as .* so that, by default, all URLs are included except those matching the exclusion filters defined before it.
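The first-match rule and the need for a final catch-all inclusion filter can be sketched as follows. This is an illustration under assumptions: it uses Python's `re` syntax rather than the syntax of section Regular Expressions for Filters, it assumes a pattern must match the whole URL, and it does not model a job with no filters at all.

```python
import re

def link_is_traversed(url, filters):
    """Apply (pattern, included) filters in order; the first match decides."""
    for pattern, included in filters:
        if re.fullmatch(pattern, url):
            return included
    return False  # no filter matched: the link is discarded

# An exclusion filter followed by the catch-all inclusion filter.
FILTERS = [(r".*\.pdf", False), (r".*", True)]
```

With `FILTERS`, PDF links are rejected and every other link is traversed. If the final `.*` inclusion filter is removed, no link is traversed.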

URL rewritings

In WebBot- and IECrawler-type jobs there is a rewriting filter configuration section which allows rewriting rules to be defined for the links.

To add rules of this type to the system press the “Add new Link rewrite” button, which displays the filter edit screen. This screen contains the elements indicated below:

  • Pattern expression: Defines the regular expression of the URLs of the links to be rewritten. See section Regular Expressions for Filters for a description of the required syntax.
  • Substitution: Defines the regular expression with which the link URL will be replaced. It can refer to fragments of the above-mentioned regular expression that correspond to groups of the matched regular expression (see section Regular Expressions for Filters for more information on groups). For example, to retrieve the i-th group the tag $i would have to be included in this regular expression.

Example: A link rewriting filter can be defined to obtain news from the Web site news.acme.com if we know that all the useful content of a news page is available through the link Print news, which eliminates the advertisements and navigation menus. If, for example, the news pages have URLs such as:

http://news.acme.com/news/12/121465.html

and the link Print is as follows:

http://news.acme.com/news/print.php3?id=121465

then the filter will be like this:

  • Pattern: http://news.acme.com/news/(.)+/(.+).html
  • Substitution: http://news.acme.com/news/print.php3?id=$2
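The effect of this rewriting rule can be reproduced with Python's `re` module. This is a sketch, not the Aracne engine: the pattern is restated in Python syntax with the literal dots escaped, and the `$i` group references are translated into Python's `\g<i>` form.

```python
import re

PATTERN = r"http://news\.acme\.com/news/(.)+/(.+)\.html"
SUBSTITUTION = "http://news.acme.com/news/print.php3?id=$2"

def rewrite(url, pattern=PATTERN, substitution=SUBSTITUTION):
    """Rewrite url if it matches pattern; otherwise return it unchanged."""
    match = re.fullmatch(pattern, url)
    if match is None:
        return url
    # Translate $i group references into Python's \g<i> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", substitution)
    return match.expand(template)
```

Here the second group captures the news identifier, so the news page URL shown above is rewritten to the corresponding print URL, while URLs that do not match are left untouched.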

It is possible to specify certain functions in the pattern expression and in the substitution expression for the rewriting filters. Aracne includes the function DateFormat (see section Dateformat Function Syntax for a description of its syntax); it is also possible to add new functions (see the Aracne Administration Guide).

Global Search Crawling Configuration

Global Search is a crawling module (an extension of the WebBot crawler) capable of crawling the Virtual DataPort RESTful web service. This allows indexing and searching the information contained in a Virtual DataPort database.

The Global Search crawler works on one Virtual DataPort database at a time, and can be configured to take some of the database’s views as crawling seeds (otherwise, it starts crawling the RESTful web service’s home page for the selected database). Associations between views define links in the RESTful web service, which will be followed by the Global Search crawler, until all the views have been accessed or until the depth level defined to stop the crawling process has been reached.

Global Search allows regular expression filters to be defined (see section Using URL filters) that make the system only process those links that match some of the filters, rejecting all the others.

Configuration

To use the Global Search crawler you need to create an ARN job first. In the “Extraction section”, select an ARN data source and then choose “Global Search” as the crawler type.

Global configuration

The Global Search crawler can be configured with the following parameters:

Example of global configuration for global search crawler

  • Host: The host where the Virtual DataPort RESTful web service is located.
  • Port: The port where the Virtual DataPort RESTful web service can be accessed.
  • Https: Indicates if the Virtual DataPort RESTful web service should be accessed via HTTP (option not checked) or HTTPS (option checked).
  • Database: The Virtual DataPort database to be crawled.
  • Login: The user login to access the required database.
  • Password: The password to access the required database.
  • Max Workers: The maximum number of crawling threads that will be executed in parallel while the job is running. The default value is 3.
  • Min Workers: The initial number of crawling threads that will be executed in parallel when the job starts. The default value is 1.
  • Extraction Level: The maximum depth level for stopping the crawling process of the RESTful web service (each association between views adds one level of depth). The default value is 1 and the minimum value is 0.
  • Pagination Level: The maximum number of result pages after the first one to be explored for each view of the configured database that is reached by the crawler. The default value is 3, which means that a maximum of 1 + 3 result pages will be explored.
  • Max Results per Page. The maximum number of results to be retrieved by the crawler for each results page. If set to 0, all results will be retrieved. The default value is 1000.
  • Incremental Crawling Start Date. It is possible to perform incremental crawlings. This feature allows adding results that are newer than the specified date to a previous crawling (instead of crawling everything again). To do this, you can configure a start date (following the format YYYY-MM-DD’T’hh:mm:ss.SSSZ, for instance 2014-01-01T00:00:00.000+0000) or you can leave this parameter without a value (in this case, the job’s last execution time will be considered as the incremental crawling start date). In any case, you will need to add specific configuration for at least one view in order to use the incremental crawling feature (see sections Crawling seeds configuration and View configuration). Otherwise, this parameter’s value will be ignored.
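As a worked example of how Pagination Level and Max Results per Page combine, the following sketch computes an upper bound on the results retrieved for a single view. The function name is hypothetical, and the bound is derived only from the parameter descriptions above.

```python
def max_results_per_view(pagination_level=3, max_results_per_page=1000):
    """Upper bound on results retrieved for one view; None means no bound."""
    if max_results_per_page == 0:
        return None  # 0 means that all results are retrieved
    pages = 1 + pagination_level  # the first page plus the additional pages
    return pages * max_results_per_page
```

With the default values the crawler explores at most 4 pages of 1000 results, that is, at most 4000 results per view.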

The previous parameters define the crawler’s global configuration. In addition, you can define inclusion and exclusion filters, configure some views as the crawling seeds and add specific configuration for any view you require.

Inclusion / Exclusion filters

Click Add new Inclusion / Exclusion Filter to add a new URL filter (you can add any number of filters that will be applied in the order they are defined). These filters will determine which links are traversed by the crawler. You can configure the following parameters for each filter you define:

Inclusion/Exclusion filters

  • Filter: A regular expression that will be checked against the links discovered by the crawler (for example, (.)* will match any URL).
  • Inclusion: Whether this is an inclusion or exclusion filter. Filters are exclusion filters by default.

An inclusion filter will determine which links are traversed by the crawler, while an exclusion filter will determine which links are not (see section Using URL filters).

Crawling seeds configuration

Click Add new View to add a new crawling seed based on a VDP view (you can add any number of seeds). If no seeds are configured, the Global Search crawler starts crawling the RESTful web service’s home page for the selected database. You can configure the following parameters for each view:

Example of seed configuration for global search crawler

  • View name: The name of the view to be configured as a crawling seed.

  • Max Workers: The maximum number of crawling threads that will be executed in parallel for this seed while the job is running. The default value is the crawler’s global configuration’s value for the parameter.

  • Min Workers: The initial number of crawling threads that will be executed in parallel for this seed when the job starts. The default value is the crawler’s global configuration’s value for the parameter.

  • Extraction Level: The maximum depth level for stopping the crawling process of this seed. The default value is 1 and the minimum value is 0.

  • Pagination Level: The maximum number of result pages after the first one to be explored for each view of the configured database that is reached by the crawler. The default value is the crawler’s global configuration’s value for the parameter.

  • Max Results per Page: The maximum number of results to be retrieved by the crawler for each results page of this seed. The default value is the crawler’s global configuration’s value for the parameter.

  • Order By Expression: Defines an expression used to order the results obtained by the requests sent by the crawler. The expression can be composed of several field names separated by commas (“,”), for example: MyField,MyOtherField. You can also specify whether the ordering is ascending or descending (ascending by default) by appending the reserved words ASC or DESC to each field name, for example: MyField ASC,MyOtherField DESC. Note that you cannot use a field with commas (“,”) in its name in this expression (see the section Known limitations).

  • Incremental Crawling Field Name: The name of the view’s date-type field that will be used for comparing its value with the configured incremental crawling start date (see Global configuration for details). Only those tuples whose value for the configured field is greater than the configured incremental crawling start date (or that have no value for the field) will be included in the crawling.

  • Incremental Crawling Start Date: The start date for incremental crawlings (see Global configuration for details). The default value is the crawler’s global configuration’s value for the parameter.

  • Incremental Crawling Filter: Defines an additional condition that needs to be satisfied by the tuples that passed the incremental crawling start date filter in order to be included in the crawling. You can use any expression that can appear in the WHERE clause of a VQL query (see section WHERE Clause of the VQL Guide for additional information). In addition, the {0} variable can be used to reference the job’s last execution date and time. This parameter will be ignored if an incremental crawling field name is not configured. An example of a valid filter would be:

    "MyOtherDate" > TO_DATE("yyyy-MM-dd'T'HH:mm:ss.SSSZ", "{0}")
    
  • Advanced Query Filter: Defines a condition that needs to be satisfied by the view’s tuples in order to be included in the crawling. You can use any expression that can appear in the WHERE clause of a VQL query (see the WHERE Clause section of the VQL Guide for additional information). For example: "MyField" > 3.
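The Order By Expression syntax described in the list above, including the restriction on commas in field names, can be captured in a small helper. The function is hypothetical and only builds the text to type into the parameter.

```python
def order_by_expression(fields):
    """Build an Order By Expression from (field_name, descending) pairs."""
    parts = []
    for name, descending in fields:
        if "," in name:
            # Field names containing commas are a known limitation.
            raise ValueError(f"field name contains a comma: {name!r}")
        parts.append(f"{name} {'DESC' if descending else 'ASC'}")
    return ",".join(parts)
```

For example, `order_by_expression([("MyField", False), ("MyOtherField", True)])` returns `MyField ASC,MyOtherField DESC`.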

You may also define specific filters for the view by clicking Add new Inclusion / Exclusion Filter (see Global configuration for details). When there are specific filters defined for a view, global filters do not affect the view’s URLs.

View configuration

Click Add new View Configuration to add specific configuration for views that may be accessed during the crawling (you can add any number of view configurations). If you add a view configuration for a view that is also configured as a crawling seed, the view configuration’s parameters’ values take precedence (but the view will still be a crawling seed). You can configure the following parameters for each view:

Example of global search’s view configuration

  • View Name
  • Pagination Level
  • Max Results per Page
  • Order By Expression
  • Incremental Crawling Field Name
  • Incremental Crawling Start Date
  • Incremental Crawling Filter
  • Advanced Query Filter

See Crawling seeds configuration for details on these configuration parameters.
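The precedence between the three configuration levels can be sketched as a layered lookup. This is an assumption-laden illustration: the dictionary keys are hypothetical names, and the sketch assumes that a parameter left empty in a view configuration falls back to the seed value and then to the global value.

```python
GLOBAL_CONFIG = {"pagination_level": 3, "max_results_per_page": 1000}

def effective_settings(global_cfg, seed_cfg=None, view_cfg=None):
    """Merge settings so that view configuration overrides seed, and seed overrides global."""
    settings = dict(global_cfg)
    for layer in (seed_cfg, view_cfg):  # later layers take precedence
        if layer:
            settings.update({k: v for k, v in layer.items() if v is not None})
    return settings
```

For a view that is both a seed with `pagination_level` 5 and has a view configuration with `pagination_level` 2, the effective pagination level is 2, while parameters set at neither level keep their global values.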

Exporting to an index

In order to be able to search the data crawled by the Global Search crawler, this data needs to be indexed by an ARN-Index exporter (see section Postprocessing Section (Exporters) for additional information on how to create an ARN-Index exporter).

To add an ARN-Index exporter to the job, go to the “Exporters” section, click New Exporter and select “ARN-Index”. You need to configure the following parameters:

Exporting to an index using the globalsearch_fs filter

  • Filter sequence: Choose “globalsearch_fs” as the filter sequence. This sequence is included in the Scheduler’s default project and is designed to be used in Global Search crawler jobs. It contains a unicity filter and a useful content extractor filter (see section Filter Sequences).
  • Data source: Any ARN-Index data source.
  • Index name: Configure any index that is suitable for use with the Global Search crawler. ARN-Index includes an index called “globalsearch” that can be used here. If you require additional indexes for use with the Global Search crawler, you need to configure them so they use the “globalsearch” schema which is included in ARN-Index (see the “Administration of Indexes” section of the Aracne Administration Guide for additional information).
  • Clear index: If you check this option, the selected index will be emptied before the exporter is executed. Note that if you configure incremental crawling and this option is selected, the incremental crawling configuration will work only as a crawling filter.

Examples

In this section we will give you some specific examples of how to configure the Global Search crawler to achieve different goals.

Data model

We will consider a simple data model with the following entities:

  • Customer: id_customer (primary key), name, creation_date.
  • Order: id_order (primary key), id_customer (foreign key), description, date.
  • Case: id_case (primary key), id_customer (foreign key), description, date.
  • Lasttweets: id_tweet (primary key), id_customer (foreign key), tweet, date.

For the example, we will consider that the VDP view Lasttweets that corresponds to this entity has “id_customer” as an obligatory field, so it is mandatory to give a value to “id_customer” when executing a query on Lasttweets.

The relationships between these entities are depicted in the following diagram.

[Figure: diagram of the relationships between the entities of the data model]

Scenario 1 - Index everything

We intend to index all the data that can be found in our model. One way to configure the Global Search crawler to do this would be (see Configuration for additional information):

  • Set “Order” and “Case” as crawling seeds and set their “Extraction Level” parameter to 0.
  • Set “Customer” as a crawling seed and set its “Extraction Level” parameter to 1. Also, add an inclusion filter with the following regular expression: (.)*Lasttweets(.)*.

This way we retrieve all the data contained in the Order, Case and Customer entities (as they are configured as crawling seeds, they are guaranteed to be explored by the Global Search crawler, regardless of the configured extraction level). We retrieve Lasttweets’s data too: because there is a relationship between Customer and Lasttweets, the Virtual DataPort RESTful web service generates links that go from Customer’s tuples to Lasttweets’s tuples (and vice versa), and since the extraction level for Customer is 1, the Global Search crawler follows these links. The inclusion filter prevents the crawler from following the links that go from Customer to Order and Case (as these entities will be crawled anyway, the filter optimizes the process).

Note that Lasttweets cannot be configured as a crawling seed because it is mandatory to give a value to its field “id_customer”. In this case, the RESTful web service does not show Lasttweets’s data unless a value for “id_customer” is specified, and this makes it impossible to crawl its contents directly.
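The reasoning behind this configuration can be checked with a small simulation. It is a sketch under assumptions, not the crawler's implementation: the associations are inferred from the foreign keys of the data model, the link text `/ViewName` is a hypothetical stand-in for the links generated by the RESTful web service, and filters are assumed to apply only to discovered links, never to seeds.

```python
import re
from collections import deque

# Associations assumed from the foreign keys of the data model (both directions).
ASSOCIATIONS = {
    "Customer": ["Order", "Case", "Lasttweets"],
    "Order": ["Customer"],
    "Case": ["Customer"],
    "Lasttweets": ["Customer"],
}

def reachable_views(seed, extraction_level, inclusion_patterns=None):
    """Return the views visited from one seed within extraction_level associations."""
    visited = {seed}
    pending = deque([(seed, 0)])
    while pending:
        view, depth = pending.popleft()
        if depth == extraction_level:
            continue
        for target in ASSOCIATIONS[view]:
            link = f"/{target}"  # hypothetical link text
            if inclusion_patterns is not None and not any(
                re.fullmatch(p, link) for p in inclusion_patterns
            ):
                continue  # link rejected by the inclusion filters
            if target not in visited:
                visited.add(target)
                pending.append((target, depth + 1))
    return visited

SEEDS = [
    ("Order", 0, None),
    ("Case", 0, None),
    ("Customer", 1, ["(.)*Lasttweets(.)*"]),
]
indexed = set()
for seed, level, patterns in SEEDS:
    indexed |= reachable_views(seed, level, patterns)
```

After the loop, `indexed` contains all four views: Order and Case are reached as seeds, and Lasttweets is reached only through Customer.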

Now we would like to perform an incremental crawling so we only index data that has been created after Oct 1st, 2014. To do this we can follow these steps:

  • Set the “Incremental Crawling Start Date” parameter of the crawler’s global configuration to 2014-10-01T00:00:00.000+0000.
  • Set the “Incremental Crawling Field Name” parameter of the Order and Case configuration sections to date.
  • Set the “Incremental Crawling Field Name” parameter of the Customer configuration section to creation_date.
  • Add a View Configuration for Lasttweets and set the “Incremental Crawling Field Name” parameter to date.
  • Make sure that the “Clear index” option is not checked in the ARN-Index exporter configuration section.

With this new configuration, only data that has been created after Oct 1st, 2014 will be added to our index.
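The selection rule applied by incremental crawling (a tuple is included when its date field is later than the start date or has no value) can be sketched as follows. The function name is hypothetical, and `DATE_FORMAT` is a Python approximation of the documented date format, not the format string used by the product.

```python
from datetime import datetime

# Python approximation of the documented start date format (assumption).
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

def is_included(field_value, start_date):
    """A tuple is crawled if its date field is empty or later than the start date."""
    return field_value is None or field_value > start_date

START = datetime.strptime("2014-10-01T00:00:00.000+0000", DATE_FORMAT)
```

A Customer tuple with creation_date 2014-10-02 is included, one with creation_date 2014-09-30 is skipped, and a tuple without a creation_date is included.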

Known limitations

The “Order By Expression” parameter’s value cannot reference a field with commas (“,”) in its name.

POP3/IMAP E-mail Server Crawling Configuration

This crawler allows retrieving the content (including the attached files) of the messages from one or several e-mail accounts of a server accessed by using either the POP3 or IMAP protocols.

The parameters specified for this crawler are:

  • Host: Name of the incoming e-mail server. The protocols allowed are POP3 and IMAP.
  • Accounts: E-mail accounts whose messages will be retrieved and indexed by Aracne. The username (User) and password (Password) need to be specified for each account.
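To make concrete what "content including attachments" covers, the following sketch extracts the text body and the attachment names from a message using Python's standard `email` package. It does not connect to a POP3 or IMAP server and does not show how Aracne processes messages; the sample message is built in memory from hypothetical data.

```python
from email.message import EmailMessage

def extract_content(message):
    """Return the plain-text body and the attachment file names of a message."""
    body_part = message.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    attachments = [part.get_filename() for part in message.iter_attachments()]
    return body, attachments

# Sample message built in memory (hypothetical data).
sample = EmailMessage()
sample["Subject"] = "Quarterly report"
sample.set_content("Please find the report attached.")
sample.add_attachment(
    b"sample bytes", maintype="application", subtype="pdf", filename="report.pdf"
)
```

Calling `extract_content(sample)` returns the message text together with a list holding the single attachment name.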

Crawling Configuration of Entities in SalesForce.com Accounts

This crawler allows data contained in a Salesforce.com account to be accessed using its Web service.

The parameters specified for this crawler are:

  • Login. User ID used for authentication on Salesforce.com.
  • Password. User password used for authentication on Salesforce.com.
  • Element. Name of the Salesforce data entity to be queried (e.g. “Lead”).
  • Field name. Multivalued parameter that lets you specify the name of the fields that you wish to obtain in the query to the element.
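Conceptually, the Element and Field name parameters describe a query of the kind expressed in Salesforce's query language, SOQL. The helper below is a hypothetical illustration of that correspondence; whether the crawler builds such a query internally is an assumption.

```python
def build_query(element, field_names):
    """Build a SOQL-style query selecting the given fields from one entity."""
    if not field_names:
        raise ValueError("at least one field name is required")
    return f"SELECT {', '.join(field_names)} FROM {element}"
```

For example, `build_query("Lead", ["Id", "Name"])` returns `SELECT Id, Name FROM Lead`.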