Aracne Custom Crawlers

To create a new custom crawler, implement the interface com.denodo.crawler.Crawler. This interface has the following methods:

  • execute. Method invoked by Aracne (ARN) to execute the crawler.
  • stop. Method invoked by Scheduler to stop the execution of the crawler.

During execution, the crawler must deliver its results to Aracne as com.denodo.crawler.data.CrawlDocument objects, using the add methods of com.denodo.crawler.data.DataManager.

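package com.denodo.crawler.data;

import java.util.Collection;

public interface DataManager {

    // Adds a single crawled document.
    public void add(CrawlDocument document);

    // Adds a batch of crawled documents.
    public void add(Collection<CrawlDocument> documents);

    // Reports a single event or error to Aracne.
    public void addEvent(CrawlEvent event);

    // Reports a batch of events or errors to Aracne.
    public void addEvents(Collection<CrawlEvent> events);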

    public void close();

    public void setMappingWriter(MappingRepository writer);

    public void setRepositoryWriter(FileRepository writer);

}
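
The flow described above (Aracne calls execute, the crawler pushes documents through a data manager, and stop asks it to finish early) can be sketched as follows. This is only an illustration that compiles on its own: SketchDocument, SketchDataManager, SketchCrawler and UrlListCrawler are hypothetical stand-ins, and the execute(SketchDataManager) signature is an assumption. The real signatures of com.denodo.crawler.Crawler and CrawlDocument are not shown in this section and should be taken from the Javadoc.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for com.denodo.crawler.data.CrawlDocument.
final class SketchDocument {
    final String url;
    final byte[] binaryData;

    SketchDocument(String url, byte[] binaryData) {
        this.url = url;
        this.binaryData = binaryData;
    }
}

// Hypothetical stand-in exposing only the single-document add method.
interface SketchDataManager {
    void add(SketchDocument document);
}

// Assumed shape of the crawler contract; the real parameters may differ.
interface SketchCrawler {
    void execute(SketchDataManager dataManager);
    void stop();
}

// Emits one document per configured URL until it finishes or is stopped.
final class UrlListCrawler implements SketchCrawler {
    private final List<String> urls;
    // volatile because stop() is likely to be called from another thread.
    private volatile boolean stopRequested = false;

    UrlListCrawler(List<String> urls) {
        this.urls = new ArrayList<>(urls);
    }

    @Override
    public void execute(SketchDataManager dataManager) {
        for (String url : urls) {
            if (stopRequested) {
                break; // honour the stop request between documents
            }
            // Placeholder content; a real crawler would fetch the resource here.
            byte[] content = ("content of " + url).getBytes(StandardCharsets.UTF_8);
            dataManager.add(new SketchDocument(url, content));
        }
    }

    @Override
    public void stop() {
        stopRequested = true;
    }
}
```

Checking the stop flag between documents keeps the sketch simple; a crawler that performs long network operations would probably also need to interrupt those operations.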

If an event or error that Aracne needs to know about occurs while the custom crawler is running, report it by invoking the addEvent or addEvents method of com.denodo.crawler.data.DataManager.
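
One possible way to apply this is to catch failures per document and report each one as an event, so that a single bad resource does not abort the whole crawl. The sketch below uses hypothetical stand-ins (SketchEvent, SketchEventSink, EventReportingFetcher); the fields and constructor of the real CrawlEvent are not described here and are assumptions.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for com.denodo.crawler.data.CrawlEvent.
final class SketchEvent {
    final String source;
    final String message;

    SketchEvent(String source, String message) {
        this.source = source;
        this.message = message;
    }
}

// Hypothetical stand-in for the addEvent method of DataManager.
interface SketchEventSink {
    void addEvent(SketchEvent event);
}

final class EventReportingFetcher {
    // Applies the fetcher to every URL. A failure is reported as an event
    // and the loop continues. Returns the number of successful fetches.
    static int fetchAll(List<String> urls,
                        Function<String, byte[]> fetcher,
                        SketchEventSink sink) {
        int fetched = 0;
        for (String url : urls) {
            try {
                fetcher.apply(url);
                fetched++;
            } catch (RuntimeException e) {
                sink.addEvent(new SketchEvent(url, String.valueOf(e.getMessage())));
            }
        }
        return fetched;
    }
}
```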

The Aracne API for custom crawlers can also build a repository that stores copies of the data obtained by the crawler. When the “binarydata” field of a CrawlDocument is not empty, the contents of the document are stored in the repository. The location within the repository is the one given by the “path” field, if it is set; otherwise, it is derived from the encoded “url” field.
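
The selection rule can be illustrated with a small sketch. RepositoryPathSketch and its resolve method are hypothetical, and the encoding Aracne applies to the “url” field is not specified in this section; URL encoding with UTF-8 is used here purely as an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class RepositoryPathSketch {
    // Returns the repository location for a document, or null when the
    // document has no binary data and therefore nothing is stored.
    static String resolve(byte[] binaryData, String path, String url) {
        if (binaryData == null || binaryData.length == 0) {
            return null;
        }
        if (path != null && !path.isEmpty()) {
            return path; // an explicit path takes precedence
        }
        return encode(url); // fall back to the encoded url
    }

    // Assumed encoding scheme; the real one may differ.
    private static String encode(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

Under these assumptions, resolve(data, null, "http://example.com/a b") yields "http%3A%2F%2Fexample.com%2Fa+b", while the same call with path "docs/a.pdf" yields "docs/a.pdf".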

For more information, see the Denodo Aracne Javadoc documentation and the SalesforceCrawler example in DENODO_HOME/samples/arn/crawler-api.