General Architecture

Denodo Aracne is divided into two independent modules:

  • Aracne Server (ARN-CRAWLER): The crawling module is an automatic tool for retrieving unstructured data available on the Web, in file systems, on e-mail servers, etc. (see Denodo Aracne Architecture). Denodo Aracne provides a series of crawlers for different sources of unstructured data.
  • Aracne Search/Index Engine Server (ARN-INDEXER): The indexing and search module stores documents so that they can subsequently be searched.

Denodo Aracne also includes an administration tool for configuring, managing, and searching indexes.

The usual way to use Denodo Aracne is through Denodo Scheduler. This is done by defining ARN-type tasks, which use any of the crawling engines implemented by Denodo Aracne, or ARN-Index tasks, which enable automatic maintenance operations on the ARN-Indexer indexes (such as removing documents that are old, obsolete, or no longer accessible).

In addition, the Aracne Index Engine Server can be used to export the results of any Denodo Scheduler task. Complex Boolean, keyword-based searches can then be executed on the resulting index.

The figure below shows the Denodo Aracne architecture with its two servers (crawling and indexing/search) and their relation to Denodo Scheduler. Additionally, Denodo Aracne has its own indexing/query API (see section Denodo Aracne API - Search/Indexing).

Denodo Aracne Architecture

The ARN-Crawler core comprises the following crawling robots:

  • WebBot and MSIECrawler crawl through the Web hypertext structure, starting with a group of initial URLs, and recursively retrieve all pages accessible from the source URL group. They also allow connecting to an FTP server and obtaining the information contained in all the files and subdirectories of a specified directory. Multiple languages are supported for crawled documents.

    WebBot can also explore a file system (even one located in a shared folder) by treating a directory as an initial URL and extracting the data contained in all its files and subdirectories.

  • POP3/POP3S/IMAP/IMAPS Crawler. Retrieves data from e-mails stored on servers accessible through the POP3, POP3S, IMAP, or IMAPS protocols. It includes support for attached files.

  • Salesforce.com Crawler. Retrieves data held in entities accessible through an account with the Salesforce.com online service.

  • CustomCrawler. Extracts data from a data source through a Java implementation provided by the Denodo Aracne administrator. This type of robot allows ad hoc construction of a crawler for a specific source.
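
The recursive retrieval described for the Web crawlers above can be illustrated with a minimal sketch. This is not Denodo Aracne code: the function crawl, the get_links callback, the max_pages limit, and the example.com URLs are all hypothetical names chosen for illustration, and an in-memory link graph stands in for real pages so that the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first traversal from the seed URLs; each URL is visited once."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        # get_links is a caller-supplied function returning the outgoing
        # links of a page (hypothetical; a real crawler would fetch and parse).
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real pages (hypothetical URLs).
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

pages = crawl(["http://example.com/"], lambda url: LINKS.get(url, []))
print(pages)
```

The set of already seen URLs is what keeps the traversal from looping when pages link back to each other, as the page "b" does here.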

The configuration of each specific type of crawler is described in detail in the Scheduler Administration Guide, where the ARN extraction tasks are created. The same applies to the maintenance actions of ARN-Indexer.

The query engine (see Denodo Aracne Architecture) receives queries from users through either the Web interface or the Aracne search API, retrieves the results relevant to each query from the data contained in the index, and returns the response to the user as a list of documents.
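
As a rough illustration of how a query can be answered from an index, the sketch below builds a small inverted index and returns the documents that contain every query term. It does not reflect the internal implementation of ARN-Indexer or its API: build_index, search, and the sample documents are hypothetical, and tokenization is reduced to lowercase whitespace splitting.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample documents.
DOCS = {
    "doc1": "crawler retrieves web pages",
    "doc2": "index stores crawled pages",
    "doc3": "search engine queries the index",
}

index = build_index(DOCS)
hits = search(index, "index pages")
print(hits)  # ['doc2']
```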

The indexing and search module allows:

  • Indexing documents in various formats through Denodo Scheduler: HTML, PDF, Microsoft Word, Excel, PowerPoint, RSS (versions 0.91, 0.92, 1.0, and 2.0), EML, etc.
  • Using “stemming” features to allow more reliable document searches, since matches are not limited to exact word searches; words with the same lemma/root will also match.
  • Multi-field searches. This way, queries on various parts of a document (title, summary, body, etc.) can be combined.
  • Maintaining several indexes, which allows different thematic search engines to be created.
  • Ordering results by relevance, based on the TF-IDF algorithm.
  • Advanced searches with operators +, -, *, AND, OR, fuzzy search for similar words, search by configurable proximity of terms, etc.
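
To give an idea of what relevance ordering based on TF-IDF means, the following sketch scores documents with one common textbook variant of the formula (term frequency normalized by document length, multiplied by a smoothed logarithmic inverse document frequency). The exact formula used by Denodo Aracne is not specified here, so this variant, the function tfidf_rank, and the sample documents are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf_rank(docs, query):
    """Rank document ids by a simple TF-IDF score for the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term]:
                # Assumed variant: smoothed idf so frequent terms still count.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score

    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical sample documents.
DOCS = {
    "a": "denodo crawler crawler index",
    "b": "crawler web pages",
    "c": "email server archive",
}

ranked = tfidf_rank(DOCS, "crawler")
print(ranked)  # ['a', 'b']
```

Document "a" ranks first because the query term accounts for a larger share of its text; document "c" does not contain the term and is left out of the results.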