Introduction
The Denodo Platform provides advanced functionality for integrating data from dispersed, heterogeneous sources that may be poorly structured.
Denodo Aracne facilitates crawling, indexing, and querying non-structured data in a wide range of formats.
The main characteristics of Denodo Aracne include:
Advanced Web crawling capable of processing Web pages of any level of complexity, including features such as JavaScript, dynamic HTML, authentication, complex redirections, and pop-up menus.
Crawling of FTP servers and file systems.
Retrieval of the content of e-mail messages accessible via POP3 or IMAP (or the secure versions of these protocols, POP3S and IMAPS, respectively).
Rapid indexing: an average of 200 MB/h.
Small index size: approximately 30% of the size of the indexed text.
Support for the most popular formats: HTML, text, XML, Microsoft Word, RSS (versions 0.91, 0.92, 1.0, and 2.0), PDF, Microsoft Excel, Microsoft PowerPoint, EML, etc.
Complex searches: support for operators AND, OR, NOT, +, -, use of brackets, use of wildcards, exact phrase searches, multifield searches (title, URL, etc.), similarity searches, etc.
Index maintenance, including the removal of obsolete documents that are no longer accessible.
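The indexing figures quoted in the list (an average of 200 MB/h and an index of roughly 30% of the indexed text) can be used for rough capacity planning. The following Python sketch is only an illustration of that arithmetic: the function name `estimate_indexing` and its parameters are hypothetical and are not part of any Denodo API, and the figures are averages, so actual results will depend on the hardware and the document mix.

```python
def estimate_indexing(text_mb, rate_mb_per_hour=200.0, index_ratio=0.30):
    """Estimate indexing time and index size from the documented averages.

    text_mb          -- amount of text to index, in MB
    rate_mb_per_hour -- average indexing rate (200 MB/h per this guide)
    index_ratio      -- index size as a fraction of the text (about 30%)

    Hypothetical helper for illustration; not part of Denodo Aracne.
    """
    if text_mb < 0:
        raise ValueError("text_mb must be non-negative")
    hours = text_mb / rate_mb_per_hour
    index_mb = text_mb * index_ratio
    return hours, index_mb


# Example: a 5,000 MB corpus of extracted text.
hours, index_mb = estimate_indexing(5000)
print(f"{hours:.1f} h, {index_mb:.0f} MB")  # prints "25.0 h, 1500 MB"
```

Under these assumptions, 5,000 MB of text would take about 25 hours to index and would produce an index of about 1,500 MB.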
The planning and configuration of the crawling tasks performed by Denodo Aracne are carried out via the Denodo Scheduler module. See the Scheduler Administration Guide for detailed information.