Data Schema Generated by the Different Types of Extraction Jobs
All extraction jobs except VDPCache return the following fields in every tuple, in addition to their own fields:
- _$job_project (text). Name of the project pertaining to the job.
- _$job_name (text). Name of the job.
- _$job (numerical). Identifier of the job.
- _$job_start_time (numerical). Time (in milliseconds) when the job was first executed.
- _$job_retry_start_time (date). Time at which the current job execution started.
- _$job_retry_count (numerical). Number of the current retry execution.
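As an illustration, the common fields can be pictured as a plain record attached to every tuple. The sketch below is not Scheduler code: the sample values and the helper name `job_start_as_datetime` are hypothetical, and it assumes that `_$job_start_time` is a Unix epoch timestamp in milliseconds, which the list above does not state explicitly.

```python
from datetime import datetime, timezone

# Hypothetical tuple carrying only the common job fields (all values are invented).
sample_tuple = {
    "_$job_project": "demo_project",
    "_$job_name": "demo_job",
    "_$job": 42,
    "_$job_start_time": 1_700_000_000_000,  # assumed: epoch milliseconds
    "_$job_retry_start_time": datetime(2023, 11, 14, 22, 13, 20, tzinfo=timezone.utc),
    "_$job_retry_count": 0,
}


def job_start_as_datetime(row: dict) -> datetime:
    """Convert the millisecond start time to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(row["_$job_start_time"] / 1000, tz=timezone.utc)
```

If the assumption about the epoch does not hold for a given deployment, only the conversion helper would need to change; the field names are taken from the list above.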
For Aracne jobs, all the crawlers include the following fields. With WebBot and IECrawler, all of these fields have a non-null value, while with the other crawlers some of them may be null:
- url(text). Represents the URL for the document obtained.
- path(text). Represents the path in the local file system associated with the document. The path is relative to
- title(text). The title of the document in the case of HTML. Value of the ‘Title’ tag for RSS documents.
- charset(text). Document encoding obtained from the server’s response or from metadata on the document.
- mimetype(text). The document’s MIME type. This information is obtained from the server’s response. If the response does not include it, the document is analyzed to try to detect the type automatically.
- anchortext(text). In the case of web crawling, text of the link that pointed to the document.
- binarydata(binary). Binary representation of the document.
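Because some of these fields may be null depending on the crawler, code that consumes the tuples should not assume that every field is populated. The following is a minimal sketch of that idea; the `CrawledDocument` class, the `null_fields` helper, and the sample values are hypothetical and only mirror the field names listed above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CrawledDocument:
    """Hypothetical record mirroring the common crawler fields."""
    url: Optional[str] = None
    path: Optional[str] = None
    title: Optional[str] = None
    charset: Optional[str] = None
    mimetype: Optional[str] = None
    anchortext: Optional[str] = None
    binarydata: Optional[bytes] = None


def null_fields(doc: CrawledDocument) -> List[str]:
    """Return the names of the fields that are null for this document."""
    return [f.name for f in fields(doc) if getattr(doc, f.name) is None]


# Invented example: a document for which only the URL and MIME type are known.
doc = CrawledDocument(url="http://example.com/a.pdf", mimetype="application/pdf")
```

Calling `null_fields(doc)` on the sample document lists the five fields that were left unset, in declaration order.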
In addition, some crawlers return the following specific fields:
- WebBot. When it is used to crawl a file system or FTP server, the tuples also include one additional field called filename that specifies the file name of the document.
- Mail. Generates the fields url, path, binarydata, and mimetype. The url field encapsulates data from the server and user account queried. The path field points to a file that stores the content of the mail (with attachments) in .eml format and the binarydata field contains the file content. The value of the mimetype field is always “message/rfc822”. The folder field specifying the name of the folder from which the message was extracted may also be returned.
- Salesforce. Generates the url field, and the mimetype field value is always “Structured Textual Content”. It also generates one field for each field specified in the Field Name parameter of the Salesforce query. All these fields are treated as text.
Additionally, for RSS-type documents the following fields are added: pubdate(text), categories(collection of texts), and description(text). They contain the corresponding value for those fields of the RSS item.
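For the Mail crawler described above, the binarydata field holds the message in .eml format, so it can be parsed with any RFC 822 parser. The sketch below uses Python's standard `email` package; the tuple is represented as a plain dictionary, and the message bytes and the helper name `subject_of` are invented for illustration.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical Mail tuple; the message content is invented.
mail_tuple = {
    "mimetype": "message/rfc822",
    "binarydata": b"From: a@example.com\r\nSubject: Hello\r\n\r\nBody text\r\n",
}


def subject_of(row: dict) -> str:
    """Parse the .eml content in binarydata and return its Subject header."""
    if row["mimetype"] != "message/rfc822":
        raise ValueError("not a Mail tuple")
    message = message_from_bytes(row["binarydata"], policy=default)
    return str(message["Subject"])
```

Checking the mimetype value first is a cheap way to tell Mail tuples apart from tuples produced by other crawlers before attempting to parse them.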
The remaining extraction jobs (ITP, JDBC, and VDP) add all the fields retrieved from their respective data sources.