Google Search Wrappers

Virtual DataPort supports creating wrappers over search engines built with the Google Search tools.

The main parameters of the CREATE WRAPPER GS statement are the following:

  • SITECOLLECTIONS: This parameter is mandatory. It specifies the collections of the Google Search server on which to run the search. Collections are created by the Google Search server administrator, and their names are case-sensitive. Several collections can be specified, separated by commas; in that case, the search runs on all of them. When an external server is accessed, the collection to search can normally be found by examining the value of the site parameter in the invocation URLs.
  • CLIENT: This parameter is optional. It identifies the client making the queries. The Google Search server can be configured to behave differently depending on the client that issued the query.
  • LANGUAGES: This parameter is optional. If specified, only documents in the specified languages are returned.
  • NUMKEYMATCH: This parameter is optional. Google Search allows the administrator to manually set the priority of pages. This parameter takes an integer value between 0 and 5, where 5 is the highest priority. If it is set, searches return only pages with the specified priority or higher.
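
As a hedged sketch of how these parameters might be combined, the fragment below lists two collections and sets the three optional parameters. The wrapper name acme_search, the collection Acme_docs, and the values given to CLIENT, LANGUAGES and NUMKEYMATCH are hypothetical. The clause syntax shown for CLIENT, LANGUAGES and NUMKEYMATCH is an assumption modeled on the DATASOURCENAME and SITECOLLECTIONS clauses, so check it against the statement syntax of your Virtual DataPort version. The OUTPUTSCHEMA clause is left out for brevity.

```vql
# Illustrative fragment: names and values are hypothetical, and the syntax of
# the CLIENT, LANGUAGES and NUMKEYMATCH clauses is assumed, not confirmed.
# The search runs on both collections and returns only documents in the given
# language with priority 3 or higher.
CREATE WRAPPER GS acme_search
    DATASOURCENAME = acme_com
    SITECOLLECTIONS (
        'Acme_com',
        'Acme_docs'
    )
    CLIENT = 'intranet_frontend'
    LANGUAGES ( 'en' )
    NUMKEYMATCH = 3
```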

As with the other wrappers, the schema of the data returned by the wrapper can be specified with the OUTPUTSCHEMA parameter. In this case, the schema must include a fixed set of fields, and only their names may be changed. Each field is described below:

  • TITLE. Title of the document. This is of string type.
  • SUMMARY. Summary generated by Google Search for the document. This is of string type.
  • URL. Document URL. This is of string type.
  • MIMETYPE. MIME type of the document. This is of string type.
  • RATING. Priority assigned manually to the document by the Google Search administrator. It takes values between 0 and 5, where 5 is the highest priority. This is of integer type.
  • MAXDOCS. Field added by Virtual DataPort to restrict the maximum number of results returned by a search. This is of integer type.
  • METAS. Attribute of type array of records (see section Management of Compound Values) that contains the metatags of the document. Each record has two string-type fields: the name of the metatag (metakey) and its value (metavalue).
  • CONTENT. Contents of the document. This is the field normally used for searches. This is of string type.
  • SITE. Restricts the documents returned to those belonging to a certain domain (e.g. “acme.com”). This is of string type.
  • FILETYPE. Extension of the document file. This is of string type.

The wrapper creation statement also accepts the OR REPLACE modifier. If it is specified and a wrapper with the same name already exists, the existing definition is replaced by the new one.

The following example shows the creation of a Google Search wrapper. The wrapper fields must be those listed above; for the statement to work correctly, only the names of the output fields can be changed. In the example, the TITLE field is renamed to DOCNAME.

Example of creating a Google Search wrapper
CREATE WRAPPER GS acme_com
    DATASOURCENAME = acme_com
    SITECOLLECTIONS (
        'Acme_com'
    )
OUTPUTSCHEMA (
  DOCNAME = 'TITLE' : 'java.lang.String' (OPT),
  SUMMARY : 'java.lang.String',
  URL : 'java.lang.String' (OPT),
  MIMETYPE : 'java.lang.String',
  RATING : 'java.lang.Integer',
  MAXDOCS : 'java.lang.Integer' (OPT) EXTERN,
  METAS: ARRAY OF (
       METAS: REGISTER OF (
           METAKEY : 'java.lang.String',
           METAVALUE : 'java.lang.String'
       )
  ),
  CONTENT : 'java.lang.String' (OPT) EXTERN,
  SITE : 'java.lang.String' (OPT) EXTERN,
  FILETYPE : 'java.lang.String' (OPT) EXTERN,
  LANGUAGE : 'java.lang.String'
)
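
The OR REPLACE modifier described earlier can be added to the same statement. As a minimal sketch, the statement below redefines the acme_com wrapper and also renames the SUMMARY field, following the pattern used for DOCNAME. The name ABSTRACT is a hypothetical choice made for illustration; everything else is copied from the example above.

```vql
# Sketch only: replaces the existing acme_com wrapper definition.
# ABSTRACT is a hypothetical new name for the fixed SUMMARY field.
CREATE OR REPLACE WRAPPER GS acme_com
    DATASOURCENAME = acme_com
    SITECOLLECTIONS (
        'Acme_com'
    )
OUTPUTSCHEMA (
  DOCNAME = 'TITLE' : 'java.lang.String' (OPT),
  ABSTRACT = 'SUMMARY' : 'java.lang.String',
  URL : 'java.lang.String' (OPT),
  MIMETYPE : 'java.lang.String',
  RATING : 'java.lang.Integer',
  MAXDOCS : 'java.lang.Integer' (OPT) EXTERN,
  METAS: ARRAY OF (
       METAS: REGISTER OF (
           METAKEY : 'java.lang.String',
           METAVALUE : 'java.lang.String'
       )
  ),
  CONTENT : 'java.lang.String' (OPT) EXTERN,
  SITE : 'java.lang.String' (OPT) EXTERN,
  FILETYPE : 'java.lang.String' (OPT) EXTERN,
  LANGUAGE : 'java.lang.String'
)
```

Because the statement replaces the whole definition, the full set of fixed fields is listed again rather than only the renamed one.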

The syntax of the wrapper modification statement is similar to that of the creation statement shown in Example of creating a Google Search wrapper.