Google Search Wrappers
Note
The Google Search feature is deprecated and it may be removed in future major versions of the Denodo Platform.
The section Features Deprecated in Virtual DataPort 7.0 lists all the features that are deprecated.
Virtual DataPort supports the creation of wrappers on search engines created using the Google Search tools.
The main parameters of the CREATE WRAPPER GS statement are the following:
SITECOLLECTIONS: This parameter is mandatory. It specifies the collections on the Google Search server in which to search. The collections are created by the Google Search server administrator. Collection names are case-sensitive. Several collections can be specified, separated by commas; in this case, the search is made on all of them. When an external server is accessed, the collection to search can normally be obtained by examining the value of the site parameter in the invocation URLs.
CLIENT: This parameter is optional. It identifies the client making the queries. The Google Search server can be configured to behave differently depending on the client that issued the query.
LANGUAGES: This parameter is optional. If specified, only documents in the specified languages are returned.
NUMKEYMATCH: This parameter is optional. Google Search allows the administrator to manually determine the priority of the pages. This parameter receives an integer value between 0 and 5, where 5 is the maximum priority. If this value is set, searches return only the pages with the specified priority or higher.
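As a rough sketch of how these parameters fit together, the statement below sets the mandatory SITECOLLECTIONS parameter with two collections and all three optional parameters, following the CREATE WRAPPER GS syntax given in this section. The wrapper name, data source name, collection names and client identifier are hypothetical placeholders. The format expected for the LANGUAGES literal is not stated here, so 'en' is only an assumption.

```vql
# Hypothetical names and values throughout; replace them with those of your server.
# 'en' is an assumed placeholder; the required language format is not documented here.
CREATE WRAPPER GS acme_docs
    DATASOURCENAME = acme_com
    SITECOLLECTIONS ( 'Acme_com', 'Acme_support' )
    CLIENT = 'intranet_frontend'
    LANGUAGES ( 'en' )
    NUMKEYMATCH = 3
```

With NUMKEYMATCH set to 3, searches through this wrapper would return only pages whose manually assigned priority is 3, 4 or 5.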
As with the other wrappers, the schema of data returned by the wrapper can be specified with the parameter OUTPUTSCHEMA. In this case, the schema must include a series of fixed fields, and only their name may be modified. Each field is described below:
TITLE: Title of the document. This is of string type.
SUMMARY: Summary generated by Google Search for the document. This is of string type.
URL: Document URL. This is of string type.
MIMETYPE: MIME type of the document. This is of string type.
RATING: Priority assigned manually by the Google Search administrator for the document. This may take values between 0 and 5, where 5 is the maximum priority. This is of integer type.
MAXDOCS: Field added by Virtual DataPort to restrict the maximum number of results returned by a search. This is of integer type.
METAS: Attribute of type array of records (see section Management of Compound Values) that contains the metatags for the document. Each record has two string-type fields to indicate the name of the metatag (metakey) and its value (metavalue).
CONTENT: Contents of the document. This is the field normally used for searches. This is of string type.
SITE: This allows restricting the documents returned to those belonging to a certain domain (e.g. “acme.com”). This is of string type.
FILETYPE: Extension of the document file. This is of string type.
The wrapper creation statement also accepts the OR REPLACE modifier. Where specified, if there is already a wrapper with the same name, its definition is replaced by the new one.
CREATE [ OR REPLACE ] WRAPPER GS <name:identifier>
    [ FOLDER = <literal> ]
    DATASOURCENAME = <name:identifier>
    SITECOLLECTIONS ( <literal> [, <literal> ]* )
    [ CLIENT = <literal> ]
    [ LANGUAGES ( <literal> [, <literal> ]* ) ]
    [ NUMKEYMATCH = <integer> ]
    [ OUTPUTSCHEMA ( <field> [, <field> ]* ) ]

<field> ::=
    <name:identifier> [ = <mapping:literal> ] : <type:literal>
        [ ( { OBL | OPT } ) ]
        [ ( DEFAULTVALUE <literal> ) ]
        [ EXTERN ]
  | <name:identifier> [ = <mapping:literal> ] :
        ARRAY OF ( <register field> )
        [ ( DEFAULTVALUE <literal> ) ]
        [ EXTERN ]

<register field> ::=
    <name:identifier> [ = <mapping:literal> ] :
        REGISTER OF ( [ <field> [, <field> ]* ] )
        [ ( DEFAULTVALUE <literal> ) ]
        [ EXTERN ]
The following example shows the creation of a Google Search wrapper. The wrapper fields must be those specified: for the statement to work correctly, only the names of the output fields may be changed. In the example, the name of the TITLE field is changed to DOCNAME.
CREATE WRAPPER GS acme_com
    DATASOURCENAME = acme_com
    SITECOLLECTIONS (
        'Acme_com'
    )
    OUTPUTSCHEMA (
        DOCNAME = 'TITLE' : 'java.lang.String' (OPT),
        SUMMARY : 'java.lang.String',
        URL : 'java.lang.String' (OPT),
        MIMETYPE : 'java.lang.String',
        RATING : 'java.lang.Integer',
        MAXDOCS : 'java.lang.Integer' (OPT) EXTERN,
        METAS: ARRAY OF (
            METAS: REGISTER OF (
                METAKEY : 'java.lang.String',
                METAVALUE : 'java.lang.String'
            )
        ),
        CONTENT : 'java.lang.String' (OPT) EXTERN,
        SITE : 'java.lang.String' (OPT) EXTERN,
        FILETYPE : 'java.lang.String' (OPT) EXTERN,
        LANGUAGE : 'java.lang.String'
    )
The syntax of the wrapper modification statement is similar and is shown below.
ALTER WRAPPER GS <name:identifier>
    DATASOURCENAME = <name:identifier>
    SITECOLLECTIONS ( <literal> [, <literal> ]* )
    [ CLIENT = <literal> ]
    [ LANGUAGES ( <literal> [, <literal> ]* ) ]
    [ NUMKEYMATCH = <integer> ]
    [ OUTPUTSCHEMA ( <field> [, <field> ]* ) ]

<field> ::= (see the syntax of the CREATE WRAPPER GS statement)
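As an illustration of the modification syntax, the sketch below changes the minimum priority of the wrapper acme_com created earlier. The NUMKEYMATCH value is a hypothetical choice. Because the grammar above shows DATASOURCENAME and SITECOLLECTIONS without optional brackets, the sketch repeats them with their original values. That reading of the grammar is an assumption, not behavior confirmed elsewhere in this section.

```vql
# Hypothetical modification: return only pages with priority 4 or higher.
# DATASOURCENAME and SITECOLLECTIONS are repeated because the grammar
# does not mark them as optional.
ALTER WRAPPER GS acme_com
    DATASOURCENAME = acme_com
    SITECOLLECTIONS ( 'Acme_com' )
    NUMKEYMATCH = 4
```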