DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE¶
Description
The stored procedure DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE
scans the distributed storage, starting from a folder given by the user, and for each Parquet dataset it finds, in a folder or subfolder, creates the corresponding base view in Denodo VDP. It also creates the corresponding Hive table in the Denodo Embedded MPP.
Syntax
DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE (
dir_to_explore : varchar
, duplicate_strategy : varchar
, sample_size : numeric
, read_from : varchar
, included_pattern : varchar
, baseview_database : varchar
, baseview_folder : varchar
, baseview_prefix : varchar
, embedded_mpp_schema : varchar
, analyze_tables : boolean
, embedded_database : varchar
, embedded_name : varchar
)
dir_to_explore
: directory of the distributed storage path (s3a
,wasb
,abfs
), to start the metadata discovery process.duplicate_strategy
(optional): chooses what to do with the metadata that already exists in the Denodo Embedded MPP and in Denodo VDP. Possible values areNEVER_REPLACE
,REPLACE_IF_CHANGES
,ALWAYS_REPLACE
. If no value is provided for the parameter,NEVER_REPLACE
is selected.NEVER_REPLACE
: it will not replace the Hive tables in the Embedded MPP and base views at Denodo VDP if they already exist.REPLACE_IF_CHANGES
: it will replace the Hive tables in the Embedded MPP, and their corresponding base views in VDP, if any kind of change is detected in the existing Hive tables: new columns, deleted columns, different types of columns, etc. This may affect the existing base views in Denodo that have been modified by the user, such as renamed fields, modifications in field type, cache configuration… since the base views will be replaced, and this kind of changes will be lost. Our recommendation is to always leave Denodo base views as they are and make changes only in views that derive from these base views.ALWAYS_REPLACE
: it will always replace the Hive tables in the Embedded MPP, and their corresponding base views in VDP if they already exist.
sample_size
(optional): establishes the maximum number of data files that will be analyzed to handle Schema Evolution. Default value:1
.read_from
(optional): allows to analyze the files from the desired timestamp (in UTC). Source files will be analyzed from the specified timestamp from older to newest until reaching the file limit set by the parametersample_size
. If no value is supplied to this parameter, source data files will be processed from newest to oldest until reaching the mentioned file limit.included_pattern
(optional): a regular expression that will select only certain files or folders in the scanning process. For example:(.*)invoices(.*)
. Default value:null
.baseview_database
: all base views will be created in this database.baseview_folder
: all base views created by the stored procedure will be placed in this folder when this parameter is specified. The folder will be created by the stored procedure if necessary.baseview_prefix
(optional): all base views created by the stored procedure will be prefixed with this value when this parameter is specified. Otherwise, the base view created will not have prefix.embedded_mpp_schema
: schema of the Embedded MPP where Denodo should create the Hive table associated to each base view. Take into account the default schema of the Embedded MPP is calleddefault
and you can create new schemas using the stored procedure CREATE_SCHEMA_ON_SOURCE.analyze_tables
(optional): whether to gather statistical information about the data to allow cost-based optimizations. Consider this step can be time consuming. If this option is not selected we recommend that you collect the statistics afterwards following the instructions in section Gathering the Statistics of Views. Default value:false
.embedded_database
(optional): the database where the Embedded MPP data source is located. Default value:admin_denodo_mpp
.embedded_name
(optional): the Embedded MPP data source name. Default value:embedded_mpp
.
The procedure returns these fields:
route_explored
: absolute path of the data in the object storage.catalog_name
: catalog of the Embedded MPP to use for the creation of the Hive table associated to the base view.base_view_name
: name of the base view in VDP.vql
: VQLs used to create the view.already_exists
:true
if there exists a view on the same path,false
otherwise.status
: a numeric code summarizing the result of this route analysis. Possible values:0
: the view did not exist and it was created successfully.1
: the view already existed and it was replaced because the replace strategy isALWAYS_REPLACE
.2
: the view already existed and it was replaced because the schema was different and the replace strategy isREPLACE_IF_CHANGES
.3
: the view was not created because it already existed in the same path and the replace strategy isNEVER_REPLACE
.4
: the view was not created because it already existed in the same path with the same schema and the replace strategy isREPLACE_IF_CHANGES
.-1
: there was an error creating the view.-2
: the view was created successfully, but it was not possible to gather the statistics.
details
: message explaining the status.
Privileges Required
The user that executes this procedure must have at least CREATE VIEW
on the selected database and must also have WRITE
and EXECUTE
privileges on
the Embedded MPP data source. If the folder specified does not exist, the user must also have CREATE FOLDER
on the selected database.
Example
select * from discover_object_storage_mpp_procedure()
where
dir_to_explore = 's3a://bucket/dir/' and
duplicate_strategy = 'ALWAYS_REPLACE' and
sample_size = 1 and
read_from = null and
included_pattern = null and
baseview_database = 'discover_object_storage_mpp_s3_db' and
baseview_folder = '/example' and
baseview_prefix = 'bv_' and
embedded_mpp_schema = 'default' and
analyze_tables = false and
embedded_database = 'discover_object_storage_mpp_s3_db' and
embedded_name = 'embedded_mpp_s3';
This will create base views for all the containing tables in s3a://bucket/dir/
. These base views will be created in /example
folder inside discover_object_storage_mpp_s3_db
database and with bv_
prefix. If the base view already exists,
it will be replaced. These views will be created on default
schema on embedded_mpp_s3
Embedded MPP data source.