Applies to:
Denodo 8.0
,
Denodo 7.0
,
Denodo 6.0
Last modified on: 25 Nov 2020
Tags:
Cost optimization
Scheduler jobs
Stored procedures
Goal
This document describes how to update the statistics of a view or multiple views using the stored procedures, GENERATE_STATS and GENERATE_STATS_FOR_FIELDS, and the Denodo Scheduler in order to automate the refresh of statistics.
Content
Denodo provides two stored procedures, "GENERATE_STATS" and "GENERATE_STATS_FOR_FIELDS" to update the statistics of a view. The first one, updates the statistics for all the fields in the view, while the second one updates the statistics for the fields that are specified as input parameters.
The syntax of each stored procedure is as follows:
Note: For JDBC base views and only when the statistics are gathered at the source database you could also use the stored procedure GENERATE_SMART_STATS_FOR_FIELDS. In this scenario, this stored procedure will be faster than the stored procedures used in this document.
In order to update the statistics for a view, the statistics option for that view must be enabled first. To do so follow these steps from the Virtual DataPort Administration Tool.
Once the statistics have been enabled, it is possible to manually gather the statistics by using the “Gather statistics for selected fields” option available in this same screen or running the GENERATE_STATS and GENERATE_STATS_FOR_FIELDS stored procedure from a VQL shell.
Let’s see some examples of how to gather the statistics from the VQL Shell:
call generate_stats('bv_local_user')
call generate_stats_for_fields('bv_local_user', {'name','code'})
To make this process automatic a Denodo Scheduler job can be created in a couple of minutes for any of the previous stored procedures described. So, we are going to see two options in order to automate the statistics updating of a view or a list of views depending on if you want to update the statistics for all the fields of a view or only for a set of fields.
Updating the statistics for all the fields in a view.
CALL GENERATE_STATS(‘@VIEW_NAME’)
When the job runs it will replace the parameter with each one of the values included in the list and it will update the statistics for each view.
Note that the general Mail Configuration has to be set in order to use the mail handler. To do that go to the “Administration > Server Settings > Mail Settings” menu.
Update the statistics for a set of fields
CALL GENERATE_STATS_FOR_FIELDS(‘@VIEW_NAME’, @FIELDS)
Note that you do not have to surround the input parameter FIELDS with the ‘ character.
VIEW_NAME;FIELDS
view_A;{ROW("field_A1"),ROW("field_A2")}
view_B;{ROW("field_B1")}
Note that you have to use the ‘;’ character in order to separate the field values for each row.
Note that the type for both fields will be ‘text’.
SELECT view_name, fields FROM bv_input_parameters
When the job runs it will replace the parameters for the parameterized query with the values of the fields for each returned row.It will execute as many calls to the store procedure as fetched rows.
Note that the general Mail Configuration has to be set in order to use the mail handler. To do that go to the “Administration > Server Settings > Mail Settings” menu.
Automatically updating statistics for all base views in a database.
Denodo Scheduler and the GET_VIEWS() function can be used to automatically generate statistics for a set of views in a database.
The Denodo Cost-Based optimizer requires statistics for all base views and for all derived views which include flatten operations or new derived attributes. However, deciding which views need to have their statistics updated automatically should be based on the amount of modification performed on the dataset. For example, a dataset containing customer transactions may require a periodic update, but a table containing countries may not. You can check the Estimating Post-Processing Costs section of the VDP Administration Guide for more information.
In this example we will show how to collect and update the statistics for all base views in a particular database. We will call the target database “<STAT_DATABASE>” in this tutorial.
Note: The process of gathering statistics can take a long time for large views. It can also induce significant workload for both Virtual DataPort and the data sources. You should keep in mind that you may need to apply some filters on the catalog query shown in this example; this is in order to reduce the number of views in the process and avoid updating statistics for views that do not need to be updated.
To create a job that updates the statistics for all base views:
CALL GENERATE_STATS(‘@VIEW_NAME’)
SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = '<STAT_DATABASE>' AND view_type = 0)
This will select all views in the “<STAT_DATABASE>” which are base views; the alias of the “name” column to “VIEW_NAME” will automatically fill the “@VIEW_NAME” parameter. For more information about the “GET_VIEWS()” function see the GET_VIEWS() documentation in the Virtual DataPort VQL Guide.
Note: To select only base views that need statistics gathered, you can add filter statements to this query; for example, you could create a folder with the base views for which you want to have updated statistics and add a filter statement to the query to only update views in that folder:
SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = ‘<STAT_DATABASE>’ AND view_type = 0) AND folder = ‘<FOLDER_NAME>’
The Trigger section after saving should look like this:
References
Virtual DataPort VQL Guide: GET_VIEWS
Advanced VQL Guide: GENERATE_STATS
Advanced VQL Guide: GENERATE_STATS_FOR_FIELDS
Scheduler Administration Guide: Time-based Job Scheduling