
Goal

This document describes how to update the statistics of one or more views using the GET_STATS_FOR_FIELDS stored procedure and the Denodo Scheduler, in order to automate the refresh of statistics.

Content

Denodo provides a stored procedure, "GET_STATS_FOR_FIELDS", to update the statistics of a view. If the input_fields parameter is NULL, it updates the statistics for all the fields of the view; otherwise, it updates the statistics only for the fields specified in that parameter.

The syntax of the stored procedure is as follows:

  1. GET_STATS_FOR_FIELDS (input_mode: text, input_database_name: text, input_view_name: text, input_fields: array, input_table_stats: boolean, input_no_delegate: boolean, input_return_queries: boolean, input_save: boolean)
  • Input mode: sets how the procedure gathers the statistics. Valid values:
    • SMART_ONLY: the procedure gathers the statistics from the system tables of the database.
    • SMART_THEN_ATSOURCE_THROUGH_VDP: the procedure gathers the statistics from the system tables of the database. Then, it executes a SELECT statement to gather the statistics that it cannot obtain from the system tables of the database.
    • ATSOURCE_THROUGH_VDP_ONLY: the procedure does not query the system tables. Instead, it just executes a SELECT statement.
    • SMART_CACHE_ONLY: the procedure gathers the statistics from the system tables of the database used by VDP to cache the view.
    • SMART_CACHE_THEN_ATCACHE_THROUGH_VDP: the procedure gathers the statistics from the system tables of the database used by VDP to cache the view. Then, it executes a SELECT statement to gather the statistics that it cannot obtain from the system tables of the cache database. These aggregation queries will be executed by the database used by VDP to cache the view.
    • ATCACHE_THROUGH_VDP_ONLY: the procedure does not query the system tables. Instead, it just executes a SELECT statement. These aggregation queries will be executed by the database used by VDP to cache the view.
  • Input database name: name of the database to which input_view_name belongs. If NULL, the procedure looks for input_view_name in the current database.
  • Input view name: the procedure will gather the statistics of this view (with the same case).
  • Input fields: array of registers with the fields whose statistics you want to gather. Each register has these fields:
    • Field_name: name of the field.
    • Stats: array of statistics that you want to collect: FIELD_COUNT_DISTINCT (distinct values), FIELD_AVG_LEN (average size), FIELD_MAX (max value), FIELD_MIN (min value) and FIELD_NUM_NULLS (null values).
    • Source_stat_name: name of the object in the database that holds the statistics of this view. Some sources, like Microsoft SQL Server, allow the user to generate multiple statistics over the same field or set of fields and store them in different objects. If NULL, it obtains the statistics from the default object. This value is ignored when it is not applicable.
  • Input table stats: if true, the procedure obtains the number of rows of the table. If false, it does not.
  • Input no delegate: if true, the procedure does not push down to the source any aggregation function when executing SELECT statements. If false, it does.
  • Input return queries: if true, the procedure will also return any statement or query that was executed to obtain the statistics of the table. If false, it will not.
  • Input save: if true, the procedure stores the gathered statistics of the table instead of returning them. If false, the stored procedure returns the statistics but they are not persisted.
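
For instance, the following call (using a hypothetical ‘test_database’ database and ‘bv_local_user’ view, the same names used in the examples below) gathers statistics for all the fields, also obtains the number of rows, returns the queries that were executed, and does not persist the results:

CALL GET_STATS_FOR_FIELDS('SMART_THEN_ATSOURCE_THROUGH_VDP','test_database','bv_local_user',NULL,TRUE,FALSE,TRUE,FALSE)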

In order to update the statistics for a view, the statistics option for that view must be enabled first. To do so, follow these steps from the Virtual DataPort Design Studio:

  1. Open the view where statistics gathering will be enabled.
  2. Click on the “Options” button.
  3. Go to the “Statistics” tab.
  4. Check the “Enable statistics” box.
  5. Choose the fields for which the statistics will be generated by selecting or deselecting them in the statistics table shown. (This configuration does not take effect when you use the GET_STATS_FOR_FIELDS stored procedure, where you specify the fields to be updated.)
  6. Click on the Save button to confirm the new configuration.

Once the statistics have been enabled, it is possible to gather the statistics manually by using the “Gather statistics for selected fields” option available on the same screen, or by running the GET_STATS_FOR_FIELDS stored procedure from a VQL Shell.

Let’s see some examples of how to gather the statistics from the VQL Shell:

  • Gathering statistics for a view called ‘bv_local_user’: 

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','test_database','bv_local_user',NULL,FALSE,FALSE,FALSE,TRUE)

  • Gathering statistics only for the fields ‘name’ and ‘code’ of the ‘bv_local_user’ view:

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','test_database','bv_local_user',{ROW('name',NULL,NULL), ROW('code',NULL,NULL)},FALSE,FALSE,FALSE,TRUE)
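
  • Gathering statistics for the ‘name’ field when the source keeps several statistics objects for the same field (see the Source_stat_name register field described above). The third element of the register references the statistics object; the object name ‘stats_user_name’ used here is only illustrative:

CALL GET_STATS_FOR_FIELDS('SMART_ONLY','test_database','bv_local_user',{ROW('name',NULL,'stats_user_name')},FALSE,FALSE,FALSE,TRUE)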

To make this process automatic, a Denodo Scheduler job can be created in a couple of minutes. Below we describe two options for automating the statistics update of a view or a list of views, depending on whether you want to update the statistics for all the fields of a view or only for a set of fields.

Updating the statistics for all the fields in a view

  1. Create or select a project to include the new job.
  2. Create a new “VDP” job:
    1. Click on the “Jobs” tab.
    2. Select the “Add job > VDP” option.
  3. Set a name and a description for the job.
  4. Go to the “Extraction section” tab.
  5. Select the VDP data source that points to the virtual database where the views to be updated are created.
  6. Use the following parameterized query; it includes a parameter for the name of the views that will be updated:

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<VIRTUAL_DATABASE>','@VIEW_NAME',NULL,FALSE,FALSE,FALSE,TRUE)

  7. Create a list with the values that will be applied to the parameter with the names of the views. To do so, click on “NEW SOURCE” and select “list”. Select VIEW_NAME as “Query parameter”, “,” as Separator and enter the names of the views to update, separated by commas, as Values.

When the job runs, it will replace the parameter with each of the values included in the list and update the statistics of each view.
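
For example, if the list of values were “bv_local_user,bv_local_address” (two hypothetical view names), the job would issue one call per value, equivalent to:

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<VIRTUAL_DATABASE>','bv_local_user',NULL,FALSE,FALSE,FALSE,TRUE)

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<VIRTUAL_DATABASE>','bv_local_address',NULL,FALSE,FALSE,FALSE,TRUE)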

  8. Optionally, configure the “Retry section” to retry the execution of the job in case of error. For instance, enable retries for all queries if any error is returned, limited to 3 retries and with a delay of 100 milliseconds between executions.

  9. Optionally, configure the “Handlers section”. For instance, a mail handler can be added and configured to be executed only on the last retry, adding a recipient and sending the email only in case of error.

Note that the general Mail Configuration has to be set in order to use the mail handler. To do that, go to the “Administration > Server Settings > Mail Settings” menu.

  10. Configure the “Triggers section” to set the execution times of the job. A cron expression can be easily added in a graphical way to specify the periodicity of the job execution. For instance, we can use the cron expression “0 0 9 ? * 7” for a weekly update every Saturday at 9 AM.

  11. Click on the “Save” button to save the job.
  12. Finally, the job can be manually tested by launching it from the “Job Details” page, clicking on the “More Options” menu icon that appears when hovering over the “Processed (Tuples/Errors)” column, or from the “Jobs Management” panel by clicking on the “Start” option.

Update the statistics for a set of fields

  1. Create or select a project to include the new job.
  2. Create a new “VDP” job:
    1. Click on the “Jobs” tab.
    2. Select the “Add job > VDP” option.
  3. Set a name and a description for the job.
  4. Go to the “Extraction section” tab.
  5. Select the VDP data source that points to the virtual database where the views to be updated are created.
  6. Set the parameterized query to the following; it includes a parameter for the name of the view and a parameter to specify the names of the fields that will be updated in that view:

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<VIRTUAL_DATABASE>','@VIEW_NAME', @FIELDS , FALSE,FALSE,FALSE,TRUE)

Note that the input parameter FIELDS must not be surrounded with the ' character.

  7. Create a CSV file specifying, for each row, a view name and an array with the field names that will be updated for that view. For example:

VIEW_NAME;FIELDS

view_A;{ROW('field_A1',NULL,NULL),ROW('field_A2',NULL,NULL)}

view_B;{ROW('field_B1',NULL,NULL)}

Note that we are using the ‘;’ character in order to separate the field values for each row in this case.

  8. Create a delimited file data source in VDP for the previously created CSV file. Specify the ‘;’ character as the Column delimiter and check the Header option.

  9. Create a base view from the previous data source.

Note that the type for both fields will be ‘text’.

  10. Create a VDP data source in Scheduler for the parameterized query, in order to retrieve the values of the VIEW_NAME and FIELDS parameters. To do so, click on “NEW SOURCE” and select “vdp”. Select the VDP data source that points to the virtual database in which the previous base view was created (it will be used as the source) and set the following sentence as the “Query” parameter:

SELECT view_name, fields FROM bv_input_parameters

When the job runs, it will replace the parameters of the parameterized query with the values of the fields of each returned row. It will execute as many calls to the stored procedure as rows fetched.
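
With the sample CSV file shown above, the job would therefore issue two calls to the stored procedure, equivalent to:

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<VIRTUAL_DATABASE>','view_A',{ROW('field_A1',NULL,NULL),ROW('field_A2',NULL,NULL)},FALSE,FALSE,FALSE,TRUE)

CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<VIRTUAL_DATABASE>','view_B',{ROW('field_B1',NULL,NULL)},FALSE,FALSE,FALSE,TRUE)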

  11. Optionally, configure the “Retry section” to retry the execution of the job in case of error. For instance, enable retries for all queries if any error is returned, limited to 3 retries and with a delay of 100 milliseconds between executions.

  12. Optionally, configure the “Handlers section”. For instance, a mail handler can be added and configured to be executed only on the last retry, adding a recipient and sending the email only in case of error.

Note that the general Mail Configuration has to be set in order to use the mail handler. To do that go to the “Administration > Server Settings > Mail Settings” menu.

  13. Configure the “Triggers section” to set the execution times of the job. A cron expression can be easily added in a graphical way to specify the periodicity of the job execution. For instance, we can use the cron expression “0 0 9 ? * 7” for a weekly update every Saturday at 9 AM.

  14. Click on the “Save” button to save the job.
  15. Finally, the job can be manually tested by launching it from the “Job Details” page, clicking on the menu icon that appears in the “Processed (Tuples/Errors)” column, or from the “Jobs Management” panel by clicking on the “Start” option.

Automatically updating statistics for all base views in a database

Denodo Scheduler and the GET_VIEWS() function can be used to automatically generate statistics for a set of views in a database.

The Denodo Cost-Based optimizer requires statistics for all base views and for all derived views which include flatten operations or new derived attributes. However, deciding which views need to have their statistics updated automatically should be based on the amount of modification performed on the dataset. For example, a dataset containing customer transactions may require a periodic update, but a table containing countries may not. You can check the Estimating Post-Processing Costs section of the VDP Administration Guide for more information. 

In this example we will show how to collect and update the statistics for all base views in a particular database. We will call the target database “<STAT_DATABASE>” in this tutorial.

Note: The process of gathering statistics can take a long time for large views. It can also put a significant workload on both Virtual DataPort and the data sources. Keep in mind that you may need to apply filters to the catalog query shown in this example, in order to reduce the number of views processed and avoid updating statistics for views that do not need it.

To create a job that updates the statistics for all base views:

  1. Add a data source in Scheduler for the “<STAT_DATABASE>” database. This can be done by clicking on the “Data Sources” tab, clicking “Add Data Source”, and selecting “VDP” from the drop-down menu.

  2. Configure the “Data source details” page to connect to the desired database. Make sure that the connection URI points to the desired database; otherwise, Scheduler will not be able to find the correct views.

  3. Create a new “VDP” job, giving it a name and a description. In the “Extraction Section” tab, select “<STAT_DATABASE>” as your data source and enter the parameterized query:

        CALL GET_STATS_FOR_FIELDS('ATSOURCE_THROUGH_VDP_ONLY','<STAT_DATABASE>','@VIEW_NAME',NULL,FALSE,FALSE,FALSE,TRUE)

  4. You now need to add a source for Scheduler to fill the parameter “@VIEW_NAME”. Click “New Source” and select “VDP”. Select “<STAT_DATABASE>” as the data source.

  5. In the “Query (non-parameterized)” section, enter the query:

SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = '<STAT_DATABASE>' AND view_type = 0)

This will select all the base views in “<STAT_DATABASE>”; aliasing the “name” column as “VIEW_NAME” will automatically fill the “@VIEW_NAME” parameter. For more information about the “GET_VIEWS()” function, see the GET_VIEWS() documentation in the Virtual DataPort VQL Guide.

Note: To select only base views that need statistics gathered, you can add filter statements to this query; for example, you could create a folder with the base views for which you want to have updated statistics and add a filter statement to the query to only update views in that folder:

        SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = '<STAT_DATABASE>' AND view_type = 0) AND folder = '<FOLDER_NAME>'
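
Note: If your base views follow a naming convention, another option is to filter by name; the ‘bv_’ prefix below is only an illustrative assumption:

        SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = '<STAT_DATABASE>' AND view_type = 0) AND name LIKE 'bv_%'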

  6. Add a trigger in the “Triggers section”. Scheduler creates time-based triggers using CRON expressions. More information about CRON expressions can be found in the Time-based Job Scheduling section of the Scheduler Administration Guide.

    Note: Statistics for views are based on average values. Therefore, they do not need to be updated every time there is a change in a view. Only update statistics when the change in the dataset may imply a change in the values of the statistics. It is recommended to perform this task sparingly during non-peak hours.

    As an example, we will set up the Triggers section of the Scheduler job to run the job once every month. In the Triggers section, you can either manually enter the CRON expression or use the CRON fields to build it with user-friendly inputs. In our case, we use the CRON expression “0 0 3 ? * 1L” to run the job at 3 AM on the last Sunday of each month.
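
    For reference, a few other Quartz CRON expressions that could be used instead, depending on your maintenance window (the schedules are illustrative):

        0 0 2 * * ?     (every day at 2 AM)
        0 0 3 1 * ?     (at 3 AM on the first day of every month)
        0 0 9 ? * 7     (every Saturday at 9 AM, as used in the previous sections)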

        

        The Trigger section after saving should look like this:

  7. Click on the “Save” button and test the job by clicking the “Start” option.

References

GET_VIEWS

GET_STATS_FOR_FIELDS

Time-based Job Scheduling

VDP Data Sources

Mail configuration

Disclaimer
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.
