How to update the statistics of a view automatically

Applies to: Denodo 8.0 , Denodo 7.0 , Denodo 6.0
Last modified on: 20 May 2020
Tags: Scheduler jobs Stored procedures Cost optimization

Download document

You can translate the document:

Goal

This document describes how to update the statistics of a view or multiple views using the stored procedures, GENERATE_STATS and GENERATE_STATS_FOR_FIELDS, and the Denodo Scheduler in order to automate the refresh of statistics.

Content

Denodo provides two stored procedures, "GENERATE_STATS" and "GENERATE_STATS_FOR_FIELDS" to update the statistics of a view. The first one, updates the statistics for all the fields in the view, while the second one updates the statistics for the fields that are specified as input parameters.

The syntax of each stored procedure is as follows:

  1. GENERATE_STATS (view name: text)
  • view name: the procedure will gather the statistics of the view with this name (with the same case).

  1. GENERATE_STATS_FOR_FIELDS (view name: text, fields list: array)
  • view name: the procedure will gather the statistics of the view with this name (with the same case).
  • fields list: array of field names that the procedure will obtain its statistics for. When you specify only some fields of the view, the procedure does not delete the statistics of the other fields of the view that have already been gathered. If NULL, the procedure gathers the statistics of all the fields of the view. If an empty array (i.e {}), the procedure gathers the number of rows of the view, but does not obtain the statistics of any field.

In order to update the statistics for a view the statistics option for that view must be enabled first. To do so follow these steps from the Virtual DataPort Administration Tool.

  1. Open the view where statistics gathering will be enabled.
  2. Click on the “Options” button.
  3. Go to the “Statistics” tab.
  4. Check the “Enable statistics” box.
  5. Choose the fields for which the statistics will be generated selecting or deselecting them from the statistics table shown. (this configuration does not take effect when you use the GENERATE_STATS_FOR_FIELDS stored procedure, where you specify the fields to be updated)
  6. Click on the Save button to confirm the new configuration.

Once the statistics have been enabled, it is possible to manually gather the statistics by using the “Gather statistics for selected fields” option available in this same screen or running the GENERATE_STATS and GENERATE_STATS_FOR_FIELDS stored procedure from a VQL shell.

Let’s see some examples of how to gather the statistics from the VQL Shell:

  • Gathering statistics for a view called ‘bv_local_user’: 

call generate_stats('bv_local_user')

  • Gathering statistics only for the fields ‘name’ and ‘code’ of the ‘bv_local_user’ view:

call generate_stats_for_fields('bv_local_user', {'name','code'})

To make this process automatic a Denodo Scheduler job can be created in a couple of minutes for any of the previous stored procedures described. So, we are going to see two options in order to automate the statistics updating of a view or a list of views depending on if you want to update the statistics for all the fields of a view or only for a set of fields.

Updating the statistics for all the fields in a view.

  1. Create or select a project to include the new job.
  2. Create a new “VDP” job.
  1. Click on the “Jobs” tab.
  2. Select the “NEW JOB > VDP” option.

  1. Set a name and a description for the job.

  1. Go to the “Extraction section” tab.
  2. Select the VDP data source that points to the virtual database where the views to be updated are created.
  3. Use the following parameterized query; it includes a parameter for the name of the views that will be updated:

CALL GENERATE_STATS(‘@VIEW_NAME’)

  1. Create a list with the values that will be applied to the parameter with the names of the views. To do so, click on “NEW SOURCE” and select “list”. Select VIEW_NAME as “Query parameter”, “,” as Separator and enter the name of the views to update separated by commas as Values.

When the job runs it will replace the parameter with each one of the values included in the list and it will update the statistics for each view.

  1. Optionally, configure the “Retry section” to retry the execution of the job in case of error. For instance, enable retries for all queries if any error is returned, limited to 3 retries and with a delay of 100 milliseconds between executions.

  1. Optionally, configure the “Handlers section”. For instance, a mail handler can be added and  configured to be executed and configured only on the last retry, adding a recipient and sending the email only in case of error.

Note that the general Mail Configuration has to be set in order to use the mail handler. To do that go to the Administration > Server Settings > Mail Settingsmenu.

  1. Configure the “Triggers section” to set the execution times of the job. A cron expression can be easily added in a graphical way to specify the periodicity of the job execution. For instance, we can use the cron expression “0 0 9 ? * 7” for a weekly update every Saturday at 9AM.

  1. Click on the “Save” button in order to save the job.
  2. Finally, the job can be manually tested by launching it from the Job Details” page by clicking on the “More Options” menu icon () that appears when hovering over the “Processed (Tuples/Errors)” column as it is shown in the picture below, or in the “Jobs Management” panel clicking on the “Start” option.

Update the statistics for a set of fields

  1. Create or select a project to include the new job.
  2. Create a new “VDP” job.
  1. Click on the “Jobs” tab.
  2. Select the “NEW JOB > VDP” option.

  1. Set a name and a description for the job.

  1. Go to the “Extraction section” tab.
  2. Select the VDP data source that points to the virtual database where the views to be updated are created.
  3. Set the parameterized query to be the following; it includes a parameter for the name of the view and a parameter to specify the names of the fields that will be updated in that view:

CALL GENERATE_STATS_FOR_FIELDS(‘@VIEW_NAME’, @FIELDS)

Note that you do not have to surround the input parameter FIELDS with the  character.

  1. Create a CSV file specifying for each row the values of a view name and an array with the field names that will be updated for that view. E.g:

VIEW_NAME;FIELDS

view_A;{ROW("field_A1"),ROW("field_A2")}

view_B;{ROW("field_B1")}

Note that you have to use the ‘;’ character in order to separate the field values for each row.

  1. Create a delimited file data source in VDP for the previously created csv file. You have to specify the ‘;’ character as the Column delimiter and check the Header option as you can see in the following image.

  1. Create a base view  from the previous data source.

Note that the type for both fields will be ‘text’.

  1. Create a VDP data source in Scheduler for the parameterized query in order to retrieve the values for the VIEW_NAME and FIELDS parameters. To do so, click on “NEW SOURCE” and select “vdp”. Select the corresponding VDP data source that points to the virtual database in which the previous base view was created and we are going to use as source. Set as “Query” parameter the following sentence:

SELECT view_name, fields FROM bv_input_parameters

When the job runs it will replace the parameters for the parameterized query with the values of the fields for each returned row.It will execute as many calls to the store procedure as fetched rows.

  1. Optionally, configure the “Retry section” to retry the execution of the job in case of error. For instance, enable retries for all queries if any error is returned, limited to 3 retries and with a delay of 100 milliseconds between executions.

  1. Optionally, configure the “Handlers section”. For instance, a mail handler can be added and configured to be executed and configured only on the last retry, adding a recipient and sending the email only in case of error.

Note that the general Mail Configuration has to be set in order to use the mail handler. To do that go to the “Administration > Server Settings > Mail Settings” menu.

  1. Configure the “Triggers section” to set the execution times of the job. A cron expression can be easily added in a graphical way to specify the periodicity of the job execution. For instance, we can use the cron expression “0 0 9 ? * 7” for a weekly update every Saturday at 9AM.

  1. Click on the “Save” button in order to save the job.
  2. Finally, the job can be manually tested by launching it from the Job Details” page by clicking on the menu icon that appears in the “Processed (Tuples/Errors)” column as it is shown in the picture below, or in the “Jobs Management” panel clicking on the “Start” option.

Automatically updating statistics for all base views in a database.

Denodo Scheduler and the GET_VIEWS() function can be used to automatically generate statistics for a set of views in a database.

The Denodo Cost-Based optimizer requires statistics for all base views and for all derived views which include flatten operations or new derived attributes. However, deciding which views need to have their statistics updated automatically should be based on the amount of modification performed on the dataset. For example, a dataset containing customer transactions may require a periodic update, but a table containing countries may not. You can check the Estimating Post-Processing Costs section of the VDP Administration Guide for more information. 

In this example we will show how to collect and update the statistics for all base views in a particular database. We will call the target database “<STAT_DATABASE>” in this tutorial.

Note: The process of gathering statistics can take a long time for large views. It can also induce significant workload for both Virtual DataPort and the data sources. You should keep in mind that you may need to apply some filters on the catalog query shown in this example; this is in order to reduce the number of views in the process and avoid updating statistics for views that do not need to be updated.

To create a job that updates the statistics for all base views:

  1. Add a data source in Scheduler for the “<STAT_DATABASE>” database. This can be done by clicking on the “Data Sources” tab, clicking “Add Data Source”, and selecting “VDP” from the drop down menu.

  1. Configure the “Data source details” page to connect to the desired database. Make sure that the connection URI is configured to connect to the desired database or else the scheduler will not be able to find the correct views.

  1. Create a new “VDP” job, giving it a name and description. In the “Extraction Section” tab select the “<STAT_DATABASE>” as your data source and enter the parameterized query:

        CALL GENERATE_STATS(‘@VIEW_NAME’)

  1. You now need to add a source for Scheduler to fill the variable parameter “@VIEW_NAME”. Click “New Source” and select “VDP”. Select “<STAT_DATABASE>” as the data source.

  1. In the “Query (non-parameterized)” section enter the query:

SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = '<STAT_DATABASE>' AND view_type = 0)

This will select all views in the “<STAT_DATABASE>” which are base views; the alias of the “name” column to “VIEW_NAME” will automatically fill the “@VIEW_NAME” parameter. For more information about the “GET_VIEWS()” function see the GET_VIEWS() documentation in the Virtual DataPort VQL Guide.

Note: To select only base views that need statistics gathered, you can add filter statements to this query; for example, you could create a folder with the base views for which you want to have updated statistics and add a filter statement to the query to only update views in that folder:

        SELECT name AS VIEW_NAME FROM GET_VIEWS() WHERE (database_name = ‘<STAT_DATABASE>’ AND view_type = 0) AND folder = ‘<FOLDER_NAME>’

  1. Add a trigger in the Triggers section. Scheduler creates time based triggers using CRON expressions. More information about CRON expressions can be found in the Time-based Job Scheduling Section of the Scheduler Administration Guide.

    Note: Statistics for views are based on average values. Therefore, they do not need to be updated every time there is a change in a view. Only update statistics when the change in the dataset may imply a change in the values of the statistics. It is recommended to perform this task sparingly during non-peak hours.

    As an example, we will set up the Trigger section of the Scheduler job to run the job once every month. In the Trigger section, you can either manually input the CRON expression or use the CRON section to create an expression with user friendly input fields. For our case, we use the CRON expression
    0 0 3 ? * 1L to have our job to run at 3 am on the last Sunday of the month.

        

        The Trigger section after saving should look like this:

  1. Click on the “Save” button and test the job by clicking the “Start” option.

References

Virtual DataPort VQL Guide: GET_VIEWS

Advanced VQL Guide: GENERATE_STATS

Advanced VQL Guide: GENERATE_STATS_FOR_FIELDS

Scheduler Administration Guide: Time-based Job Scheduling

Scheduler Administration Guide: VDP Data Sources

Scheduler Administration Guide: Mail configuration

Questions

Ask a question
You must sign in to ask a question. If you do not have an account, you can register here

Featured content

DENODO TRAINING

Ready for more? Great! We offer a comprehensive set of training courses, taught by our technical instructors in small, private groups for getting a full, in-depth guided training in the usage of the Denodo Platform. Check out our training courses.

Training