Gathering the Statistics of Views

The cost-based optimization requires the statistics. To instruct Virtual DataPort to gather and store the statistics of a view, do the following:

  1. Open the view.

  2. Click Options and then, click the Statistics tab.

  3. Select Enable statistics.

  4. If you select Complete missing statistics executing SELECT queries, the Server, during the process of gathering the statistics of the view, will execute a SELECT statement to gather the statistics that it could not obtain from the system tables of the database.

    If cleared, the Server gathers the statistics from the system tables of the database and does not execute the SELECT statement.

    This option is only available for JDBC base views created over data sources with certain adapters. The section How the Gathering Statistics Process Works lists what statistics are obtained from the system tables of each database vendor.

  5. Do one of these:

    1. Click Gather statistics for selected fields to automatically gather and store the statistics of this view. Once the process finishes, the statistics will be stored in the metadata of the view.

      Important

      Before obtaining the statistics of a JDBC base view, make sure that the source database (Oracle, IBM DB2, etc.) has the statistics of this table. If the statistics are not present or are outdated, the statistics gathered will not be accurate and the execution plans selected by the cost-based optimizer could be suboptimal.

      Virtual DataPort only obtains the statistics of the selected fields (by default all the fields are selected).

      Usually you want to select all the fields to obtain the statistics of all of them at once, but this is not always possible:

      1. It is not possible to obtain the statistics of all the fields. If this is a JDBC base view and the check box “Complete missing statistics…” is selected, Virtual DataPort executes a SELECT query to gather the statistics it could not obtain from the database’s system tables.

        However, the database may return an error when executing this query. For example, some databases return an error when a query tries to obtain the average size of a blob field. In a case such as this, clear the check box of the affected fields to be able to obtain the statistics of the other fields.

      2. The database does not support executing the functions required to gather the statistics. In this case, select the check box Do not delegate the generation of the statistics. By doing this, Virtual DataPort retrieves all the data from the table (executes a SELECT * FROM view) and executes the aggregation functions locally. We do not recommend selecting this option, unless it is necessary.

      3. Obtaining the statistics of all the fields at once may put too much strain on the data source that holds the data and you want to avoid it. In this case, you may want to obtain the statistics of some of the fields but not all of them.

      Read the section Gathering Statistics Automatically to fully understand the process of gathering the statistics of a view.

    2. You can also provide the statistics of the view manually, but we recommend doing it automatically as described in the previous step. To manually provide the statistics of the view, enter the Number of rows and, in the table below, the required values for each field:

      1. The average size of its values.
      2. The maximum value and minimum value (not for values of type text)
      3. The number of distinct values.
      4. Number of NULL values.
Gathering the statistics of a view

Gathering the statistics of a view

The cost-based optimizer does not use the maximum and minimum values of the fields of type blob, boolean, date, text, xml, array or record. Therefore, they cannot be specified (in the table, the cells of these fields are disabled and have the label “N/A”).

Once the statistics of a view have been gathered or you have provided them manually, you can disable the statistics of the view by clearing the Enable statistics check box. By doing this, the statistics are not “lost” but the cost-based optimization will not use them and therefore, the Server will not apply this optimization in a query if the optimization needs the statistics of this view.


You can also gather the statistics of several views at once. To do this, click Manage statistics on the Tools menu.

Gathering the statistics of several views at once

Gathering the statistics of several views at once

In this dialog, select the views that you want to obtain its statistics and click Gather. To select all the views of a folder, select the folder.

For the views whose statistics already have been obtained (views with “On” or “Off” in the “Status” column of the table), their statistics can be enabled/disabled by selecting the views and clicking Enable or Disable.

When the statistics of a view are disabled (“Off” status) or have not been gathered (“N/A” status), the Server will not apply the cost-based optimization in a query, if it needs the statistics of one of these views.

To search a view or a folder, type its name in the box located at the top of the dialog.

The statistics of a view are not included in version control because they are environment-dependent.

Gathering Statistics Automatically

This section describes the process for automatically gathering the statistics of the views used by the cost-based optimization:

How the Gathering Statistics Process Works

This section explains how the Server gathers the statistics of a view.

Important

Before obtaining the statistics for a JDBC base view, make sure that the source database (Oracle, IBM DB2, etc.) has the statistics of this table. If the statistics are not present or are outdated, the statistics gathered will not be accurate and the execution plans selected by the cost-based optimizer could be suboptimal.

The process to gather the statistics of a view depends on the type of base view:

  • JDBC base views: the Server gathers the statistics of the view from the system tables of the database. The information obtained from these tables depends on the database adapter of the view’s data source. The reason is that some databases provide more statistics about their tables/views than others.

    The table Statistics that are gathered from the system tables of each vendor lists the statistics that the Server can gather from each database vendor.

    If “Complete missing statistics…” is selected, the Server executes an additional SELECT query to obtain the statistics it cannot obtain from the system tables.

  • Other views: Virtual DataPort executes a query like the one in Sample SQL query executed to gather statistics of a non-JDBC base view to gather the statistics of the view. If the view involves an ODBC source, part of this query may be pushed down to that source, thus making the process faster.

Sample SQL query executed to gather statistics of a non-JDBC base view
SELECT count(*),
    , avg( length( <text field> ) )
    , max( <text field> )
    , min( <text field> )
    , count( DISTINCT <text field> )
    , ...
    , max( <numeric field> )
    , min( <numeric field> )
    , count( distinct( <numeric field> ) )
    , count( <numeric field> )
    , ...
    FROM view

There are databases that do not support projecting several aggregation functions in a SELECT statement. In this case, select the check box “Do not delegate the generation of the statistics” before clicking the button “Gather statistics for selected fields”. By doing this, Virtual DataPort retrieves all the data from the table (executes a SELECT * FROM view) and executes the aggregation functions locally. Whenever possible, avoid selecting this option because the database is forced to send all the data stored in the table to Virtual DataPort. In addition, usually databases execute these functions faster.

When the data is obtained from other types of sources (e.g. delimited files, XML, etc.), Virtual DataPort retrieves all the rows from the source and computes the statistics by itself. In views where all the data is obtained from sources that are not databases, selecting the check box “Gather statistics for…” is equivalent to clearing it.

Note

If the statistics are not gathered from the system tables of a database, the process of gathering statistics can take a long time for large views. It can also induce significant workload in both Virtual DataPort and the data sources. Therefore, it is convenient to gather the statistics during periods where the system is not expected to be under heavy load (see the section Scheduling the Gathering of Statistics for details about how to schedule the statistics gathering process).

The following table lists which statistics are gathered from the system tables of each database vendor.

Statistics that are gathered from the system tables of each vendor
Database Adapter Number of Rows of the Table Average Size Min Value Max Value Distinct Values Null Values
Amazon RedShift X X     X X
Cloudera Impala X X     X X
Greenplum X X     X X
IBM DB2 (*) X X X X X X
Microsoft SQL Server X X     X  
Netezza X       X  
Oracle X X X X X X
PostgreSQL X X     X X
Teradata X       X  
Vertica         X  

(*): for data sources whose adapter is “DB2 for z/OS” and not “DB2”, the statistics are gathered issuing SELECT queries and not from the system tables.

For the vendors not listed in this table, the statistics are obtained issuing SELECT queries like the one in Sample SQL query executed to gather statistics of a non-JDBC base view.


The following table lists the tables, statements or functions Virtual DataPort queries or executes to obtain the statistics listed in the table above. They depend on the database from which Virtual DataPort obtains the statistics.

Tables, statements or functions executed by Virtual DataPort to obtain statistics
Database Adapter Elements Inspected to Obtain Statistics
Amazon Redshift System tables: PG_STATS, PG_CLASS and PG_NAMESPACE, PG_INHERITS.
Cloudera Impala Statements: SHOW COLUMN STATS and SHOW TABLE STATS.
Greenplum System tables: PG_STATS, PG_CLASS and PG_NAMESPACE, PG_INHERITS.
IBM DB2 (*) System tables: SYSSTAT.COLUMNS and SYSSTAT.TABLES.
Microsoft SQL Server System tables: SYS.OBJECTS and SYS.PARTITIONS.
Netezza System tables: _V_RELATION_COLUMN_XDB and _V_TABLE_XDB.
Oracle 9i System tables: ALL_TAB_COLUMNS and ALL_ALL_TABLES.
Oracle >= 9i System tables: ALL_TAB_COLUMNS and ALL_TAB_STATISTICS
PostgreSQL System tables: PG_STATS, PG_CLASS and PG_NAMESPACE, PG_INHERITS.
Teradata Statements: HELP STATISTICS
Vertica Functions: APPROXIMATE_COUNT_DISTINCT

The Server does not collect statistics for the fields of the following types:

  • blob
  • boolean
  • xml
  • array
  • register.
  • Any field whose source type property is one of the following:
    • BIT
    • CLOB
    • LONGVARCHAR
    • LONGNVARCHAR

For the fields of some of these types, it sets a value that is extrapolated from other statistics. For example, for blob, array and register fields, it assumes that each value is different. Therefore, the value of the statistic “Distinct values” is the number of rows of the view.

The cost-based optimizer does not use the maximum and minimum values of the fields of type blob, boolean, date, text, xml, array or record. Therefore, in the statistics table of the view, the cells of these fields are disabled and have the label “N/A”.

Scheduling the Gathering of Statistics

The stored procedure GENERATE_STATS gathers and stores the statistics of a view. Use Denodo Scheduler to schedule the gathering of the statistics of a view. The main objective of this is to gather them during periods where the system is not expected to be under heavy load.

See more about this procedure in the section GENERATE_STATS of the VQL Guide.

Limitations of the Gathering Statistics Process

There are several limitations in the process of gathering statistics. When these limitations apply, statistics can be specified manually for the affected fields:

  • In some databases, the statistics are not obtained for fields of some database data types, even if in Virtual DataPort they are mapped to a data type not included in the previous bullet. More precisely:
    • In IBM DB2, statistics are not collected for table fields of the data types LONGVARCHAR and LONGVARGRAPHIC.
    • In Microsoft SQL Server, statistics are not collected for table fields of the data types TEXT and NTEXT.
    • In Oracle, statistics are not collected for table fields of the data types CLOB, NCLOB, and SQLXML.
  • Before gathering the statistics of views whose data is obtained from Informix 7.x, select the check box “Do not delegate the generation of statistics”. In later versions of Informix, it is not possible to collect statistics from tables with fields of types INTERVAL.
  • It is not possible to collect statistics from tables in SQL Server with fields of types UNIQUEIDENTIFIER or SQLVARIANT.
  • It is not possible to collect statistics for views that have mandatory fields.