Gathering the Statistics of Views¶
The cost-based optimization requires the statistics. To instruct Virtual DataPort to gather and store the statistics of a view, do the following:
Open the view.
Click Options and then, click the Statistics tab.
Select Enable statistics.
If you select Complete missing statistics executing SELECT queries, the Server, during the process of gathering the statistics of the view, will execute a
SELECT
statement to gather the statistics that it could not obtain from the system tables of the database.If cleared, the Server gathers the statistics from the system tables of the database and does not execute the SELECT statement.
This option is only available for JDBC base views created over data sources with certain adapters. The section How the Gathering Statistics Process Works lists what statistics are obtained from the system tables of each database vendor.
Do one of these:
Click Gather statistics for selected fields to gather the statistics of this view. Once the process finishes, the statistics form will be fulfilled with the obtained values. Then these values can be stored using the save button, as for any other change in a view.
Important
Before obtaining the statistics of a JDBC base view, make sure that the source database (Oracle, IBM DB2, etc.) has the statistics of this table. If the statistics are not present or are outdated, the statistics gathered will not be accurate and the execution plans selected by the cost-based optimizer could be suboptimal.
Virtual DataPort only obtains the statistics of the selected fields (by default all the fields are selected).
Usually you want to select all the fields to obtain the statistics of all of them at once, but this is not always possible:
It is not possible to obtain the statistics of all the fields. If this is a JDBC base view and the check box “Complete missing statistics…” is selected, Virtual DataPort executes a
SELECT
query to gather the statistics it could not obtain from the database’s system tables.However, the database may return an error when executing this query. For example, some databases return an error when a query tries to obtain the average size of a blob field. In a case such as this, clear the check box of the affected fields to be able to obtain the statistics of the other fields.
Obtaining the statistics of all the fields at once may put too much strain on the data source that holds the data and you want to avoid it. In this case, you may want to obtain the statistics of some of the fields but not all of them.
Read the section Gathering Statistics Automatically to fully understand the process of gathering the statistics of a view.
You can also provide the statistics of the view manually, but we recommend doing it automatically as described in the previous step. To manually provide the statistics of the view, enter the Number of rows and, in the table below, the required values for each field:
The average size of its values.
The maximum value and minimum value (not for values of type
text
)The number of distinct values.
Number of
NULL
values.
The cost-based optimizer does not use the maximum and minimum values of
the fields of type blob
, boolean
, date
, text
, xml
,
array
or record
. Therefore, they cannot be specified (in the
table, the cells of these fields are disabled and have the label “N/A”).
Once the statistics of a view have been gathered or you have provided them manually, you can disable the statistics of the view by clearing the Enable statistics check box. By doing this, the statistics are not “lost” but the cost-based optimization will not use them and therefore, the Server will not apply this optimization in a query if the optimization needs the statistics of this view.
You can also gather the statistics of several views at once. To do this, click the menu Tools > Query optimizations > Manage statistics.
In this dialog, select the views that you want to obtain its statistics and click Gather. To select all the views of a folder, select the folder.
Note
Denodo gathers the statistics of the views using the settings of the dialog Options > tab Statistics of each view (option Complete missing statistics executing SELECT queries of the view).
Depending on these options and the underlying data source, the statistics of a view may be incomplete. For example, the default behavior for JDBC base views is to obtain the statistics from the system tables of the database, without executing any query. Depending on the database, this information may not be complete and you may need to enable the option Complete missing statistics executing SELECT queries of the view and gather the statistics again (see section How the Gathering Statistics Process Works).
Therefore, we recommend reviewing the statistics of the views after this process, in order to ensure that the statistics were gathered as expected.
For the views whose statistics already have been obtained (views with “On” or “Off” in the “Status” column of the table), their statistics can be enabled/disabled by selecting the views and clicking Enable or Disable.
When the statistics of a view are disabled (“Off” status) or have not been gathered (“N/A” status), the Server will not apply the cost-based optimization in a query, if it needs the statistics of one of these views.
To search a view or a folder, type its name in the box located at the top of the dialog.
The statistics of a view are not included in version control because they are environment-dependent.
Gathering Statistics Automatically¶
This section describes the process for automatically gathering the statistics of the views used by the cost-based optimization:
Section How the Gathering Statistics Process Works describes how the gathering process works.
Section Scheduling the Gathering of Statistics describes how to use Denodo Scheduler to collect and update statistics on a periodical basis.
Section Limitations of the Gathering Statistics Process describes some current limitations of the gathering process.
How the Gathering Statistics Process Works¶
This section explains how the Server gathers the statistics of a view.
Important
Before obtaining the statistics for a JDBC base view, make sure that the source database (Oracle, IBM DB2, etc.) has the statistics of this table. If the statistics are not present or are outdated, the statistics gathered will not be accurate and the execution plans selected by the cost-based optimizer could be suboptimal.
The process to gather the statistics of a view depends on the type of base view:
JDBC base views: the Server gathers the statistics of the view from the system tables of the database. The information obtained from these tables depends on the database adapter of the view’s data source. The reason is that some databases provide more statistics about their tables/views than others.
The table Statistics that are gathered from the system tables of each vendor lists the statistics that the Server can gather from each database vendor.
If “Complete missing statistics…” is selected, the Server executes an additional SELECT query to obtain the statistics it cannot obtain from the system tables.
On views created over an element of an object storage (Object Storage data in Parquet Format), Virtual DataPort also computes the statistics on the Embedded MPP.
This does not apply to JDBC base views created from a query, nor to JDBC base views created over stored procedures. In these two cases, the statistics are gathered like for any non-JDBC base view.
Other views: Virtual DataPort executes a query like the one in Sample SQL query executed to gather statistics of a non-JDBC base view to gather the statistics of the view. If the view involves an ODBC source, part of this query may be pushed down to that source, thus making the process faster.
SELECT count(*),
, avg( length( <text field> ) )
, max( <text field> )
, min( <text field> )
, count( DISTINCT <text field> )
, ...
, max( <numeric field> )
, min( <numeric field> )
, count( distinct( <numeric field> ) )
, count( <numeric field> )
, ...
FROM view
When the data is obtained from other types of sources (e.g. delimited files, XML, etc.), Virtual DataPort retrieves all the rows from the source and computes the statistics by itself. In views where all the data is obtained from sources that are not databases, selecting the check box “Gather statistics for…” is equivalent to clearing it.
Note
If the statistics are not gathered from the system tables of a database, the process of gathering statistics can take a long time for large views. It can also induce significant workload in both Virtual DataPort and the data sources. Therefore, it is convenient to gather the statistics during periods where the system is not expected to be under heavy load (see the section Scheduling the Gathering of Statistics for details about how to schedule the statistics gathering process).
The following table lists which statistics are gathered from the system tables of each database vendor.
Database Adapter |
Number of Rows of the Table |
Average Size |
Min Value |
Max Value |
Distinct Values |
Null Values |
---|---|---|---|---|---|---|
Amazon Redshift |
||||||
Azure Synapse SQL |
||||||
ClickHouse |
||||||
Cloudera Impala |
||||||
Databricks |
||||||
Exasol |
||||||
Greenplum |
||||||
Hive |
||||||
IBM DB2 |
||||||
IBM DB2 z/OS |
||||||
Microsoft SQL Server |
||||||
Netezza |
||||||
Oracle |
||||||
PostgreSQL |
||||||
PrestoDB/ Trino (*) |
||||||
SAP HANA |
||||||
Spark 2.x |
||||||
SQreamDB |
||||||
Teradata (**) |
||||||
Vertica |
(*): by default, the PrestoDB/Trino adapter can obtain the statistics from the system tables of PrestoDB/Trino version 183 or higher, but not from earlier versions. To be able to do this for earlier versions of Presto, follow these steps:
Obtain the version of PrestoDB/Trino you are connecting to.
Login as an administrator and execute this command from the VQL Shell:
SET 'com.denodo.vdb.contrib.storedprocedure.GenerateStatsProcedure.presto.statsversion' = '<value>';
where <value> is:
pre_177
: if the version of PrestoDB/Trino is 176 or lower. In these versions, the system tables do not provide statistics.post_177
: if the version is 177 or higher but lower than 183. In these versions, the system tables provide the statistics “number of rows”, “distinct values” and “null values” but not “min value” and “max value”.post_183
(default value): if the version is 183 or higher. In these versions, the system tables provide the statistics listed in the table above.
In the installation of the administration tool you use, edit the file
<DENODO_HOME>/conf/vdp-admin/VDBAdminConfiguration.properties
and add this propertycom.denodo.vdb.contrib.storedprocedure.GenerateStatsProcedure.presto.statsversion=<version>
Replace <version> with the same value you set in the previous step.
Restart the Virtual DataPort server and the administration tool.
(**): For Teradata 13 or older only the number of rows and distinct values are available
For the vendors not listed in this table, the statistics are obtained
issuing SELECT
queries like the one in Sample SQL query executed to gather statistics of a non-JDBC base view.
The following table lists the tables, statements or functions Virtual DataPort queries or executes to obtain the statistics listed in the table above. They depend on the database from which Virtual DataPort obtains the statistics.
Database Adapter |
Elements Inspected to Obtain Statistics |
---|---|
Amazon Redshift |
System tables: PG_STATS, PG_CLASS and PG_NAMESPACE, PG_INHERITS. |
Azure Synapse SQL |
System tables: SYS.OBJECTS, SYS.PARTITIONS, SYS.STATS, SYS.STATS_COLUMNS. |
ClickHouse |
Functions: uniq System tables: system.tables. |
Cloudera Impala |
Statements: SHOW COLUMN STATS and SHOW TABLE STATS. |
Databricks |
Statements: DESCRIBE EXTENDED. |
Exasol |
System tables: $EXA_COLUMN_STATISTICS. |
Greenplum |
System tables: PG_STATS, PG_CLASS and PG_NAMESPACE, PG_INHERITS. |
Hive |
Statements: DESCRIBE FORMATTED. |
IBM DB2 (*) |
System tables: SYSCAT.COLUMNS and SYSCAT.TABLES. |
Microsoft SQL Server |
System tables: SYS.OBJECTS, SYS.PARTITIONS, SYS.STATS, SYS.STATS_COLUMNS. |
Netezza |
System tables: _V_RELATION_COLUMN_XDB and _V_TABLE_XDB. |
Oracle 9i |
System tables: ALL_TAB_COLUMNS and ALL_ALL_TABLES. |
Oracle >= 9i |
System tables: ALL_TAB_COLUMNS and ALL_TAB_STATISTICS |
PostgreSQL |
System tables: PG_STATS, PG_CLASS and PG_NAMESPACE, PG_INHERITS. |
PrestoDB/Trino |
Statements: SHOW STATS |
SAP HANA 1.0 and 2.0 |
System table: SYS.M_DATA_STATISTICS |
Spark 2.x |
Statements: DESCRIBE EXTENDED |
SQreamDB |
System table: SQREAM_CATALOG.COLUMNS |
Teradata |
Statements: HELP STATISTICS, SHOW STATS VALUES System table: DBC.STATSV |
Vertica |
Functions: APPROXIMATE_COUNT_DISTINCT |
The Server does not collect statistics for the fields of the following types:
blob
boolean
xml
array
register
.Any field whose source type property is one of the following:
BIT
CLOB
LONGVARCHAR
LONGNVARCHAR
For the fields of some of these types, it sets a value that is
extrapolated from other statistics. For example, for blob
, array
and register
fields, it assumes that each value is different.
Therefore, the value of the statistic “Distinct values” is the number of
rows of the view.
The cost-based optimizer does not use the maximum and minimum values of
the fields of type blob
, boolean
, date
, text
, xml
,
array
or record
. Therefore, in the statistics table of the view,
the cells of these fields are disabled and have the label “N/A”.
Scheduling the Gathering of Statistics¶
The stored procedure GENERATE_STATS
gathers and stores the
statistics of a view. Use Denodo Scheduler to schedule the gathering of
the statistics of a view. The main objective of this is to gather them
during periods where the system is not expected to be under heavy load.
See more about this procedure in the section GENERATE_STATS of the VQL Guide.
Limitations of the Gathering Statistics Process¶
There are several limitations in the process of gathering statistics. When these limitations apply, statistics can be specified manually for the affected fields:
In some databases, the statistics are not obtained for fields of some database data types, even if in Virtual DataPort they are mapped to a data type not included in the previous bullet. More precisely:
In IBM DB2, statistics are not collected for table fields of the data types
LONGVARCHAR
andLONGVARGRAPHIC
.In Microsoft SQL Server, statistics are not collected for table fields of the data types
TEXT
andNTEXT
.In Oracle, statistics are not collected for table fields of the data types
CLOB
,NCLOB
, andSQLXML
.In Spark, minimum and maximum metrics are not available for binary and string types.
In later versions of Informix, it is not possible to collect statistics from tables with fields of types
INTERVAL
.It is not possible to collect statistics from tables in SQL Server with fields of types
UNIQUEIDENTIFIER
orSQLVARIANT
.It is not possible to collect statistics for views that have mandatory fields.