Thursday, 27 February 2014

Collect Stats

Collect Statistics

COLLECT STATISTICS is a Teradata sql command that scans columns and indexes of a table and records demographics of the data.

COLLECT STATISITICS is used to provide the Teradata Optimizer with as much information on data as possible. The Optimizer uses this information to determine how many rows exist and which rows qualify for given values.

Collecting statistics can improve the execution of a sql. The optimizer can have more details about each column or index, and therefore determine a better join plan for resolving the query.

COLLECT STATISTICS COMMANDS SHOULD ONLY BE USED WHEN A SPECIFIC SQL (OR GROUP OF SQLS) IS BENEFITTED.

Because, Collecting statistics may not improve a sql’s efficiency. Therefore, the processing time used to collect the statistics was wasted, and this could be a large amount of processing time.
           
If a sql’s performance is shown to improve from a specific collected statistic, then the statistic should be initially collected and then refreshed on a regular basis.

To possibly avoid spoolspace errors in the application design phase, some common problem columns are listed below in "Collect Statistics - Problem Columns".

With COLLECT STATISTICS:

 - COLLECT STATISTICS can make the usage of these more efficient:
                        Primary indexes
                        Secondary indexes
                        Join Columns or base tables
                        Join Indexes
                        Qualifying columns

 -Statistics are especially informative if index values are distributed unevenly.

 - When a query uses conditionals based on non-unique index values, then Teradata uses statistics to determine whether indexing or a full search of all table rows is more efficient.

              
 - If Teradata determines that indexing is the best method, then it uses the statistics to determine whether spooling or building a bitmap would be the most efficient method of qualifying the data rows.
          
   
Without COLLECT STATISTICS:

- The Optimizer assumes:
  Non-unique indexes are highly non-unique. (Lots of rows per value).
 Non-Index columns are even more non-unique than non-unique indexes.     (Lots of rows per value)

- Teradata derives row counts from a random AMP sample for:
  Small tables (less than 1000 rows per amp),Unevenly distributed tables (skewed row distribution due to PI).

  Random amp sample:
  Look at data from 1 amp of table, and from this, estimate the total rows in the table.

  Random amp samples may not represent the true total number of rows in the table because the rows in the table may not be distributed evenly. This occurs often with small tables. Asof 9/2000, per table, the random amp sample uses the same amp for each sql or query.

1 comment: