### 9. Generating Statistics

In
Part 1 of this series, we discussed the basics of statistics. This month we'll
explore statistics in-depth, learning how well we can maintain them.

There
are two ways of gathering statistics, COMPUTE and ESTIMATE. COMPUTE results in
100% statistics generation and is accurate. The drawback is that it needs ample
time to generate the complete statistics. ESTIMATE generates statistics as per
the sample number of rows or percent provided. It is faster as only partial
data is considered for arriving at the statistics. The drawback is that
accuracy of the statistics depends on the sample size provided. Please note
that the optimizer is better placed when accurate or near-accurate statistics
are provided. The ESTIMATE option can be used for large systems that cannot
afford the time required for generating statistics that are more accurate.

Please note that columns like NUM_ROWS, AVG_SPACE and AVG_ROW_LEN
may not show exact data if the statistics are being estimated with a specific
sample size. For example, accurate statistics may show you the exact number of
rows in a table when the statistics were taken, but estimate statistics would
show a derived value that might not be equal to the actual number of rows. Read further to find whether your setup will
benefit with the ESTIMATE option.

*How much to estimate?*

If
you intend to ESTIMATE statistics because of time constraints, try to arrive at
an optimal sample size that will yield excellent results for your database. For
example, start with a sample size of 5% and test your application. Increase the
percentage to 10-15% and test your application again.

Some
setups that I have come across report good throughput and response time for
statistics with a sample size of only 5% to 10%. In general, a sample size of 5%
to 10% generates adequate statistics! What will matter here is proper testing
in your setup.

The
accuracy of the statistics will depend on an appropriate sample size. Beyond a
certain sample size, the statistics generated will be fairly consistent. For
example, if you analyze tables with a sample size of 5% and later analyze the
same set of tables with a sample size of 10%, you may find significant changes
in the statistics collected. However, if you analyze the same set of tables
with a sample size of 15-20%, it may not show a significant difference from the
statistics collected at 10% and requires even more time. Thus, an optimal
sample size here would be 10% flat.

If the ESTIMATE sample size is greater than 50%, the statistics
will be almost similar to that of the COMPUTE option, and so will the time it takes
to generate! There is no point estimating beyond 50% (see below example);
rather, go for the COMPUTE option if you can afford that much time.

Gathering
statistics on tables requires sorting to be done and this takes up resources.
Gathering statistics on indexes does not require sorting. Considering this benefit,
you may COMPUTE statistics on indexes for accurate data.

You
may also consider generating statistics in Parallel, an example is provided in the
DBMS_STATS package section below.

Below
is an example of arriving at an appropriate ESTIMATE sample size. Please note
that this is just for one table. For an application, it would be advisable to
try it out at database level and draw conclusions only after proper testing.

*Example*

The
table INV.MATERIAL_TRX has around 4.5 million records. Below are statistics at
various sample sizes.

Commands used:

dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX'); --compute
dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX', estimate_percent => 5);
dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX', estimate_percent => 10);
dbms_stats.gather_table_stats('INV', 'MTL_MATERIAL_TRX', estimate_percent => 51);

Final statistics:

MODE | NUM_ROWS| BLOCKS|AVG_ROW_LEN|SAMPLE_SIZE|LAST_ANAL|Time taken
_______|__________|__________|___________|___________|_________|__________
compute| 4591474| 193230| 251| 4591474|27-JUL-03|2 hr
at 5% | 4582460| 193230| 247| 229123|27-JUL-03|8 mts
at 10% | 4587520| 193230| 249| 458752|27-JUL-03|17 mts
at 20% | 4591635| 193230| 250| 918327|27-JUL-03|32 mts
at 51% | 4590890.2| 193230| 250| 2341354|27-JUL-03|1 hr 56 mts

As
you can see, the NUM_ROWS difference between full statistics and 5% is only
9014 records. Increasing the sample size to 10% adds 5060 records. Making it
20% adds an additional 4115 records but doubles the time. Bringing it up to
51% results in almost the same number of records as the COMPUTE option, but the
time taken is also same. As per my requirement, statistics at a 5% sample size
serves my purpose as the variance is less than 1% that of actual value.

*What should the time interval be?*

To
determine an efficient statistics-gathering interval, keep a history of
statistics generated and evaluate the variations. If the two statistics remain
more or less similar, then you may consider increasing the time interval. If
the statistics vary considerable then either the sample size is not appropriate
(in the case of ESTIMATE) or the time interval is not appropriate. This is easier
said than done and may be a tedious process for huge systems.

When
there is a 10-20% change in data, the general convention is to generate fresh
statistics. You can start of with a general rule of estimating statistics on a
weekly basis. If the tables are giving real bad hits because of heavy activity,
you may consider using the DML Monitoring option to update statistics every few
hours for such tables.

Statistics
are not incremental and are regenerated every time. If there is no considerable
change in data, there is no advantage in generating statistics too frequently.

You
may consider generating statistics for interface tables every time a bulk
uploading is carried out. Objects can be grouped and statistics generated at
different intervals, for example transaction tables could be analyzed every
week, where as, Master tables might be done once a month. Statistics generation
may also be done schema wise, depending on the maintenance window available.

*Statistics locks?*

A
table that is in the process of being analyzed cannot undergo DDL changes but
DML activities can be carried on. Analyzing an index puts a shared lock on the
related table; hence, neither DML nor DDL can be performed. Preferably avoid
all activities during the statistics generation phase.

Oracle
provides more than one way of generating statistics.

- DBMS_UTILITY
- ANALYZE command
- DBMS_DDL
- DBMS_STATS

Following are the options.