### 9. Generating Statistics

In

Part 1 of this series, we discussed the basics of statistics. This month we’ll

explore statistics in-depth, learning how well we can maintain them.

There

are two ways of gathering statistics, COMPUTE and ESTIMATE. COMPUTE results in

100% statistics generation and is accurate. The drawback is that it needs ample

time to generate the complete statistics. ESTIMATE generates statistics as per

the sample number of rows or percent provided. It is faster as only partial

data is considered for arriving at the statistics. The drawback is that

accuracy of the statistics depends on the sample size provided. Please note

that the optimizer is better placed when accurate or near-accurate statistics

are provided. The ESTIMATE option can be used for large systems that cannot

afford the time required for generating statistics that are more accurate.

Please note that columns like NUM_ROWS, AVG_SPACE and AVG_ROW_LEN

may not show exact data if the statistics are being estimated with a specific

sample size. For example, accurate statistics may show you the exact number of

rows in a table when the statistics were taken, but estimate statistics would

show a derived value that might not be equal to the actual number of rows. Read further to find whether your setup will

benefit with the ESTIMATE option.

*How much to estimate?*

If

you intend to ESTIMATE statistics because of time constraints, try to arrive at

an optimal sample size that will yield excellent results for your database. For

example, start with a sample size of 5% and test your application. Increase the

percentage to 10-15% and test your application again.

Some

setups that I have come across report good throughput and response time for

statistics with a sample size of only 5% to 10%. In general, a sample size of 5%

to 10% generates adequate statistics! What will matter here is proper testing

in your setup.

The

accuracy of the statistics will depend on an appropriate sample size. Beyond a

certain sample size, the statistics generated will be fairly consistent. For

example, if you analyze tables with a sample size of 5% and later analyze the

same set of tables with a sample size of 10%, you may find significant changes

in the statistics collected. However, if you analyze the same set of tables

with a sample size of 15-20%, it may not show a significant difference from the

statistics collected at 10% and requires even more time. Thus, an optimal

sample size here would be 10% flat.

If the ESTIMATE sample size is greater than 50%, the statistics

will be almost similar to that of the COMPUTE option, and so will the time it takes

to generate! There is no point estimating beyond 50% (see below example);

rather, go for the COMPUTE option if you can afford that much time.

Gathering

statistics on tables requires sorting to be done and this takes up resources.

Gathering statistics on indexes does not require sorting. Considering this benefit,

you may COMPUTE statistics on indexes for accurate data.

You

may also consider generating statistics in Parallel, an example is provided in the

DBMS_STATS package section below.

Below

is an example of arriving at an appropriate ESTIMATE sample size. Please note

that this is just for one table. For an application, it would be advisable to

try it out at database level and draw conclusions only after proper testing.

*Example*

The

table INV.MATERIAL_TRX has around 4.5 million records. Below are statistics at

various sample sizes.

Commands used:

dbms_stats.gather_table_stats(‘INV’, ‘MTL_MATERIAL_TRX’); –compute

dbms_stats.gather_table_stats(‘INV’, ‘MTL_MATERIAL_TRX’, estimate_percent => 5);

dbms_stats.gather_table_stats(‘INV’, ‘MTL_MATERIAL_TRX’, estimate_percent => 10);

dbms_stats.gather_table_stats(‘INV’, ‘MTL_MATERIAL_TRX’, estimate_percent => 51);

Final statistics:

MODE | NUM_ROWS| BLOCKS|AVG_ROW_LEN|SAMPLE_SIZE|LAST_ANAL|Time taken

_______|__________|__________|___________|___________|_________|__________

compute| 4591474| 193230| 251| 4591474|27-JUL-03|2 hr

at 5% | 4582460| 193230| 247| 229123|27-JUL-03|8 mts

at 10% | 4587520| 193230| 249| 458752|27-JUL-03|17 mts

at 20% | 4591635| 193230| 250| 918327|27-JUL-03|32 mts

at 51% | 4590890.2| 193230| 250| 2341354|27-JUL-03|1 hr 56 mts

As

you can see, the NUM_ROWS difference between full statistics and 5% is only

9014 records. Increasing the sample size to 10% adds 5060 records. Making it

20% adds an additional 4115 records but doubles the time. Bringing it up to

51% results in almost the same number of records as the COMPUTE option, but the

time taken is also same. As per my requirement, statistics at a 5% sample size

serves my purpose as the variance is less than 1% that of actual value.

*What should the time interval be?*

To

determine an efficient statistics-gathering interval, keep a history of

statistics generated and evaluate the variations. If the two statistics remain

more or less similar, then you may consider increasing the time interval. If

the statistics vary considerable then either the sample size is not appropriate

(in the case of ESTIMATE) or the time interval is not appropriate. This is easier

said than done and may be a tedious process for huge systems.

When

there is a 10-20% change in data, the general convention is to generate fresh

statistics. You can start of with a general rule of estimating statistics on a

weekly basis. If the tables are giving real bad hits because of heavy activity,

you may consider using the DML Monitoring option to update statistics every few

hours for such tables.

Statistics

are not incremental and are regenerated every time. If there is no considerable

change in data, there is no advantage in generating statistics too frequently.

You

may consider generating statistics for interface tables every time a bulk

uploading is carried out. Objects can be grouped and statistics generated at

different intervals, for example transaction tables could be analyzed every

week, where as, Master tables might be done once a month. Statistics generation

may also be done schema wise, depending on the maintenance window available.

*Statistics locks?*

A

table that is in the process of being analyzed cannot undergo DDL changes but

DML activities can be carried on. Analyzing an index puts a shared lock on the

related table; hence, neither DML nor DDL can be performed. Preferably avoid

all activities during the statistics generation phase.

Oracle

provides more than one way of generating statistics.

- DBMS_UTILITY
- ANALYZE command
- DBMS_DDL
- DBMS_STATS

Following are the options.