Generating Histograms
These
statistics could be generated using the following commands:
DBMS_STATS.GATHER_TABLE_STATS
with METHOD_OPT => 'FOR COLUMNS <colname> SIZE xxx'
or
ANALYZE
command for column statistics with SIZE option.
The
SIZE option in the above command specifies the number of buckets to be stored
e.g.: Gather histogram statistics for AM20.ENAME.
Since there are 17 distinct values in the column, I can choose to have about 1
to 17 buckets in the histogram. As the frequency of each value is less, an
approximate of 5 buckets is chosen initially.
SQL> exec dbms_stats.gather_table_stats(ownname => 'SYS', tabname => 'AM20',-
> method_opt => 'FOR COLUMNS ENAME SIZE 5');
PL/SQL procedure successfully completed.
SQL> SELECT table_name, column_name,num_distinct,num_buckets,last_analyzed,
2 avg_col_len
3 from dba_tab_col_statistics
4 where table_name = 'AM20';
TABLE_NAME COLUMN_NAM NUM_DISTINCT NUM_BUCKETS LAST_ANAL AVG_COL_LEN
---------- ---------- ------------ ----------- --------- -----------
AM20 ENAME 17 5 17-OCT-03 6
SQL> select owner, table_name, column_name,
2 to_char(endpoint_number, '99999.99') EP_NO,
3 trunc(endpoint_value) EP_VAL,
4 endpoint_actual_value EP_A_VAL
5 from dba_tab_histograms
6 where table_name = 'AM20'
7 and column_name = 'ENAME'
8 order by endpoint_number;
OWNER TABLE_NAME COLUMN_NAM EP_NO EP_VAL EP_A_VAL
---------- ---------- ---------- --------- ---------- -----------
SYS AM20 ENAME .00 3.3888E+35 ADAMS
SYS AM20 ENAME 1.00 3.6507E+35 FORD
SYS AM20 ENAME 2.00 4.0632E+35 NARAYAN
SYS AM20 ENAME 3.00 4.3269E+35 SUNIL
SYS AM20 ENAME 4.00 4.3788E+35 TURNER
SYS AM20 ENAME 5.00 4.5305E+35 WARD
6 rows selected.
Again,
if you are maintaining histograms for a frequently changing column, it is
recommended to update the histograms more frequently to provide accurate
statistics to the optimizer.
Histogram on indexed columns
Oracle
provides a method of analyzing indexed columns that need to be provided with
histogram statistics. This can be done by using the INDEX_HISTOGRAM view.
e.g.: Column ename is provided with an index in
table AM20. This will be used to analyze the density of the values. As can be
seen below, the INDEX_HISTOGRAM view shows the occurrences of the distinct
values. This information can be analyzed to find out if the data is skewed.
SQL> create index am20_ind1 on am20(ename);
Index created.
SQL> analyze index am20_ind1 validate structure;
Index analyzed.
SQL> select * from index_histogram;
REPEAT_COUNT KEYS_WITH_REPEAT_COUNT
-------------------- ----------------------
0 0
1 12
2 0
3 0
4 1
5 1
6 0
7 0
8 1
9 2
10 0
11 0
12 0
13 0
14 0
15 0
16 rows selected.
Histogram use
The
following is a simple example of how the histogram is actually used by the
optimizer to pick up an index based on the selectivity of the column present in
the predicate.
e.g.: There are 600 records in the table AM20. ENAME
data 'SUNIL' has 512 records and 'TURNER' has 9 records. Without the histogram
statistics, the index is used to retrieve values for both the queries as shown
below.
SQL> select * from am20 where ename = 'TURNER';
...
9 rows selected.
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=11)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'AM20' (Cost=1 Card=6 Bytes=66)
3 2 INDEX (RANGE SCAN) OF 'AM20_N1' (NON-UNIQUE) (Cost=1 Card=6)
SQL> select * from am20 where ename = 'SUNIL';
...
512 rows selected.
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=11)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (BY INDEX ROWID) OF 'AM20' (Cost=1 Card=6 Bytes=66)
3 2 INDEX (RANGE SCAN) OF 'AM20_N1' (NON-UNIQUE) (Cost=1 Card=6)
The
use of index in retrieving 9 rows is fast, but to retrieve 512 records (85% of
the table) with an index is overkill. The optimizer is not equipped with the
data distribution statistics to make this decision. Histogram data is provided
and the query is re-run as shown below, with the right execution plan this
time.
SQL> exec dbms_stats.gather_table_stats(ownname => 'SYS', tabname => 'AM20',-
> method_opt => 'FOR COLUMNS ENAME SIZE 75');
PL/SQL procedure successfully completed.
SQL> select * from am20 where ename = 'TURNER';
...
512 rows selected.
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=FIRST_ROWS (Cost=1 Card=1 Bytes=10)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'AM20' (Cost=1 Card=9 Bytes=90)