We
covered the various methods of generating statistics in the last installment; moving
forward, let’s look at the data skewness problem and the
use of the Monitoring option for updating statistics on the fly.
10. Histograms
Histogram
tells the optimizer how the data is distributed for a column. This information
is used in determining the selectivity of the column for a given query and arriving
at an optimal execution plan.
Column
statistics in the form of histograms are appropriate for columns whose data
distribution deviates from the expected uniform distribution. For uniformly
distributed data, the optimizer can do proper costing for executing a
particular statement. When data is not uniformly distributed, also know as highly
skewed data distribution, the optimizer may not be in a position to accurately
estimate the selectivity of a query. This option provides statistics at a very
low level and its use would be rare, though it could prove very beneficial in
certain scenarios.
Columns not eligible for histogram
Please
note that histograms should not be used when any of the following is true.
-
The column data is uniformly distributed.
For example, if we have a column in a table that holds around 100
distinct values. If the number of records that each value holds (or a range of
values) is somewhat similar, the data is more or less uniformly distributed.For instance, the table may have 100000 records, 20% of these are
between values 1-15, 15% are between 16-30, 25% are between 31-50 and so on. If
we design a data distribution graph (histogram chart) on these figures, the
height of each value or range will be more or less balanced. -
The column is not at all used in query predicates.
There
is no need for providing histogram statistics on columns that are not being
used in query conditions. Histogram statistics are stored in the dictionary and
do take up space and analysis time of the optimizer. -
All query predicates or criteria for the column use bind variables!
Yes
that’s right, here Oracle requires that hard coded values be provided for use
of the histogram statistics. Bind variables will not make use of the same. -
The column is unique and used only with equality predicates.
Columns eligible for histogram
If
the above rules are not satisfying for a column, it could be considered for
distribution statistics. For example, if we again consider a column that holds
around 100 distinct statuses in a 100000 records table, 80% of the values lie
between 15-30 and the remaining 20% is distributed in the other ranges. If we
design a data distribution graph (histogram chart) on these figures, the height
of few values or ranges will be very high where as others will be very low.
This shows an oblique line for the data distribution.
A
histogram is good for number columns. For character columns only the first 32
bytes (as of 8.1.7.4) of the string are used for building the histograms, this
may sometimes result in incorrect information being created if the size of the
column data exceeds this limitation.
User-specific
histogram values can also be stored in the dictionary using the
DBMS_STATS.PREPARE_COLUMN_VALUES and DBMS_STATS.SET_COLUMN_VALUES routines.
Dictionary tables
Histogram
information is stored in the following dictionary tables.
Histogram
values for columns in tables:
DBA_TAB_HISTOGRAMS
- endpoint_number – End point number
- endpoint_value – Normalized end point value for the
buckets. - endpoint_actual_value – Actual data value, only shows
non-numeric value for the column.
For
partition table histograms values:
DBA_PART_HISTOGRAMS
DBA_SUBPART_HISTOGRAMS
For
evaluating histograms on indexed columns:
INDEX_HISTOGRAM
- repeat_count
– Number of times one or more index key is repeated in the table. - keys_with_repeat_count
– Number of index keys that are repeated.
Other
Views that give similar data:
DBA_TAB_COL_STATISTICS
DBA_PART_COL_STATISTICS
DBA_SUPPART_COL_STATISTICS
Columns
in the above tables are self-explanatory.
Buckets in Histograms
Histogram
statistics are stored in the form of buckets. Buckets represent the partitioning
of data values, depending on the range. By default, 75 buckets are created. A
maximum of 254 buckets can be specified for a column. How many buckets are
required for a column will depend on the occurrences of distinct values. The
default number of buckets is appropriate, but you will have to experiment with
various bucket sizes to find the most suitable size.
If
the number of distinct column values is less than the number of buckets
specified, the individual column values and the count of these values is
directly stored as histogram statistics. If the number of distinct column values
is more than the buckets specified, Oracle uses an algorithm to store these
values in ranges. If a series of continuous ranges have the same value, they
may not be shown in the histogram table to save on space.
You
may find columns with one-bucket histograms, these are as good as no histogram
statistics and the optimizer ignores them.