A useful feature introduced in Oracle 12c (12.1.0.2) is the Hybrid histogram, a histogram type that uses fewer buckets than the number of distinct values in the column data and records the repeat frequency of each endpoint value. Because of the criteria Oracle applies when generating it, this type of histogram can adjust the bucket size based on the endpoint values. Unlike the other histograms Oracle can generate, the Hybrid histogram is built on four ‘rules’:
- a value should not be found in more than one bucket
- the bucket size is allowed to be extended in order to contain all instances of the same distinct value
- adjusted bucket size cannot be less than the original size (not applicable at either end of the data set)
- the original number of buckets should not be reduced
Let’s look at how Oracle decides on, and creates, a Hybrid histogram. Additionally we’ll look at the instability of a Hybrid histogram as values are added to the data set. We take the table and data from an example by Jonathan Lewis:
SQL>
SQL> --
SQL> -- Create and populate a table so we can
SQL> -- build a hybrid histogram
SQL> --
SQL> -- Thanks to Jonathan Lewis for generating
SQL> -- this table and data set
SQL> --
SQL> create table t1 (id number, n1 number);
Table created.
Elapsed: 00:00:00.01
SQL>
SQL> @InsT1.sql
Elapsed: 00:00:00.00
SQL>
To create a hybrid histogram we choose a number of buckets less than the number of distinct values in the table:
SQL>
SQL> --
SQL> -- Return the number of distinct values
SQL> -- in column n1
SQL> --
SQL>
SQL> select count(distinct n1) from t1;
COUNT(DISTINCTN1)
-----------------
37
Elapsed: 00:00:00.00
SQL>
Let’s choose 20 buckets for our histogram and use DBMS_STATS to create it; let’s also trace the DBMS_STATS session to see what decisions it makes with respect to the data:
SQL>
SQL> --
SQL> -- Create a hybrid histogram on column n1
SQL> -- by setting the number of buckets less
SQL> -- than the number of distinct values
SQL> --
SQL> -- Trace the DBMS_STATS session to see if
SQL> -- a hybrid histogram is selected and
SQL> -- generated
SQL> --
SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (16));
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.03
SQL> BEGIN
2 dbms_stats.gather_table_stats
3 (user, 't1', method_opt => 'for columns n1 size 20');
4 END;
5 /
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.43
SQL> exec dbms_stats.set_global_prefs('TRACE', null);
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01
SQL>
The trace file generated by DBMS_STATS shows that a Hybrid histogram was chosen:
Trace file C:\APP\ORADBA\diag\rdbms\delphixdb\delphixdb\trace\delphixdb_ora_14648.trc
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Windows NT Version V6.2
CPU : 4 - type 8664, 2 Physical Cores
Process Affinity : 0x0x0000000000000000
Memory (Avail/Total): Ph:10465M/20176M, Ph+PgF:15751M/40656M
Instance name: delphixdb
Redo thread mounted by this instance: 1
Oracle process number: 25
Windows thread id: 14648, image: ORACLE.EXE (SHAD)
*** 2016-09-08 16:19:41.944
*** SESSION ID:(124.22872) 2016-09-08 16:19:41.944
*** CLIENT ID:() 2016-09-08 16:19:41.944
*** SERVICE NAME:(SYS$USERS) 2016-09-08 16:19:41.944
*** MODULE NAME:(SQL*Plus) 2016-09-08 16:19:41.944
*** CLIENT DRIVER:(SQL*PLUS) 2016-09-08 16:19:41.944
*** ACTION NAME:() 2016-09-08 16:19:41.944
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
DBMS_STATS: Y Y ID
DBMS_STATS: Y Y Y Y Y Y Y Y N1
DBMS_STATS: Approximate NDV Options
DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS: >> frequency histograms is not feasible
(topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS: NNV NDV AVG MMX HST EP RP NNNP IND CNDV HSTN HSTR COLNAME
DBMS_STATS: ID
DBMS_STATS: Y Y Y Y N1
DBMS_STATS: Building Histogram for N1
DBMS_STATS: bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20
DBMS_STATS: Trying hybrid histogram
DBMS_STATS: > cdn 100, popFreq 33, popCnt 5, bktSize 4.71428571428571428571428571428571428571, bktSzFrc .71428571428571428571428571428571428571
DBMS_STATS: Evaluating hybrid histogram: cht.count 20, mnb 20, ssize 100, min_ssize 2500, appr_ndv TRUE, ndv 37, selNdv 1, selFreq 6, pct 100, avg_bktsize 5, csr.hreq TRUE, normalize TRUE
DBMS_STATS: Histogram gathering flags: 7
DBMS_STATS: Accepting histogram
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
DBMS_STATS: ====================================================================================================
DBMS_STATS: Statistics from clist:
DBMS_STATS: ====================================================================================================
DBMS_STATS: Number of rows in the table = 100
DBMS_STATS: COL_NAME ACL NV SNNVDV NDV SNDV DENS CCNT
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: ID 3 0 100 NULL NULL 0 0
DBMS_STATS: min:
DBMS_STATS: max:
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: COL_NAME ACL NV SNNVDV NDV SNDV DENS CCNT
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: N1 3 0 100 37 37 .026167 1
DBMS_STATS: min: Typ=2 Len=2: c1,9
DBMS_STATS: max: Typ=2 Len=2: c1,3c
DBMS_STATS: Histograms:
DBMS_STATS: ---------------------------------------------------------------------------------------------------
DBMS_STATS: BVAL RPCNT EAVAL ENVAL EDVAL
DBMS_STATS: ---------------------------------------------------------------------------------------------------
DBMS_STATS: 1 1 C109 8 Typ=2 Len=2: c1,9
DBMS_STATS: 6 3 C10E 13 Typ=2 Len=2: c1,e
DBMS_STATS: 12 2 C113 18 Typ=2 Len=2: c1,13
DBMS_STATS: 20 5 C115 20 Typ=2 Len=2: c1,15
DBMS_STATS: 26 2 C118 23 Typ=2 Len=2: c1,18
DBMS_STATS: 32 3 C11B 26 Typ=2 Len=2: c1,1b
DBMS_STATS: 38 6 C11C 27 Typ=2 Len=2: c1,1c
DBMS_STATS: 44 6 C11D 28 Typ=2 Len=2: c1,1d
DBMS_STATS: 50 6 C11E 29 Typ=2 Len=2: c1,1e
DBMS_STATS: 58 5 C120 31 Typ=2 Len=2: c1,20
DBMS_STATS: 69 8 C122 33 Typ=2 Len=2: c1,22
DBMS_STATS: 79 7 C124 35 Typ=2 Len=2: c1,24
DBMS_STATS: 86 5 C127 38 Typ=2 Len=2: c1,27
DBMS_STATS: 90 1 C12A 41 Typ=2 Len=2: c1,2a
DBMS_STATS: 92 2 C12B 42 Typ=2 Len=2: c1,2b
DBMS_STATS: 95 3 C12C 43 Typ=2 Len=2: c1,2c
DBMS_STATS: 96 1 C12D 44 Typ=2 Len=2: c1,2d
DBMS_STATS: 97 1 C12E 45 Typ=2 Len=2: c1,2e
DBMS_STATS: 98 1 C12F 46 Typ=2 Len=2: c1,2f
DBMS_STATS: 100 1 C13C 59 Typ=2 Len=2: c1,3c
DBMS_STATS: Need Actual Values (DSC_EAVS)
DBMS_STATS: Histogram Type: HYBRID Data Type: 2
DBMS_STATS: Histogram Flags: 4 Histogram Gathering Flags: 6
DBMS_STATS: Incremental: FALSE Fix Control 13583722: 1
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: arlen = 6 ssize = 100
That decision was based on the data: this data set doesn’t qualify for a Top-Frequency histogram because the row count covered by the top 20 values (20 being the number of buckets in the histogram) is less than the threshold calculated for that number of buckets:
SQL>
SQL> --
SQL> -- Return the count for the TOP 20 values
SQL> -- in the table
SQL> --
SQL> -- Used to see if the data qualifies for
SQL> -- a TOP-Frequency histogram
SQL> --
SQL> select
2 sum (cnt) TopNRows
3 from (select
4 n1
5 ,count(*) cnt
6 from t1
7 group by n1
8 order by count(*) desc
9 )
10 where rownum <= 20;
TOPNROWS
----------
80
Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- Return the threshold value for a TOP-Frequency histogram
SQL> -- for this data set
SQL> --
SQL> -- The formula is: round((n-1)/n * 100)
SQL> --
SQL>
SQL> select round ((20-1)/20 * 100) threshold from dual;
THRESHOLD
----------
95
Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- The threshold is greater than the count for this
SQL> -- data set
SQL> --
SQL> -- The data does not qualify for a TOP-Frequency histogram,
SQL> -- but does qualify for a Hybrid histogram
SQL> --
SQL> -- Compute the bucket size for the hybrid histogram
SQL> -- for this data set and display the relevant
SQL> -- column statistics
SQL> --
SQL> select
2 column_name
3 ,num_distinct
4 ,num_buckets
5 ,sample_size
6 ,round(sample_size/num_buckets,0) orig_bucket_size
7 ,histogram
8 from
9 user_tab_col_statistics
10 where table_name = 'T1'
11 and column_name ='N1';
COLUMN_NAME NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE ORIG_BUCKET_SIZE HISTOGRAM
----------------------------------- ------------ ----------- ----------- ---------------- ---------------
N1 37 20 100 5 HYBRID
Elapsed: 00:00:00.20
SQL>
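The qualification test shown above boils down to simple arithmetic, and we can sketch it in a few lines of Python (the helper name and sample data are ours, purely for illustration; this is not DBMS_STATS):

```python
from collections import Counter

def qualifies_for_top_frequency(values, num_buckets):
    """Return (qualifies, topn_rows, threshold_pct) for a candidate
    Top-Frequency histogram; illustrative sketch only."""
    counts = Counter(values)
    # rows covered by the top N values, N = number of buckets
    topn_rows = sum(c for _, c in counts.most_common(num_buckets))
    # the threshold formula from the transcript: round((n-1)/n * 100)
    threshold_pct = round((num_buckets - 1) / num_buckets * 100)
    pct_covered = topn_rows / len(values) * 100
    return pct_covered >= threshold_pct, topn_rows, threshold_pct
```

With 20 buckets the threshold is 95 percent; our data’s top 20 values cover only 80 of 100 rows, so the data fails the test and Oracle moves on to consider a Hybrid histogram.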
The histogram creation isn’t finished, as Oracle now orders the column values to find values that cross bucket boundaries. Once such values are found, new bucket sizes are computed for all but the boundary buckets, in this case bucket 0 and bucket 19:
SQL>
SQL> --
SQL> -- Four rules exist to create a hybrid histogram, and the
SQL> -- first two are:
SQL> --
SQL> -- a value should not be found in more than one bucket
SQL> -- a bucket size is allowed to be extended in order to contain
SQL> -- all instances of the same value
SQL> --
SQL> -- Compute the 'new' bucket size based on the data and
SQL> -- endpoint values
SQL> --
SQL> -- To illustrate how the bucket size adjustment is
SQL> -- calculated let's look at the value 13 in the data
SQL> --
SQL> -- Oracle 'walks' the ordered list of values until
SQL> -- it finds 13 in two buckets
SQL> --
SQL> -- Oracle extends the bucket where 13 is first found (bucket n)
SQL> -- and moves all other instances of 13 (from bucket n+1)
SQL> -- to this bucket (n)
SQL> --
SQL> -- Moving all instances of 13 to a single bucket
SQL> -- also makes 13 an endpoint value allowing Oracle
SQL> -- to compute the repeat count of that value
SQL> --
SQL> -- This continues on through the values so that
SQL> -- Oracle can place all occurrences of a distinct
SQL> -- value so they don't populate more than one bucket
SQL> --
SQL> SELECT
2 (row_number() over(order by ept_nbr)-1) NumBucket
3 ,ept_nbr
4 ,ept_act_val
5 ,rpt_cnt
6 ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
7 ,bucket_size "original bucket_size"
8 FROM
9 (SELECT
10 ah.endpoint_number ept_nbr
11 ,ah.endpoint_actual_value ept_act_val
12 ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
13 ,ah.endpoint_repeat_count rpt_cnt
14 ,at.sample_size/at.num_buckets bucket_size
15 FROM
16 user_tab_histograms ah
17 ,user_tab_col_statistics at
18 WHERE ah.table_name = at.table_name
19 AND ah.column_name = at.column_name
20 AND ah.table_name = 'T1'
21 AND ah.column_name = 'N1'
22 ) ORDER BY ept_nbr;
NUMBUCKET EPT_NBR EPT_ACT_VAL RPT_CNT new bucket size original bucket_size
---------- ---------- ------------ ---------- --------------- --------------------
0 1 8 1 1 5
1 6 13 3 5 5
2 12 18 2 6 5
3 20 20 5 8 5
4 26 23 2 6 5
5 32 26 3 6 5
6 38 27 6 6 5
7 44 28 6 6 5
8 50 29 6 6 5
9 58 31 5 8 5
10 69 33 8 11 5
NUMBUCKET EPT_NBR EPT_ACT_VAL RPT_CNT new bucket size original bucket_size
---------- ---------- ------------ ---------- --------------- --------------------
11 79 35 7 10 5
12 86 38 5 7 5
13 90 41 1 4 5
14 92 42 2 2 5
15 95 43 3 3 5
16 96 44 1 1 5
17 97 45 1 1 5
18 98 46 1 1 5
19 100 59 1 2 5
20 rows selected.
Elapsed: 00:00:00.21
SQL>
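The ‘walk’ Oracle performs can be sketched in a few lines of Python. This is a simplified approximation for illustration only, not Oracle’s internal algorithm: a tentative bucket boundary is extended until every instance of the endpoint value sits in a single bucket, yielding the endpoint number and repeat count:

```python
def hybrid_walk(sorted_vals, num_buckets):
    """Simplified sketch of the hybrid bucket-consolidation walk.
    Returns a list of (endpoint_number, endpoint_value, repeat_count)."""
    n = len(sorted_vals)
    bucket_size = n / num_buckets          # original bucket size
    endpoints = []
    i = 0
    while i < n:
        # tentative boundary: roughly one bucket's worth of rows
        j = min(int(round(i + bucket_size)), n) - 1
        # Rules 1 and 2: extend the bucket so every instance of the
        # endpoint value lands in a single bucket
        while j + 1 < n and sorted_vals[j + 1] == sorted_vals[j]:
            j += 1
        ep = sorted_vals[j]
        rpt = sorted_vals[i:j + 1].count(ep)   # repeat count of the endpoint
        endpoints.append((j + 1, ep, rpt))
        i = j + 1
    return endpoints
```

Running this against a small ordered list shows the behavior: with values [1, 1, 2, 2, 2, 3, 4, 5, 5, 6] and three buckets, the first bucket is stretched to hold all three 2s, making 2 an endpoint with a repeat count of 3.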
Notice that for buckets 1-12 the bucket size increased; the remaining non-boundary buckets would shrink. But, according to Rule 3, no bucket can be adjusted to a size smaller than the original bucket size, so those buckets remain unadjusted. The bucket size in the query results below is simply the sample size divided by the number of buckets and thus displays a constant value for all buckets:
SQL>
SQL> --
SQL> -- Given that boundary values are the exception computing the new
SQL> -- bucket sizes shows that for bucket numbers 1-12 the sizes are
SQL> -- the same or larger than the originally computed size (5)
SQL> --
SQL> -- All remaining non-endpoint buckets compute to a smaller size
SQL> -- than the original
SQL> --
SQL> -- Now the third rule appears:
SQL> --
SQL> -- No non-endpoint bucket can be smaller than the originally
SQL> -- computed size
SQL> --
SQL> select
2 uth.endpoint_number
3 ,uth.endpoint_actual_value
4 ,uth.endpoint_repeat_count
5 ,ucs.sample_size/ucs.num_buckets bucket_size
6 ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
7 from
8 user_tab_histograms uth
9 ,user_tab_col_statistics ucs
10 where
11 uth.table_name = ucs.table_name
12 and uth.column_name = ucs.column_name
13 and uth.table_name = 'T1'
14 and uth.column_name = 'N1'
15 order by uth.endpoint_number;
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ------------ --------------------- ----------- ----------
1 8 1 5 -4
6 13 3 5 -2
12 18 2 5 -3
20 20 5 5 0
26 23 2 5 -3
32 26 3 5 -2
38 27 6 5 1
44 28 6 5 1
50 29 6 5 1
58 31 5 5 0
69 33 8 5 3
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ------------ --------------------- ----------- ----------
79 35 7 5 2
86 38 5 5 0
90 41 1 5 -4
92 42 2 5 -3
95 43 3 5 -2
96 44 1 5 -4
97 45 1 5 -4
98 46 1 5 -4
100 59 1 5 -4
20 rows selected.
Elapsed: 00:00:00.19
SQL>
For a Hybrid histogram three types of column value exist on which to base cardinality estimates: popular values, non-popular values that are endpoints, and non-popular values that are not endpoints. Let’s look at how Oracle estimates the cardinality for each case, starting with popular values:
SQL>
SQL> --
SQL> -- Display the endpoint, value, repeat count,
SQL> -- bucket size and the 'popularity'
SQL> --
SQL> -- Popularity is determined by computing the
SQL> -- difference between the frequency of appearance
SQL> -- (repeat count) and the bucket size (sample size/number of buckets)
SQL> -- Positive values are considered 'popular'
SQL> --
SQL> select
2 endpoint_number
3 ,endpoint_actual_value
4 ,endpoint_repeat_count
5 ,bucket_size
6 ,case when Popularity > 0 then 'Pop'
7 else 'Non-Pop'
8 end Popularity
9 from
10 (
11 select
12 uth.endpoint_number
13 ,uth.endpoint_actual_value
14 ,uth.endpoint_repeat_count
15 ,ucs.sample_size/ucs.num_buckets bucket_size
16 ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
17 from
18 user_tab_histograms uth
19 ,user_tab_col_statistics ucs
20 where
21 uth.table_name = ucs.table_name
22 and uth.column_name = ucs.column_name
23 and uth.table_name = 'T1'
24 and uth.column_name = 'N1'
25 )
26 order by endpoint_number;
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ------------ --------------------- ----------- -------
1 8 1 5 Non-Pop
6 13 3 5 Non-Pop
12 18 2 5 Non-Pop
20 20 5 5 Non-Pop
26 23 2 5 Non-Pop
32 26 3 5 Non-Pop
38 27 6 5 Pop
44 28 6 5 Pop
50 29 6 5 Pop
58 31 5 5 Non-Pop
69 33 8 5 Pop
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ------------ --------------------- ----------- -------
79 35 7 5 Pop
86 38 5 5 Non-Pop
90 41 1 5 Non-Pop
92 42 2 5 Non-Pop
95 43 3 5 Non-Pop
96 44 1 5 Non-Pop
97 45 1 5 Non-Pop
98 46 1 5 Non-Pop
100 59 1 5 Non-Pop
20 rows selected.
Elapsed: 00:00:00.19
SQL>
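The popularity test the query implements reduces to a single comparison, which we can sketch as follows (the function name is ours, for illustration):

```python
def classify_endpoints(endpoints, sample_size, num_buckets):
    """Label each (value, repeat_count) endpoint 'Pop' or 'Non-Pop'.
    An endpoint is popular when its repeat count exceeds the bucket
    size (sample_size / num_buckets)."""
    bucket_size = sample_size / num_buckets
    return {val: ('Pop' if rpt > bucket_size else 'Non-Pop')
            for val, rpt in endpoints}
```

Using a few endpoints from the transcript (sample size 100, 20 buckets, so a bucket size of 5): 27 and 33 come out popular, 20 and 8 do not.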
Let’s return only the popular values and work through the cardinality estimate Oracle generates:
SQL> --
SQL> -- Using the popularity Oracle estimates the cardinality
SQL> -- by considering the following three types of values:
SQL> --
SQL> -- popular value
SQL> -- non-popular value with an endpoint number
SQL> -- non-popular value not present in the histogram table
SQL> --
SQL> -- Case 1: Popular values
SQL> --
SQL> -- Display the 'popular' values in this data set
SQL> --
SQL> select
2 endpoint_actual_value
3 ,endpoint_repeat_count
4 ,bucket_size
5 ,Popularity
6 from
7 (
8 select
9 endpoint_number
10 ,endpoint_actual_value
11 ,endpoint_repeat_count
12 ,bucket_size
13 ,case when Popularity > 0 then 'Pop'
14 else 'Non-Pop'
15 end Popularity
16 from
17 (
18 select
19 uth.endpoint_number
20 ,uth.endpoint_actual_value
21 ,uth.endpoint_repeat_count
22 ,ucs.sample_size/ucs.num_buckets bucket_size
23 ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
24 from
25 user_tab_histograms uth
26 ,user_tab_col_statistics ucs
27 where
28 uth.table_name = ucs.table_name
29 and uth.column_name = ucs.column_name
30 and uth.table_name = 'T1'
31 and uth.column_name = 'N1'
32 )
33 )
34 where Popularity = 'Pop';
ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
27 6 5 Pop
28 6 5 Pop
29 6 5 Pop
33 8 5 Pop
35 7 5 Pop
Elapsed: 00:00:00.20
SQL>
Looking at one of these popular values (33), let’s return the cardinality estimate Oracle calculated:
SQL> --
SQL> -- Using explain plan the cardinality estimation
SQL> -- can be displayed for one of the 'popular' values
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 33;
Explained.
Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- The cardinality estimate displayed below was
SQL> -- calculated using the following formula:
SQL> --
SQL> -- E-Rows = ENDPOINT_REPEAT_COUNT * num_rows/sample_size
SQL> -- E-Rows = 8 * 100/100 = 8
SQL> --
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL| T1 | 8 | 24 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------
2 - filter("N1"=33)
14 rows selected.
Elapsed: 00:00:00.03
SQL>
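The popular-value formula from the comment block can be checked with a one-liner, using the transcript’s numbers (the helper name is ours):

```python
def popular_cardinality(repeat_count, num_rows, sample_size):
    """Case 1 estimate: E-Rows = ENDPOINT_REPEAT_COUNT * num_rows / sample_size."""
    return round(repeat_count * num_rows / sample_size)

# For n1 = 33: repeat count 8, 100 rows, sample size 100 -> 8
```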
From the formula found in the above comments our calculated cardinality matches the value provided by the optimizer, so our understanding of the Hybrid histogram seems sound. To continue that verification we consider the case of a non-popular value that is an endpoint. The formula for that cardinality calculation is: E-Rows = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size). We calculate NewDensity with a query based on the work of Alberto Dell’Era. First we return the non-popular values that are endpoints:
SQL> --
SQL> -- Case 2: Non-popular values with an endpoint
SQL> --
SQL> -- First return all non-popular values that are an
SQL> -- endpoint
SQL> --
SQL> select
2 endpoint_actual_value
3 ,endpoint_repeat_count
4 ,bucket_size
5 ,Popularity
6 from
7 (
8 select
9 endpoint_number
10 ,endpoint_actual_value
11 ,endpoint_repeat_count
12 ,bucket_size
13 ,case when Popularity > 0 then 'Pop'
14 else 'Non-Pop'
15 end Popularity
16 from
17 (
18 select
19 uth.endpoint_number
20 ,uth.endpoint_actual_value
21 ,uth.endpoint_repeat_count
22 ,ucs.sample_size/ucs.num_buckets bucket_size
23 ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
24 from
25 user_tab_histograms uth
26 ,user_tab_col_statistics ucs
27 where
28 uth.table_name = ucs.table_name
29 and uth.column_name = ucs.column_name
30 and uth.table_name = 'T1'
31 and uth.column_name = 'N1'
32 )
33 )
34 where Popularity = 'Non-Pop';
ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
8 1 5 Non-Pop
13 3 5 Non-Pop
18 2 5 Non-Pop
20 5 5 Non-Pop
23 2 5 Non-Pop
26 3 5 Non-Pop
31 5 5 Non-Pop
38 5 5 Non-Pop
41 1 5 Non-Pop
42 2 5 Non-Pop
43 3 5 Non-Pop
ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
44 1 5 Non-Pop
45 1 5 Non-Pop
46 1 5 Non-Pop
59 1 5 Non-Pop
15 rows selected.
Elapsed: 00:00:00.20
SQL>
Let’s return Oracle’s estimated cardinality for one of these values (45):
SQL> --
SQL> -- Use explain plan to return the estimated cardinality
SQL> --
SQL> -- In this case the estimated cardinality is computed as:
SQL> --
SQL> -- E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 45;
Explained.
Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL| T1 | 2 | 6 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------
2 - filter("N1"=45)
14 rows selected.
Elapsed: 00:00:00.02
SQL>
Now let’s use our query to return the value for NewDensity:
SQL> --
SQL> -- The NewDensity is computed by an internal algorithm but
SQL> -- we can get a reliable density using the following query
SQL> -- based on work done by Alberto Dell'Era:
SQL> --
SQL> SELECT
2 BktCnt
3 ,PopBktCnt
4 ,PopValCnt
5 ,NDV
6 ,pop_bucketSize
7 ,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity
8 FROM
9 (SELECT
10 COUNT(1) PopValCnt,
11 SUM(endpoint_repeat_count) PopBktCnt,
12 ndv,
13 BktCnt,
14 pop_bucketSize
15 FROM
16 (SELECT
17 (sample_size - num_nulls) BktCnt,
18 num_distinct ndv,
19 num_buckets,
20 density OldDensity,
21 (sample_size-num_nulls)/num_buckets pop_bucketSize
22 FROM user_tab_col_statistics
23 WHERE
24 table_name = 'T1'
25 AND column_name = 'N1'
26 ),
27 user_histograms
28 WHERE table_name = 'T1'
29 AND column_name = 'N1'
30 AND endpoint_repeat_count> pop_bucketSize
31 GROUP BY ndv,
32 BktCnt,
33 pop_bucketSize
34 );
BKTCNT POPBKTCNT POPVALCNT NDV POP_BUCKETSIZE NEWDENSITY
---------- ---------- ---------- ---------- -------------- ----------
100 33 5 37 5 .0209375
Elapsed: 00:00:00.19
SQL>
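The NewDensity arithmetic in the query above can be reproduced directly; plugging in the transcript’s figures returns the same .0209375 (the function name is ours, for illustration):

```python
def new_density(bkt_cnt, pop_bkt_cnt, ndv, pop_val_cnt):
    """NewDensity = ((BktCnt - PopBktCnt) / BktCnt) / (NDV - PopValCnt),
    per the Dell'Era-style query above."""
    return ((bkt_cnt - pop_bkt_cnt) / bkt_cnt) / (ndv - pop_val_cnt)

# BktCnt=100, PopBktCnt=33, NDV=37, PopValCnt=5 -> .0209375
```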
We now see if our calculation returned a valid value by ‘plugging’ it into the following equation: E-Rows = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size). The cardinality estimate we compute is:
SQL> --
SQL> -- 'Plugging' this new density into the equation above produces:
SQL> --
SQL> -- E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size)
SQL> -- E-Rows = 100 * greatest (.0209375, 1/100) = 2.09375 ~ 2
SQL> --
SQL> -- which matches the value returned by Oracle
SQL> --
SQL>
SQL> --
SQL> -- To validate these findings use a different non-popular value
SQL> -- that's also an endpoint:
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 43;
Explained.
Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL| T1 | 3 | 9 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------
2 - filter("N1"=43)
14 rows selected.
Elapsed: 00:00:00.02
SQL>
Oracle returned 3 as the estimated cardinality, and so did our calculation: more evidence that our understanding of the Hybrid histogram is correct. One final test remains, considering non-popular values that are not endpoints. Using our calculated value of NewDensity in the equation E-Rows = num_rows * NewDensity gives the following result: E-Rows = 100 * .0209375 = 2.09375 ~ 2. Time to see if our calculated value matches Oracle’s estimate:
SQL>
SQL> --
SQL> -- Case 3: Non-popular value without an endpoint number
SQL> --
SQL> -- This calculation is fairly simple and straightforward:
SQL> --
SQL> -- E-Rows = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ 2
SQL> --
SQL>
SQL> explain plan for select count(1) from t1 where n1 = 17;
Explained.
Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 3 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 3 | | |
|* 2 | TABLE ACCESS FULL| T1 | 2 | 6 | 3 (0)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------
2 - filter("N1"=17)
14 rows selected.
Elapsed: 00:00:00.02
SQL>
Our calculated value, as evidenced by the Rows value in the execution plan, matches Oracle’s estimate, giving further evidence that our concept of the Hybrid histogram is sound.
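The three cases can be pulled together into a single sketch of the estimator (our own illustrative function, fed the transcript’s numbers; not Oracle’s internal code):

```python
def estimate_rows(value, endpoints, num_rows, sample_size,
                  num_buckets, new_density):
    """Hybrid-histogram cardinality estimate for an equality predicate.
    endpoints maps endpoint value -> repeat count."""
    bucket_size = sample_size / num_buckets
    rpt = endpoints.get(value)
    if rpt is None:
        # Case 3: non-popular value not present in the histogram
        return round(num_rows * new_density)
    if rpt > bucket_size:
        # Case 1: popular value
        return round(rpt * num_rows / sample_size)
    # Case 2: non-popular value that is an endpoint
    return round(num_rows * max(new_density, rpt / sample_size))
```

With 100 rows, a sample size of 100, 20 buckets and NewDensity = .0209375, this reproduces all three estimates from the plans above: 8 for the popular value 33, 2 for the endpoint 45, 3 for the endpoint 43, and 2 for the non-endpoint 17.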
As mentioned earlier in this article, Hybrid histograms can exhibit instability in the form of endpoint changes as inserts against the table alter the frequency of values. To illustrate this let’s add another 16 to the data and see what happens:
SQL>
SQL> --
SQL> -- Hybrid histograms can display instability in terms of
SQL> -- endpoints and bucket size
SQL> --
SQL> -- To illustrate this we add 16 to the data, increasing
SQL> -- the repeat count to 3, causing this newly added 16
SQL> -- to shift into the previous bucket leaving 17
SQL> -- as the new endpoint
SQL> --
SQL> insert into t1 values (2, 16);
1 row created.
Elapsed: 00:00:00.00
SQL>
SQL> BEGIN
2 dbms_stats.gather_table_stats
3 (user, 't1', method_opt => 'for all columns size 20');
4 END;
5 /
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.08
SQL>
SQL> select
2 uth.endpoint_number
3 ,uth.endpoint_actual_value
4 ,uth.endpoint_repeat_count
5 from
6 user_tab_histograms uth
7 ,user_tables ut
8 ,user_tab_col_statistics ucs
9 where
10 uth.table_name = 'T1'
11 and uth.column_name = 'N1'
12 and uth.table_name = ut.table_name
13 and ut.table_name = ucs.table_name
14 and uth.column_name = ucs.column_name
15 order by uth.endpoint_number;
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
1 8 1
6 13 3
11 17 1 <----
16 19 3
21 20 5
27 23 2
33 26 3
39 27 6
45 28 6
51 29 6
59 31 5
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
70 33 8
80 35 7
87 38 5
91 41 1
96 43 3
97 44 1
98 45 1
99 46 1
101 59 1
20 rows selected.
Elapsed: 00:00:00.24
The endpoint value of 18 has been replaced by the value 17, because no value may appear in more than one bucket. Before the insert the buckets looked like this:

8 12 12 13 13 13 | 15 16 16 17 18 18 | 19 19 19 20 20 20 20 20 | …

After the insert the ordered value list looked like this:

8 12 12 13 13 | 13 15 16 16 16 | 17 18 18 19 19 19 …

After Oracle ‘walks’ through the list to consolidate like values the buckets look like this:

8 12 12 13 13 13 | 15 16 16 16 17 | 18 18 19 19 19 …

giving a new endpoint value of 17. If we add yet another 16 to the data set we see:
SQL>
SQL> --
SQL> -- Add one more 16 and the endpoint shifts to 16
SQL> --
SQL> insert into t1 values (3, 16);
1 row created.
Elapsed: 00:00:00.00
SQL>
SQL> BEGIN
2 dbms_stats.gather_table_stats
3 (user, 't1', method_opt => 'for all columns size 20');
4 END;
5 /
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.03
SQL>
SQL> select
2 uth.endpoint_number
3 ,uth.endpoint_actual_value
4 ,uth.endpoint_repeat_count
5 from
6 user_tab_histograms uth
7 ,user_tables ut
8 ,user_tab_col_statistics ucs
9 where
10 uth.table_name = 'T1'
11 and uth.column_name = 'N1'
12 and uth.table_name = ut.table_name
13 and ut.table_name = ucs.table_name
14 and uth.column_name = ucs.column_name
15 order by uth.endpoint_number;
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
1 8 1
6 13 3
11 16 4 <-----
17 19 3
22 20 5
28 23 2
34 26 3
40 27 6
46 28 6
52 29 6
60 31 5
ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
71 33 8
81 35 7
88 38 5
92 41 1
97 43 3
98 44 1
99 45 1
100 46 1
102 59 1
20 rows selected.
Elapsed: 00:00:00.22
SQL>
The endpoint has shifted again and is now 16.
Hybrid histograms combine the accuracy of frequency histograms with the space savings of height-balanced histograms by consolidating all instances of a value into a single bucket. This produces more values the optimizer considers ‘popular’, which in turn gives the optimizer a better opportunity to generate an optimal execution path. Even with the demonstrated instability of endpoint values, the optimizer is more likely to produce a ‘good’ execution plan using a Hybrid histogram. It’s a histogram option worth serious consideration.