Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Slideshows
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted September 26, 2016

Hybrid Histograms in Oracle 12c

By David Fitzjarrell

A useful new feature in Oracle 12c (12.1.0.2) is the Hybrid histogram, a new type of histogram that uses a reduced number of buckets compared to the number of distinct values in the table data and utilizes the repeat frequency of the endpoint values. This type of histogram can adjust the bucket size based on endpoint values due to the criteria Oracle sets when it generates such a histogram. Unlike other histograms Oracle can generate, the Hybrid histogram is based on four 'rules', which are:

  1. a value should not be found in more than one bucket
  2. the bucket size is allowed to be extended in order to contain all instances of the same distinct value
  3. adjusted bucket size cannot be less than the original size (not applicable at either end of the data set)
  4. the original number of buckets should not be reduced

Let's look at how Oracle decides on, and creates, a Hybrid histogram. Additionally we'll look at the instability of a Hybrid histogram as values are added to the data set. We take the table and data from an example by Jonathan Lewis:


SQL>
SQL> --
SQL> -- Create and populate a table so we can
SQL> -- build a hybrid histogram
SQL> --
SQL> -- Thanks to Jonathan Lewis for generating
SQL> -- this table and data set
SQL> --
SQL> create table t1 (id number, n1 number);

Table created.

Elapsed: 00:00:00.01
SQL>
SQL> @InsT1.sql

Elapsed: 00:00:00.00
SQL>

To create a hybrid histogram we choose a number of buckets less than the number of distinct values in the table:


SQL>
SQL> --
SQL> -- Return the number of distinct values
SQL> -- in column n1
SQL> --
SQL>
SQL> select count(distinct n1) from t1;

COUNT(DISTINCTN1)
-----------------
               37

Elapsed: 00:00:00.00
SQL>

Let's choose 20 buckets for our histogram and use DBMS_STATS to create it; let's also trace the DBMS_STATS session to see what decisions it makes with respect to the data:


SQL>
SQL> --
SQL> -- Create a hybrid histogram on column n1
SQL> -- by setting the number of buckets less
SQL> -- than the number of distinct values
SQL> --
SQL> -- Trace the DBMS_STATS session to see if
SQL> -- a hybrid histogram is selected and
SQL> -- generated
SQL> --
SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (16));

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL> BEGIN
  2  	dbms_stats.gather_table_stats
  3  	  (user, 't1', method_opt => 'for columns n1 size 20');
  4  END;
  5  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.43
SQL> exec dbms_stats.set_global_prefs('TRACE', null);

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.01
SQL>

The trace file generated by DBMS_STATS shows that a Hybrid histogram was chosen:

Trace file C:\APP\ORADBA\diag\rdbms\delphixdb\delphixdb\trace\delphixdb_ora_14648.trc
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
Windows NT Version V6.2
CPU                 : 4 - type 8664, 2 Physical Cores
Process Affinity    : 0x0x0000000000000000
Memory (Avail/Total): Ph:10465M/20176M, Ph+PgF:15751M/40656M
Instance name: delphixdb
Redo thread mounted by this instance: 1
Oracle process number: 25
Windows thread id: 14648, image: ORACLE.EXE (SHAD)


*** 2016-09-08 16:19:41.944
*** SESSION ID:(124.22872) 2016-09-08 16:19:41.944
*** CLIENT ID:() 2016-09-08 16:19:41.944
*** SERVICE NAME:(SYS$USERS) 2016-09-08 16:19:41.944
*** MODULE NAME:(SQL*Plus) 2016-09-08 16:19:41.944
*** CLIENT DRIVER:(SQL*PLUS) 2016-09-08 16:19:41.944
*** ACTION NAME:() 2016-09-08 16:19:41.944

DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:   Y         Y                                                 ID
DBMS_STATS:   Y    Y    Y    Y    Y    Y    Y                        Y    N1
DBMS_STATS: Approximate NDV Options
DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible
                       (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:                                                               ID
DBMS_STATS:                       Y    Y    Y                        Y    N1
DBMS_STATS: Building Histogram for N1
DBMS_STATS:  bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20
DBMS_STATS:  Trying hybrid histogram
DBMS_STATS:  > cdn 100, popFreq 33, popCnt 5, bktSize 4.71428571428571428571428571428571428571, bktSzFrc .71428571428571428571428571428571428571
DBMS_STATS:  Evaluating hybrid histogram:  cht.count 20, mnb 20, ssize 100, min_ssize 2500, appr_ndv  TRUE, ndv 37, selNdv 1, selFreq 6, pct 100, avg_bktsize 5, csr.hreq TRUE, normalize TRUE
DBMS_STATS:   Histogram gathering flags: 7
DBMS_STATS:  Accepting histogram
DBMS_STATS: Start fill_cstats - hybrid_enabled: TRUE
DBMS_STATS: ====================================================================================================
DBMS_STATS: Statistics from clist:
DBMS_STATS: ====================================================================================================
DBMS_STATS: Number of rows in the table = 100
DBMS_STATS: COL_NAME                       ACL       NV        SNNVDV    NDV       SNDV      DENS      CCNT
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: ID                             3         0         100       NULL      NULL      0         0
DBMS_STATS:          min:
DBMS_STATS:          max:
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: COL_NAME                       ACL       NV        SNNVDV    NDV       SNDV      DENS      CCNT
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: N1                             3         0         100       37        37        .026167   1
DBMS_STATS:          min: Typ=2 Len=2: c1,9
DBMS_STATS:          max: Typ=2 Len=2: c1,3c
DBMS_STATS:          Histograms:
DBMS_STATS:          ---------------------------------------------------------------------------------------------------
DBMS_STATS:          BVAL   RPCNT    EAVAL                                                            ENVAL    EDVAL
DBMS_STATS:          ---------------------------------------------------------------------------------------------------
DBMS_STATS:          1      1        C109                                                             8        Typ=2 Len=2: c1,9
DBMS_STATS:          6      3        C10E                                                             13       Typ=2 Len=2: c1,e
DBMS_STATS:          12     2        C113                                                             18       Typ=2 Len=2: c1,13
DBMS_STATS:          20     5        C115                                                             20       Typ=2 Len=2: c1,15
DBMS_STATS:          26     2        C118                                                             23       Typ=2 Len=2: c1,18
DBMS_STATS:          32     3        C11B                                                             26       Typ=2 Len=2: c1,1b
DBMS_STATS:          38     6        C11C                                                             27       Typ=2 Len=2: c1,1c
DBMS_STATS:          44     6        C11D                                                             28       Typ=2 Len=2: c1,1d
DBMS_STATS:          50     6        C11E                                                             29       Typ=2 Len=2: c1,1e
DBMS_STATS:          58     5        C120                                                             31       Typ=2 Len=2: c1,20
DBMS_STATS:          69     8        C122                                                             33       Typ=2 Len=2: c1,22
DBMS_STATS:          79     7        C124                                                             35       Typ=2 Len=2: c1,24
DBMS_STATS:          86     5        C127                                                             38       Typ=2 Len=2: c1,27
DBMS_STATS:          90     1        C12A                                                             41       Typ=2 Len=2: c1,2a
DBMS_STATS:          92     2        C12B                                                             42       Typ=2 Len=2: c1,2b
DBMS_STATS:          95     3        C12C                                                             43       Typ=2 Len=2: c1,2c
DBMS_STATS:          96     1        C12D                                                             44       Typ=2 Len=2: c1,2d
DBMS_STATS:          97     1        C12E                                                             45       Typ=2 Len=2: c1,2e
DBMS_STATS:          98     1        C12F                                                             46       Typ=2 Len=2: c1,2f
DBMS_STATS:          100    1        C13C                                                             59       Typ=2 Len=2: c1,3c
DBMS_STATS:          Need Actual Values (DSC_EAVS)
DBMS_STATS:          Histogram Type: HYBRID Data Type: 2
DBMS_STATS:          Histogram Flags: 4 Histogram Gathering Flags: 6
DBMS_STATS:          Incremental: FALSE Fix Control 13583722: 1
DBMS_STATS: ----------------------------------------------------------------------------------------------------
DBMS_STATS: arlen = 6 ssize = 100

That decision was based on the data, and the fact that this data set doesn't qualify for a TOP=Frequency histogram because the count for the top 20 values (20 being the number of buckets in the histogram) is less than the threshold calculated for that number of buckets:


SQL>
SQL> --
SQL> -- Return the count for the TOP 20 values
SQL> -- in the table
SQL> --
SQL> -- Used to see if the data qualifies for
SQL> -- a TOP-Frequency histogram
SQL> --
SQL> select
  2  	      sum (cnt) TopNRows
  3  from (select
  4  		 n1
  5  		,count(*) cnt
  6  	      from t1
  7  	      group by n1
  8  	       order by count(*) desc
  9  	  )
 10  where rownum <= 20;

  TOPNROWS
----------
        80

Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- Return the threshold value for a TOP-Frequency histogram
SQL> -- for this data set
SQL> --
SQL> -- The formula is: round((n-1)/n * 100)
SQL> --
SQL>
SQL> select round ((20-1)/20 * 100) threshold from dual;

 THRESHOLD
----------
        95

Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- The threshold is greater than the count for this
SQL> -- data set
SQL> --
SQL> -- The data does not qualify for a TOP-Frequency histogram,
SQL> -- but does qualify for a Hybrid histogram
SQL> --
SQL> -- Compute the bucket size for the hybrid histogram
SQL> -- for this data set and display the relevant
SQL> -- column statistics
SQL> --
SQL> select
  2  	    column_name
  3  	   ,num_distinct
  4  	   ,num_buckets
  5  	   ,sample_size
  6  	   ,round(sample_size/num_buckets,0) orig_bucket_size
  7  	   ,histogram
  8  	 from
  9  	    user_tab_col_statistics
 10  	 where table_name = 'T1'
 11  	 and column_name  ='N1';

COLUMN_NAME                         NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE ORIG_BUCKET_SIZE HISTOGRAM
----------------------------------- ------------ ----------- ----------- ---------------- ---------------
N1                                            37          20         100                5 HYBRID

Elapsed: 00:00:00.20
SQL>

The histogram creation isn't finished, as Oracle now orders the column values in order to find values that cross buckets. Once such values are found new bucket sizes are computed for all but the boundary buckets, in this case bucket 0 and bucket 19:


SQL>
SQL> --
SQL> -- Four rules exist to create a hybrid histogram, and the
SQL> -- first two are:
SQL> --
SQL> -- a value should not be found in more than one bucket
SQL> -- a bucket size is allowed to be extended in order to contain
SQL> --   all instances of the same value
SQL> --
SQL> -- Compute the 'new' bucket size based on the data and
SQL> -- endpoint values
SQL> --
SQL> -- To illustrate how the bucket size adjustment is
SQL> -- calculated let's look at the value 13 in the data
SQL> --
SQL> -- Oracle 'walks' the ordered list of values until
SQL> -- it finds 13 in two buckets
SQL> --
SQL> -- Oracle extends the bucket where 13 is first found (bucket n)
SQL> -- and moves all other instances of 13 (from bucket n+1)
SQL> -- to this bucket (n)
SQL> --
SQL> -- Moving all instances of 13 to a single bucket
SQL> -- also makes 13 an endpoint value allowing Oracle
SQL> -- to compute the repeat count of that value
SQL> --
SQL> -- This continues on through the values so that
SQL> -- Oracle can place all occurrences of a distinct
SQL> -- value so they don't populate more than one bucket
SQL> --
SQL> SELECT
  2  	     (row_number() over(order by ept_nbr)-1) NumBucket
  3  	     ,ept_nbr
  4  	     ,ept_act_val
  5  	     ,rpt_cnt
  6  	     ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
  7  	     ,bucket_size "original bucket_size"
  8  	 FROM
  9  	     (SELECT
 10  		  ah.endpoint_number		ept_nbr
 11  		 ,ah.endpoint_actual_value	ept_act_val
 12  		 ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
 13  		 ,ah.endpoint_repeat_count rpt_cnt
 14  		 ,at.sample_size/at.num_buckets bucket_size
 15  	      FROM
 16  		 user_tab_histograms	  ah
 17  		,user_tab_col_statistics  at
 18  	      WHERE ah.table_name  = at.table_name
 19  	      AND ah.column_name = at.column_name
 20  	      AND ah.table_name  = 'T1'
 21  	      AND ah.column_name = 'N1'
 22  	    ) ORDER BY ept_nbr;

 NUMBUCKET    EPT_NBR EPT_ACT_VAL     RPT_CNT new bucket size original bucket_size
---------- ---------- ------------ ---------- --------------- --------------------
         0          1 8                     1               1                    5
         1          6 13                    3               5                    5
         2         12 18                    2               6                    5
         3         20 20                    5               8                    5
         4         26 23                    2               6                    5
         5         32 26                    3               6                    5
         6         38 27                    6               6                    5
         7         44 28                    6               6                    5
         8         50 29                    6               6                    5
         9         58 31                    5               8                    5
        10         69 33                    8              11                    5

 NUMBUCKET    EPT_NBR EPT_ACT_VAL     RPT_CNT new bucket size original bucket_size
---------- ---------- ------------ ---------- --------------- --------------------
        11         79 35                    7              10                    5
        12         86 38                    5               7                    5
        13         90 41                    1               4                    5
        14         92 42                    2               2                    5
        15         95 43                    3               3                    5
        16         96 44                    1               1                    5
        17         97 45                    1               1                    5
        18         98 46                    1               1                    5
        19        100 59                    1               2                    5

20 rows selected.

Elapsed: 00:00:00.21
SQL>

Notice that for buckets 1-12 the bucket size increased; for the remaining non-boundary buckets the bucket size should decrease. But according to Rule 3 no bucket can be adjusted to a size smaller than the original bucket size so those buckets remain unadjusted. The bucket size computation in the query results below is based on the sample size divided by the number of buckets and, thus, displays a constant value for all buckets:


SQL>
SQL> --
SQL> -- Given that boundary values are the exception computing the new
SQL> -- bucket sizes shows that for bucket numbers 1-12 the sizes are
SQL> -- the same or larger than the originally computed size (5)
SQL> --
SQL> -- All remaining non-endpoint buckets compute to a smaller size
SQL> -- than the original
SQL> --
SQL> -- Now the third rule appears:
SQL> --
SQL> -- No non-endpoint bucket can be smaller than the originally
SQL> -- computed size
SQL> --
SQL> select
  2  	      uth.endpoint_number
  3  	     ,uth.endpoint_actual_value
  4  	     ,uth.endpoint_repeat_count
  5  	     ,ucs.sample_size/ucs.num_buckets bucket_size
  6  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
  7  	 from
  8  	     user_tab_histograms uth
  9  	    ,user_tab_col_statistics ucs
 10  	where
 11  	     uth.table_name   = ucs.table_name
 12  	 and uth.column_name   = ucs.column_name
 13  	 and uth.table_name    = 'T1'
 14  	 and uth.column_name   = 'N1'
 15  	order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ------------ --------------------- ----------- ----------
              1 8                                1           5         -4
              6 13                               3           5         -2
             12 18                               2           5         -3
             20 20                               5           5          0
             26 23                               2           5         -3
             32 26                               3           5         -2
             38 27                               6           5          1
             44 28                               6           5          1
             50 29                               6           5          1
             58 31                               5           5          0
             69 33                               8           5          3

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ------------ --------------------- ----------- ----------
             79 35                               7           5          2
             86 38                               5           5          0
             90 41                               1           5         -4
             92 42                               2           5         -3
             95 43                               3           5         -2
             96 44                               1           5         -4
             97 45                               1           5         -4
             98 46                               1           5         -4
            100 59                               1           5         -4

20 rows selected.

Elapsed: 00:00:00.19
SQL>

For a Hybrid histogram three types of column value exist to base cardinality estimates upon: popular values, non-popular values that are endpoint and non-popular values that are not an endpoint. Let's look at how Oracle estimates the cardinality for each case, starting with popular values:


SQL>
SQL> --
SQL> -- Display the endpoint, value, repeat count,
SQL> -- bucket size and the 'popularity'
SQL> --
SQL> -- Popularity is determined by computing the
SQL> -- difference between the frequency of appearance
SQL> -- (repeat count) and the bucket size (sample size/number of buckets)
SQL> -- Positive values are considered 'popular'
SQL> --
SQL> select
  2  	      endpoint_number
  3  	     ,endpoint_actual_value
  4  	     ,endpoint_repeat_count
  5  	     ,bucket_size
  6  	     ,case when Popularity > 0 then 'Pop'
  7  			else 'Non-Pop'
  8  	       end Popularity
  9  	 from
 10  	(
 11  	  select
 12  	      uth.endpoint_number
 13  	     ,uth.endpoint_actual_value
 14  	     ,uth.endpoint_repeat_count
 15  	     ,ucs.sample_size/ucs.num_buckets bucket_size
 16  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
 17  	 from
 18  	     user_tab_histograms uth
 19  	    ,user_tab_col_statistics ucs
 20  	where
 21  	     uth.table_name   = ucs.table_name
 22  	 and uth.column_name   = ucs.column_name
 23  	 and uth.table_name    = 'T1'
 24  	 and uth.column_name   = 'N1'
 25  	 )
 26  	order by endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ------------ --------------------- ----------- -------
              1 8                                1           5 Non-Pop
              6 13                               3           5 Non-Pop
             12 18                               2           5 Non-Pop
             20 20                               5           5 Non-Pop
             26 23                               2           5 Non-Pop
             32 26                               3           5 Non-Pop
             38 27                               6           5 Pop
             44 28                               6           5 Pop
             50 29                               6           5 Pop
             58 31                               5           5 Non-Pop
             69 33                               8           5 Pop

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ------------ --------------------- ----------- -------
             79 35                               7           5 Pop
             86 38                               5           5 Non-Pop
             90 41                               1           5 Non-Pop
             92 42                               2           5 Non-Pop
             95 43                               3           5 Non-Pop
             96 44                               1           5 Non-Pop
             97 45                               1           5 Non-Pop
             98 46                               1           5 Non-Pop
            100 59                               1           5 Non-Pop

20 rows selected.

Elapsed: 00:00:00.19
SQL>

Let's return only the popular values and work through the cardinality estimate Oracle generates:


SQL> --
SQL> -- Using the popularity Oracle estimates the cardinality
SQL> -- by considering the following three types of values:
SQL> --
SQL> -- popular value
SQL> -- non-popular value with an endpoint number
SQL> -- non-popular value not present in the histogram table
SQL> --
SQL> -- Case 1: Popular values
SQL> --
SQL> -- Display the 'popular' values in this data set
SQL> --
SQL> select
  2  	      endpoint_actual_value
  3  	     ,endpoint_repeat_count
  4  	     ,bucket_size
  5  	     ,Popularity
  6  from
  7  (
  8    select
  9  	      endpoint_number
 10  	     ,endpoint_actual_value
 11  	     ,endpoint_repeat_count
 12  	     ,bucket_size
 13  	     ,case when Popularity > 0 then 'Pop'
 14  			else 'Non-Pop'
 15  	       end Popularity
 16  	 from
 17  	(
 18  	  select
 19  	      uth.endpoint_number
 20  	     ,uth.endpoint_actual_value
 21  	     ,uth.endpoint_repeat_count
 22  	     ,ucs.sample_size/ucs.num_buckets bucket_size
 23  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
 24  	 from
 25  	     user_tab_histograms uth
 26  	    ,user_tab_col_statistics ucs
 27  	where
 28  	     uth.table_name   = ucs.table_name
 29  	 and uth.column_name   = ucs.column_name
 30  	 and uth.table_name    = 'T1'
 31  	 and uth.column_name   = 'N1'
 32  	 )
 33    )
 34  where Popularity = 'Pop';

ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
27                               6           5 Pop
28                               6           5 Pop
29                               6           5 Pop
33                               8           5 Pop
35                               7           5 Pop

Elapsed: 00:00:00.20
SQL>

Looking at one of these popular values (33) let's return the cardinality estimate Oracle calculated:


SQL> --
SQL> -- Using explain plan the cardinality estimation
SQL> -- can be displayed for one of the 'popular' values
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 33;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> --
SQL> -- The cardinality estimate displayed below was
SQL> -- calculated using the following formula:
SQL> --
SQL> -- E-Rows = ENDPOINT_REPEAT_COUNT * num_rows/sample_size
SQL> -- E-Rows = 8 * 100/100 = 8
SQL> --
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     8 |    24 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - filter("N1"=33)

14 rows selected.

Elapsed: 00:00:00.03
SQL>

From the formula found in the above comments our calculated cardinality matches with that provided by the optimizer, so our understanding of the Hybrid histogram seems sound. To continue that verification we consider the case where we have a non-popular value that is an endpoint. The formula for that cardinality calculation is: E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size). We calculate the NewDensity with a query based on the work of Alberto Dell'Era. First we return the non-popular values that are endpoints:


SQL> --
SQL> -- Case 2: Non-popular values with an endpoint
SQL> --
SQL> -- First return all non-popular values that are an
SQL> -- endpoint
SQL> --
SQL> select
  2  	      endpoint_actual_value
  3  	     ,endpoint_repeat_count
  4  	     ,bucket_size
  5  	     ,Popularity
  6  from
  7  (
  8    select
  9  	      endpoint_number
 10  	     ,endpoint_actual_value
 11  	     ,endpoint_repeat_count
 12  	     ,bucket_size
 13  	     ,case when Popularity > 0 then 'Pop'
 14  			else 'Non-Pop'
 15  	       end Popularity
 16  	 from
 17  	(
 18  	  select
 19  	      uth.endpoint_number
 20  	     ,uth.endpoint_actual_value
 21  	     ,uth.endpoint_repeat_count
 22  	     ,ucs.sample_size/ucs.num_buckets bucket_size
 23  	     ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
 24  	 from
 25  	     user_tab_histograms uth
 26  	    ,user_tab_col_statistics ucs
 27  	where
 28  	     uth.table_name   = ucs.table_name
 29  	 and uth.column_name   = ucs.column_name
 30  	 and uth.table_name    = 'T1'
 31  	 and uth.column_name   = 'N1'
 32  	 )
 33    )
 34  where Popularity = 'Non-Pop';

ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
8                                1           5 Non-Pop
13                               3           5 Non-Pop
18                               2           5 Non-Pop
20                               5           5 Non-Pop
23                               2           5 Non-Pop
26                               3           5 Non-Pop
31                               5           5 Non-Pop
38                               5           5 Non-Pop
41                               1           5 Non-Pop
42                               2           5 Non-Pop
43                               3           5 Non-Pop

ENDPOINT_ACT ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
------------ --------------------- ----------- -------
44                               1           5 Non-Pop
45                               1           5 Non-Pop
46                               1           5 Non-Pop
59                               1           5 Non-Pop

15 rows selected.

Elapsed: 00:00:00.20
SQL>

Let's return Oracle's estimated cardinality for one of these values (45):


SQL> --
SQL> -- Use explain plan to return the estimated cardinality
SQL> --
SQL> -- In this case the estimated cardinality is computed as:
SQL> --
SQL> -- E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 45;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |     6 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - filter("N1"=45)

14 rows selected.

Elapsed: 00:00:00.02
SQL>

Now let's use our query to return the value for NewDensity:


SQL> --
SQL> -- The NewDensity is computed by an internal algorithm but
SQL> -- we can get a reliable density using the following query
SQL> -- based on work done by Alberto Dell'Era:
SQL> --
SQL> SELECT
  2  	      BktCnt
  3  	     ,PopBktCnt
  4  	     ,PopValCnt
  5  	     ,NDV
  6  	     ,pop_bucketSize
  7  	     ,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity
  8  	  FROM
  9  	     (SELECT
 10  		COUNT(1) PopValCnt,
 11  		SUM(endpoint_repeat_count) PopBktCnt,
 12  		ndv,
 13  		BktCnt,
 14  		pop_bucketSize
 15  	      FROM
 16  	       (SELECT
 17  		 (sample_size - num_nulls) BktCnt,
 18  		 num_distinct ndv,
 19  		 num_buckets,
 20  		 density OldDensity,
 21  		 (sample_size-num_nulls)/num_buckets pop_bucketSize
 22  	       FROM user_tab_col_statistics
 23  	       WHERE
 24  		   table_name  = 'T1'
 25  	       AND column_name = 'N1'
 26  	       ),
 27  	       user_histograms
 28  	     WHERE table_name	      = 'T1'
 29  	     AND column_name	      = 'N1'
 30  	     AND endpoint_repeat_count> pop_bucketSize
 31  	     GROUP BY ndv,
 32  	       BktCnt,
 33  	       pop_bucketSize
 34  	     );

    BKTCNT  POPBKTCNT  POPVALCNT        NDV POP_BUCKETSIZE NEWDENSITY
---------- ---------- ---------- ---------- -------------- ----------
       100         33          5         37              5   .0209375

Elapsed: 00:00:00.19
SQL>

We now see if our calculation returned a valid value by 'plugging' it into the following equation: E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size). The cardinality estimate we compute is:


SQL> --
SQL> -- 'Plugging' this new density into the equation above produces:
SQL> --
SQL> -- E-Rows = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size)
SQL> -- E-Rows = 100 * greatest (.0209375, 1/100) = 2.09375 ~ 2
SQL> --
SQL> -- which matches the value returned by Oracle
SQL> --
SQL>
SQL> --
SQL> -- To validate these findings use a different non-popular value
SQL> -- that's also an endpoint:
SQL> --
SQL> explain plan for select count(1) from t1 where n1 = 43;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     3 |     9 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - filter("N1"=43)

14 rows selected.

Elapsed: 00:00:00.02
SQL>

Oracle returned 3 as the estimated cardinality and so did our calculation, more proof that our concept of the Hybrid histogram is correct. One final test will prove how well we understand this histogram, considering non-popular values that are not endpoints. Using our calculated value of NewDensity in this equation: E-Rows = num_rows * NewDensity gives us the following result: E-Rows = 100 * .0209375 = 2.09375 ~ 2. Time to see if our calculated value matches that which Oracle has estimated:


SQL>
SQL> --
SQL> -- Case 3: Non-popular value without an endpoint number
SQL> --
SQL> -- This calculation is fairly simple and straightforward:
SQL> --
SQL> -- E-Rows = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ 2
SQL> --
SQL>
SQL> explain plan for select count(1) from t1 where n1 = 17;

Explained.

Elapsed: 00:00:00.00
SQL>
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |     6 |     3   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - filter("N1"=17)

14 rows selected.

Elapsed: 00:00:00.02
SQL>

Our calculated value, as evidenced by the Rows value in the execution plan, matches Oracle's estimate, giving further evidence that our contcspt of the Hybrid histogram is sound.

`

As mentioned earlier in this article Hybrid histograms can exhibit instability in the form of endpoint changes based on the changing frequency of values as inserts are executed againt the table. To illustrate this let's add another 16 to the data and see what happens:


SQL>
SQL> --
SQL> -- Hybrid histograms can display instability in terms of
SQL> -- endpoints and bucket size
SQL> --
SQL> -- To illustrate this we add 16 to the data, increasing
SQL> -- the repeat count to 3, causing this newly added 16
SQL> -- to shift into the previous bucket leaving 17
SQL> -- as the new endpoint
SQL> --
SQL> insert into t1 values (2, 16);

1 row created.

Elapsed: 00:00:00.00
SQL>
SQL> BEGIN
  2  	dbms_stats.gather_table_stats
  3  	  (user, 't1', method_opt => 'for all columns size 20');
  4  END;
  5  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.08
SQL>
SQL> select
  2  	      uth.endpoint_number
  3  	     ,uth.endpoint_actual_value
  4  	     ,uth.endpoint_repeat_count
  5  	 from
  6  	    user_tab_histograms uth
  7  	    ,user_tables ut
  8  	    ,user_tab_col_statistics ucs
  9  	where
 10  	     uth.table_name    = 'T1'
 11  	    and uth.column_name   = 'N1'
 12  	    and uth.table_name	  = ut.table_name
 13  	    and ut.table_name	  = ucs.table_name
 14  	    and uth.column_name   = ucs.column_name
 15  	   order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
              1 8                                1
              6 13                               3
             11 17                               1 <----
             16 19                               3
             21 20                               5
             27 23                               2
             33 26                               3
             39 27                               6
             45 28                               6
             51 29                               6
             59 31                               5

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
             70 33                               8
             80 35                               7
             87 38                               5
             91 41                               1
             96 43                               3
             97 44                               1
             98 45                               1
             99 46                               1
            101 59                               1

20 rows selected.

Elapsed: 00:00:00.24

The endpoint value of 18 has been replaced by the value 17, because of Rule 2, that no value can span buckets. Before the insert the histogram bucket looked like this: 8 12 12 13 13 13|15 16 16 17 18 18|19 19 19 20 20 20 20 20| ... After the insert the ordered value list looked like this: 8 12 12 13 13|13 15 16 16 16|17 18 18 19 19 19 ... After Oracle 'walks' through the list to consolidate like values the buckets look like this: 8 12 12 13 13 13| 15 16 16 16 17 |18 18 19 19 19... giving a new endpoint number of 17. If we add yet another 16 to the data set we see:


SQL>
SQL> --
SQL> -- Add one more 16 and the endpoint shifts to 16
SQL> --
SQL> insert into t1 values (3, 16);

1 row created.

Elapsed: 00:00:00.00
SQL>
SQL> BEGIN
  2  	   dbms_stats.gather_table_stats
  3  	       (user, 't1', method_opt => 'for all columns size 20');
  4  	 END;
  5  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> select
  2  	      uth.endpoint_number
  3  	     ,uth.endpoint_actual_value
  4  	     ,uth.endpoint_repeat_count
  5  	 from
  6  	    user_tab_histograms uth
  7  	    ,user_tables ut
  8  	    ,user_tab_col_statistics ucs
  9  	where
 10  	     uth.table_name    = 'T1'
 11  	    and uth.column_name   = 'N1'
 12  	    and uth.table_name	  = ut.table_name
 13  	    and ut.table_name	  = ucs.table_name
 14  	    and uth.column_name   = ucs.column_name
 15  	   order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
              1 8                                1
              6 13                               3
             11 16                               4 <-----
             17 19                               3
             22 20                               5
             28 23                               2
             34 26                               3
             40 27                               6
             46 28                               6
             52 29                               6
             60 31                               5

ENDPOINT_NUMBER ENDPOINT_ACT ENDPOINT_REPEAT_COUNT
--------------- ------------ ---------------------
             71 33                               8
             81 35                               7
             88 38                               5
             92 41                               1
             97 43                               3
             98 44                               1
             99 45                               1
            100 46                               1
            102 59                               1

20 rows selected.

Elapsed: 00:00:00.22
SQL>

The endpoint has shifted again and is now 16.

Hybrid histograms combine the accuracy of frequency histograms along with the space savings provided by height-balanced histograms by the mechanism of consolidating all like values into a single bucket. This generates more values that are considered 'popular' by the optimizer, which in turn offers the optimizer a better opportunity to generate an optimal execution path. Even with the demonstrated instability of endpoint values the Optimizer is more likely to produce a 'good' execution plan using the Hybrid histogram. It's a histogram option worth serious consideration.

See all articles by David Fitzjarrell



Oracle Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM