For every timeless adage there is at least one mistaken thought that can become ‘truth’ simply by virtue of being repeated. Oracle is no exception, especially where single-column primary key indexes are involved. The conventional ‘wisdom’ dismisses a histogram on such an index as unnecessary, presuming that the primary key has no ‘holes’ and covers a continuous range of values. Reality very often flies in the face of that assumption; even with triggers and sequences it is highly unlikely that a primary key will form an unbroken, gap-free set of values. Absent a histogram, Oracle is left to its own devices and presumptions, which can take an execution path ‘into the weeds’. Let’s look at what Oracle expects to find, what Oracle can find, and how a histogram can help.
Oracle expects data sets to be uniform in distribution when no other information is available; create a table, load some data, skip the indexes and statistics, and Oracle will generate execution plans on the assumption of uniform data distribution, where keys are not clustered, have no gaps or holes and just generally ‘play nice’. This, of course, does not reflect the real world, which is why Oracle provides ways to generate statistics on table data with and without histograms. In the specific case of a single-column key whose values form a dense, gap-free range, a histogram is likely not necessary, but in many other cases involving single-column keys, especially primary keys, a histogram may provide tremendous help in generating a reasonable execution plan. Three of the more ‘popular’ histogram types are frequency, height-balanced and hybrid, the last of those available in versions 12.x and later. A histogram basically draws a ‘picture’ of the data that Oracle can use to better assess execution steps and produce acceptable plans. Since examples are usually good, let’s build one for this situation and see how query execution changes as the available information changes.
Histograms are, at their heart, a description of data distribution. Oracle records this statistical data as it scans the table so it can be used later to determine the best route of data access. If a table is created and populated with 1,000 rows containing random numeric values, the resulting ‘map’ should show no ‘hot spots’ where one value or a handful of values dominates the distribution. Repeating values in a key behave differently: the same 1,000 rows populated with a limited set of values (1 through 22, for example) are guaranteed to produce repeating, non-unique keys. Now let’s look at how a primary key is defined and why a histogram could be very helpful. A primary key:
- Is unique
- Contains no NULL values
which guarantees that each and every key will be populated with usable data. What it does NOT guarantee is that the set of key values:
- Is sequential
- Is gapless
- Has a uniform difference between adjacent key values
That last statement gives rise to the condition of key clustering: groups of values with small differences within the group but large ‘jumps’ between adjacent groups, making the data anything but uniformly distributed across the range of key values. For example, if in our example table the key is defined as my_key_num, a number, a very distinct possibility exists that gaps (holes) can be created even when no records are deleted, simply due to the way a cached sequence behaves. For this example, a sequence is created that starts with the number 1, increases by 1 for each request for a new value, and each instantiation of the sequence (one for each user session using that sequence) caches 20 values. Now presume that 12 user sessions are using that sequence; the first session to fetch from that sequence caches the first 20 values, the next session in line allocates the next 20 values, and so on until all 12 sessions have values cached. There is no possible way sequence values allocated in that manner will ever be ‘in order’, and the actual data may look more like:
1,41,71,21,31,81,91,61,111,101,121...
than
1,2,3,4,5,6,7,8,9,10,11,...
since sessions have not been allocated ‘overlapping’ sequence values. In a RAC environment the ORDER clause can be added to the sequence definition, which should produce ordered values across sessions, although doing so largely negates the benefit of caching. Even then, failed inserts and rollbacks cause sequences to ‘lose’ values, creating gaps. So unless this is a single-user system there is no real possibility of truly sequential, gap-free primary key values.
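As a rough sketch of the kind of sequence described above (the sequence and table names here are illustrative only and are not part of the test case that follows):

-- Illustrative only: a cached sequence of the sort described above
create sequence my_key_seq
  start with 1
  increment by 1
  cache 20;

-- Each insert draws the next value; cached ranges, rollbacks and
-- instance restarts all leave gaps in the resulting key values
insert into some_table (my_key_num) values (my_key_seq.nextval);

-- In RAC, ORDER can be added, at the cost of cross-instance coordination:
-- create sequence my_key_seq start with 1 increment by 1 cache 20 order;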
Building and populating a table based on the above description is where this example begins:
SQL>
SQL> create table pk_hist_test
2 as
3 with testdata as (
4 select --+ materialize
5 rownum myid
6 from dual
7 connect by
8 level <= 1e4
9 )
10 select
11 rownum myid
12 from
13 testdata v1,
14 testdata v2
15 where
16 rownum <= 1e4
17 ;
Table created.
SQL>
SQL>
SQL> column max_id_1 new_value max1
SQL> column max_id_2 new_value max2
SQL> column max_id_3 new_value max3
SQL>
SQL> select min(myid), max(myid) max_id_1
2 from pk_hist_test;
MIN(MYID) MAX_ID_1
---------- ----------
1 10000
SQL>
The table has 10,000 rows in numeric order and no gaps. Let’s add data, but put a large gap in the key sequence:
SQL> insert into pk_hist_test select myid + 1e6 from pk_hist_test;
10000 rows created.
SQL>
SQL> select min(myid), max(myid) max_id_2
2 from pk_hist_test;
MIN(MYID) MAX_ID_2
---------- ----------
1 1010000
SQL>
Let’s do that again:
SQL>
SQL> insert into pk_hist_test select myid + 1e7 from pk_hist_test;
20000 rows created.
SQL>
SQL> select min(myid), max(myid) max_id_3
2 from pk_hist_test;
MIN(MYID) MAX_ID_3
---------- ----------
1 11010000
SQL>
Add a primary key to the table:
SQL>
SQL> alter table pk_hist_test add constraint pk_hist_test_pk primary key(myid);
Table altered.
SQL>
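As a quick side check (a sketch; since no existing index was named, Oracle creates a unique index whose system-assigned name matches the constraint name), the index that enforces the primary key can be confirmed:

select index_name, uniqueness, num_rows
  from user_indexes
 where table_name = 'PK_HIST_TEST';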
Basic column statistics already exist on the data (in 12c and later a CREATE TABLE AS SELECT gathers them automatically); notice that they reflect only the original 10,000-row load:
SQL>
SQL> column column_name format a35
SQL> set linesize 150
SQL>
SQL> select
2 column_name, sample_size,
3 num_distinct, num_nulls, density,
4 histogram, num_buckets
5 from
6 user_tab_cols
7 where
8 table_name = 'PK_HIST_TEST'
9 order by
10 column_name
11 ;
COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM NUM_BUCKETS
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------
MYID 10000 10000 0 .0001 NONE 1
SQL>
Notice that no histogram exists. Query the data to see what Oracle predicts for the number of returned rows:
SQL>
SQL> set autotrace on
SQL> select
2 *
3 from pk_hist_test
4 where
5 myid between 10001 and 1010017
6 ;
MYID
----------
1000001
1000002
1000003
1000004
1000005
1000006
1000007
1000008
1000009
1000010
1000011
...
1009998
1009999
1010000
10000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1838269289
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 4 | 2 (0)| 00:00:01 |
|* 1 | INDEX RANGE SCAN| PK_HIST_TEST_PK | 1 | 4 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - access("MYID">=10001 AND "MYID"<=1010017)
Statistics
----------------------------------------------------------
17 recursive calls
0 db block gets
703 consistent gets
23 physical reads
0 redo size
199741 bytes sent via SQL*Net to client
7878 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
10000 rows processed
SQL> set autotrace off
The query returned 10,000 rows but the optimizer expected only 1 row; the existing statistics still describe the original 10,000-row load, so the requested range of 10,001 through 1,010,017 lies almost entirely beyond what the optimizer believes is the highest key value and is estimated to return essentially nothing. Now gather fresh statistics, letting Oracle create a histogram on the data, and try the same query again:
SQL>
SQL> begin
2 dbms_stats.gather_table_stats(
3 ownname => user,
4 tabname =>'PK_HIST_TEST',
5 estimate_percent => null
6 );
7 end;
8 /
PL/SQL procedure successfully completed.
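Note that this call relies on the default METHOD_OPT (normally ‘FOR ALL COLUMNS SIZE AUTO’), so Oracle decides to build a histogram on MYID based on the column usage recorded by the earlier query. If you wanted to request a histogram on the key column explicitly, a sketch of that call (the bucket count here is chosen purely for illustration) would look like:

begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'PK_HIST_TEST',
    method_opt => 'for columns myid size 254'
  );
end;
/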
SQL>
SQL> set autotrace on
SQL> select
2 *
3 from pk_hist_test
4 where
5 myid between 10001 and 1010017
6 ;
MYID
----------
1000001
1000002
1000003
1000004
1000005
1000006
1000007
1000008
1000009
1000010
1000011
...
1010000
10000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9960 | 49800 | 20 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| PK_HIST_TEST | 9960 | 49800 | 20 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("MYID"<=1010017 AND "MYID">=10001)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
690 consistent gets
0 physical reads
0 redo size
199741 bytes sent via SQL*Net to client
7878 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
10000 rows processed
SQL> set autotrace off
The histogram switched the query from an index range scan to a full table scan due to the volume of data returned. Notice also the much better estimate of the number of rows returned by the query, an estimate aided by the existence of the histogram. Looking at the column statistics now we see that a HYBRID histogram exists on the data, created because the number of distinct values (40,000) far exceeds the number of available buckets (254), which rules out a frequency histogram:
SQL>
SQL> column column_name format a35
SQL> set linesize 150
SQL>
SQL> select
2 column_name, sample_size,
3 num_distinct, num_nulls, density,
4 histogram, num_buckets
5 from
6 user_tab_cols
7 where
8 table_name = 'PK_HIST_TEST'
9 order by
10 column_name
11 ;
COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT NUM_NULLS DENSITY HISTOGRAM NUM_BUCKETS
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------
MYID 40000 40000 0 .000025 HYBRID 254
SQL>
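The histogram itself can be inspected through USER_TAB_HISTOGRAMS; a quick sketch (output omitted here), where ENDPOINT_REPEAT_COUNT is the column populated for hybrid histograms in 12c and later:

select endpoint_number, endpoint_value, endpoint_repeat_count
  from user_tab_histograms
 where table_name  = 'PK_HIST_TEST'
   and column_name = 'MYID'
 order by endpoint_number;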
Contrary to popular belief, creating a histogram on a primary key is not such an outlandish thought. Even though the index access was ‘lost’, the resulting plan is better suited to the data, since a large portion of the table is returned. Also contrary to popular belief, table scans are not ‘evil’; they can reduce the overall I/O when large data volumes are involved.
Primary keys, even though they are composed of unique, non-null data, can benefit from a histogram. It pays to remember that unless your data lives in a vacuum (which it doesn’t), gaps and clusters in primary key data will occur, and a histogram helps the optimizer by providing a more accurate view of the data distribution within the key. That leads to better execution plans and possibly faster data access, which should make both the DBA team and the end users happy.