dcsimg
Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
SQL Scripts & Samples
Tips
» Database Forum
» Slideshows
Free Newsletters:



News Via RSS Feed


Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted April 15, 2019

Single-column Primary Keys and Histograms in Oracle

By David Fitzjarrell

For every timeless adage there is at least one mistaken thought that can become 'truth' simply by virtue of being repeated. Oracle is no exception to this phenomenon especially when single-column Primary Key indexes are involved. The conventional 'wisdom' discounts the use of histograms on such an index as unnecessary, and this presumes that the primary key has no 'holes' and has a continuous range of data. Very often reality flies in the face of that assumption; even with triggers and sequences, it is highly unlikely to find an unbroken, gap-free data set where that primary key is concerned. Absent a histogram, Oracle is left to its own devices and presumptions which can take an execution path 'into the weeds'. Let's look at what Oracle expects to find, what Oracle can find and how a histogram can help.

Oracle expects data sets to be uniform in distribution when no other information is available; create a table, load some data, leave off indexes and statistics, and Oracle will generate execution plans based on the assumption of uniform data distribution — keys are not clustered, have no gaps or holes and just generally 'play nice'. This, of course, does not reflect the real world which is why Oracle provides ways to generate statistics on table data with and without histograms. In the unique case of a single-column, single-valued key a histogram is likely not necessary, but in many other cases involving single-column keys, especially primary keys, histograms may provide tremendous help in generating a reasonable execution plan. Three of the more 'popular' histogram types are frequency, height-balanced and hybrid, the latter available in versions 12.x and later. Histograms basically draw a 'picture' of the data that Oracle can use to better assess execution steps and produce acceptable plans. Since examples are usually good let's build one for this situation and see how query execution changes based on changing information.

Histograms are based, basically, on data distribution. Oracle records such statistical data as it scans through the table so it can be used later to determine the best route of data access. If a table is created and populated with 1,000 rows containing random numeric values the resulting 'map' should produce no 'hot spots' where one or a handful of values dominate the data distribution. Repeating values in a key will behave differently, as that same 1,000 rows populated with a limited set of values (1 through 22, for example) is guaranteed to produce repeating, non-unique keys. Now let's look at how a primary key is defined and why a histogram could be very helpful. A primary key is:

  • Unique
  • Has no NULL values

which guarantees that each and every key will be populated with usable data. What it does NOT guarantee is that the set of key values:

  • Is sequential
  • Is gapless
  • Has a uniform difference between adjacent key values

That last statement gives rise to the condition of key clustering, groups of values that produce small differences within the group but may experience large 'jumps' between adjacent sets of grouped values, making the data anything but uniformly distributed across the range of key values. For example, if in our example table the key is defined as my_key_num, a number, a very distinct possibility exists that gaps (holes) can be created even when no records are deleted simply due to the nature of how a cached sequence behaves. For this example, a sequence is created that starts with the number 1, increases by 1 for each request for a new value and each instantiation of the sequence (one for each user session using that sequence) caches 20 values. Now presume that 12 user sessions are using that sequence; the first session to fetch data from that sequences fetches the first 20 values, the next session in line allocates the next 20 values, and so on until all 12 sessions have values cached. There is no possible way sequence values allocated in that manner will ever be 'in order', and the actual data may look more like:

	1,41,71,21,31,81,91,61,111,101,121...

than

	1,2,3,4,5,6,7,8,9,10,11,...

since sessions have not been allocated 'overlapping' sequence values. In a RAC environment the ORDERED directive can be included with the sequence that should produce ordered sequence values across sessions but that also requires that caching not be used. Even THEN, failed inserts and rollbacks cause sequences to 'lose' values, creating gaps. So unless this is a single-user system there is no possibility of truly sequential, gap free primary key values.

Building and populating a table based on the above description is where this example begins:


SQL>
SQL> create table pk_hist_test
  2  as
  3  with testdata as (
  4  	     select  --+ materialize
  5  		     rownum myid
  6  	     from dual
  7  	     connect by
  8  		     level <= 1e4
  9  )
 10  select
 11  	     rownum  myid
 12  from
 13  	     testdata	     v1,
 14  	     testdata	     v2
 15  where
 16  	     rownum <= 1e4
 17  ;

Table created.

SQL>
SQL>
SQL> column max_id_1 new_value max1
SQL> column max_id_2 new_value max2
SQL> column max_id_3 new_value max3
SQL>
SQL> select min(myid), max(myid) max_id_1
  2  from pk_hist_test;

 MIN(MYID)   MAX_ID_1
---------- ----------
         1      10000

SQL>

The table has 10,000 rows in numeric order and no gaps. Let's add data, but put a large gap in the key sequence:


SQL> insert into pk_hist_test select myid + 1e6 from pk_hist_test;

10000 rows created.

SQL>
SQL> select min(myid), max(myid) max_id_2
  2  from pk_hist_test;

 MIN(MYID)   MAX_ID_2
---------- ----------
         1    1010000

SQL>

Let's do that again:


SQL>
SQL> insert into pk_hist_test select myid + 1e7 from pk_hist_test;

20000 rows created.

SQL>
SQL> select min(myid), max(myid) max_id_3
  2  from pk_hist_test;

 MIN(MYID)   MAX_ID_3
---------- ----------
         1   11010000

SQL>

Add a primary key to the table:


SQL>
SQL> alter table pk_hist_test add constraint pk_hist_test_pk primary key(myid);

Table altered.

SQL>

Oracle has generated the following basic statistics on the data, based on the primary key index:


SQL>
SQL> column column_name format a35
SQL> set linesize 150
SQL>
SQL> select
  2  	     column_name, sample_size,
  3  	     num_distinct, num_nulls, density,
  4  	     histogram, num_buckets
  5  from
  6  	     user_tab_cols
  7  where
  8  	     table_name = 'PK_HIST_TEST'
  9  order by
 10  	     column_name
 11  ;

COLUMN_NAME                         SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------
MYID                                      10000        10000          0      .0001 NONE                      1

SQL>

Notice no histogram exists; querying the data to see what Oracle predicts for the number of returned rows:


SQL>
SQL> set autotrace on
SQL> select
  2  	     *
  3  from    pk_hist_test
  4  where
  5  	     myid between 10001 and 1010017
  6  ;

      MYID
----------
   1000001
   1000002
   1000003
   1000004
   1000005
   1000006
   1000007
   1000008
   1000009
   1000010
   1000011
...
   1009998
   1009999
   1010000

10000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1838269289

------------------------------------------------------------------------------------
| Id  | Operation        | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |                 |     1 |     4 |     2   (0)| 00:00:01 |
|*  1 |  INDEX RANGE SCAN| PK_HIST_TEST_PK |     1 |     4 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("MYID">=10001 AND "MYID"<=1010017)


Statistics
----------------------------------------------------------
         17  recursive calls
          0  db block gets
        703  consistent gets
         23  physical reads
          0  redo size
     199741  bytes sent via SQL*Net to client
       7878  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          2  sorts (memory)
          0  sorts (disk)
      10000  rows processed

SQL> set autotrace off

The query returned 10,000 rows but the optimizer only expected 1 row to be returned. Now create a histogram on the data and try the same query again:


SQL>
SQL> begin
  2  	     dbms_stats.gather_table_stats(
  3  		     ownname	      => user,
  4  		     tabname	      =>'PK_HIST_TEST',
  5  		     estimate_percent => null
  6  	     );
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL>
SQL> set autotrace on
SQL> select
  2  	     *
  3  from    pk_hist_test
  4  where
  5  	     myid between 10001 and 1010017
  6  ;

      MYID
----------
   1000001
   1000002
   1000003
   1000004
   1000005
   1000006
   1000007
   1000008
   1000009
   1000010
   1000011
...
   1010000

10000 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

------------------------------------------------------------------------------------
| Id  | Operation         | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                |  9960 | 49800 |    20   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| PK_HIST_TEST   |  9960 | 49800 |    20   (0)| 00:00:01 |
------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("MYID"<=1010017 AND "MYID">=10001)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        690  consistent gets
          0  physical reads
          0  redo size
     199741  bytes sent via SQL*Net to client
       7878  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
      10000  rows processed

SQL> set autotrace off

The histogram switched the query from an index scan to a full table scan due to the volume of data returned. Notice also the better estimate on the number of rows returned by the query; an estimate aided by the existence of the histogram. Looking at the statistics now we see that a HYBRID histogram exists on the data, created because the number of 'buckets' is far less than the number of distinct values:


SQL>
SQL> column column_name format a35
SQL> set linesize 150
SQL>
SQL> select
  2  	     column_name, sample_size,
  3  	     num_distinct, num_nulls, density,
  4  	     histogram, num_buckets
  5  from
  6  	     user_tab_cols
  7  where
  8  	     table_name = 'PK_HIST_TEST'
  9  order by
 10  	     column_name
 11  ;

COLUMN_NAME                         SAMPLE_SIZE NUM_DISTINCT  NUM_NULLS    DENSITY HISTOGRAM       NUM_BUCKETS
----------------------------------- ----------- ------------ ---------- ---------- --------------- -----------
MYID                                      40000        40000          0    .000025 HYBRID                  254

SQL>

Contrary to popular belief, creating a histogram on a primary key is not such an outlandish thought. Even though the index access was 'lost', we were expecting to see the plan is better suited to the data since a large portion of the table data is returned. Also contrary to popular belief — table scans are not 'evil' as they can reduce the overall I/O when large data volumes are involved.

Primary keys, even though they are comprised of unique, non-null data, can benefit from a histogram; it pays to remember that unless your data is in a vacuum (which it isn't) gaps and clusters in primary key data will occur and histograms can help the optimizer by providing a more accurate view of the data distribution within the key. This leads to better execution plans and possibly faster data access, which should make both the DBA team and the end-users happy.

See all articles by David Fitzjarrell



Oracle Archives




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM