Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Tips
» Database Forum
» Slideshows
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted May 30, 2017

WEBINAR: On-demand webcast

How to Boost Database Development Productivity on Linux, Docker, and Kubernetes with Microsoft SQL Server 2017 REGISTER >

Oracle's Histograms: Bane or Boon?

By David Fitzjarrell

Histograms, in Oracle or most any other relational database, may be one of the most misunderstood and as a result, underused features. Tales abound of performance problems that histograms create by changing once-good plans to terribly inefficient and wasteful resource hogs. Honestly when the histogram data accurately reflects the data distribution nothing could be further from the truth. Let's explore histograms and what they can do for a query.

By default, the Oracle optimizer believes, rightly or wrongly, that your data is evenly distributed across keys. There are occasions when such assumptions are accurate, yet there are also occasions when those assumptions aren't and I expect that the majority of the cases fall into that second category. Using a simple example let's show what can happen with and without histograms. In this simple case, a modern feature of Oracle, dynamic sampling, has been turned off to provide a more 'level' playing field and to generate plans older versions of Oracle generated for this same example (the example has been adapted from work done with Oracle 9iR2 some years ago; turning off dynamic sampling brings us back, in a limited form, to the behavior of that optimizer version). Let's 'dig in' to how the optimizer looks at your data and how that can be modified using histograms.

As stated earlier the Oracle optimizer bases what it does upon the assumption of your data having a perfectly even distribution across all of the desired keys. Since that is not likely to be an accurate description of the data distribution, the optimizer will get this wrong and may choose a full table scan when an index would be appropriate, or choose an index scan when that actually creates more work for the database. Histograms address this by providing a more accurate image of how the data is distributed. Data ranges are grouped into units called 'buckets', and how many of those get generated depends on the data and the type of histogram created. By default, DBMS_STATS computes single-bucket histograms with the lowest endpoint value being the minimum value in the column and the highest endpoint value being the column maximum, so that the rest of the data falls between these two values. Imagine having 10,000 marbles, 9 being unique in color and 9,991 being blue. Now put all of those marbles into one bucket and see how easy it is to find, say, a pink marble. It's a daunting task. With skewed data such as this a 'better' histogram is necessary. The beauty of this example is that even though we have 10,000 marbles we need only 10 'buckets' to accurately describe the data distribution.

Moving our marbles to the database let's create a table, TEST_TAB, with the following extremely descriptive column names:

Table TEST_TAB
Name                            Null? Type
------------------------------- ----- ---------
A                                     NUMBER(6)
B                                     NUMBER(6)

Now we insert into the table our skewed data set; Column A will contain distinct values from 1 to 10000, while Column B will contain 10 distinct values: 1, 2, 3, 4, 5, 9996, 9997, 9998, 9999 and 10000. The value 5 (our 'blue' marble) will occur 9991 times in the data; all other values will occur only once.

The following queries will be run against this data:

(1) select * from test_tab where b=5;

(2) select * from test_tab where b=3;

Let's see what plans the optimizer has for us:


SQL>
SQL> set autotrace on linesize 132
SQL>
SQL> select * from test_tab where b=5;

         A          B
---------- ----------
      2430          5
      2431          5
      2432          5
      2433          5
      2434          5
      2435          5
      2436          5
      2437          5
      2438          5
      2439          5
      2440          5
...
      9993          5
      9994          5
      9995          5

9991 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |     1 |    26 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_TAB |     1 |    26 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B"=5)


Statistics
----------------------------------------------------------
          6  recursive calls
          0  db block gets
        694  consistent gets
          0  physical reads
          0  redo size
     212182  bytes sent via SQL*Net to client
       7849  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9991  rows processed

SQL>
SQL> select * from test_tab where b=3;

         A          B
---------- ----------
         3          3


Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |     1 |    26 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_TAB |     1 |    26 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B"=3)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
         24  consistent gets
          0  physical reads
          0  redo size
        585  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off

Those results should be no surprise as no indexes have been created so the optimizer has no choice but to scan the table. Let's change that by creating an index on column B:


SQL>
SQL> create index test_tab_idx
  2  on test_tab(b)
  3  tablespace indx;

Index created.

SQL>

Checking USER_TAB_HISTOGRAMS we see there are no histograms on TEST_TAB, but there are also no statistics on that table, either:


SQL>
SQL> select table_name, column_name, endpoint_number, endpoint_value
  2  from user_tab_histograms
  3  where table_name = 'TEST_TAB';

no rows selected

SQL>

We have an index and no histograms, let's see what the optimizer does:


SQL>
SQL>
SQL> set autotrace on linesize 132
SQL>
SQL> select * from test_tab where b=5;

         A          B
---------- ----------
      2430          5
      2431          5
      2432          5
      2433          5
      2434          5
      2435          5
      2436          5
      2437          5
      2438          5
      2439          5
      2440          5
...
      9993          5
      9994          5
      9995          5

9991 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |    26 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |     1 |    26 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TEST_TAB_IDX |     1 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"=5)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       1371  consistent gets
         20  physical reads
          0  redo size
     244152  bytes sent via SQL*Net to client
       7849  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9991  rows processed

SQL>
SQL> select * from test_tab where b=3;

         A          B
---------- ----------
         3          3


Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |    26 |     1   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |     1 |    26 |     1   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TEST_TAB_IDX |     1 |       |     2   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"=3)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        589  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off

So now both tables are using the index, and that's not necessarily good as 99+% of the table data is associated with the value 5 in column B. Let's finally generate some statistics on TEST_TAB and create a frequency histogram while we're at it:


SQL>
SQL>
SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all indexed columns size auto', cascade=>TRUE);

PL/SQL procedure successfully completed.

SQL>
SQL> select table_name, column_name, endpoint_number, endpoint_value
  2  from user_tab_histograms
  3  where table_name = 'TEST_TAB';

TABLE_NAME                     COLUMN_NAME                         ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ ----------------------------------- --------------- --------------
TEST_TAB                       B                                                 1              1
TEST_TAB                       B                                                 2              2
TEST_TAB                       B                                                 3              3
TEST_TAB                       B                                                 4              4
TEST_TAB                       B                                              9995              5
TEST_TAB                       B                                              9996           9996
TEST_TAB                       B                                              9997           9997
TEST_TAB                       B                                              9998           9998
TEST_TAB                       B                                              9999           9999
TEST_TAB                       B                                             10000          10000

10 rows selected.

SQL>

Having a limited number of distinct values in the data set allows us to generate such a histogram and it's very beneficial to the optimizer as it describes the data far more accurately than the blind assumptions the optimizer made at the outset. Let's run the queries again and see how Oracle performs:


SQL>
SQL>
SQL> set autotrace on
SQL>
SQL> select * from test_tab where b=5;

         A          B
---------- ----------
      2430          5
      2431          5
      2432          5
      2433          5
      2434          5
      2435          5
      2436          5
      2437          5
      2438          5
      2439          5
      2440          5
...
      9993          5
      9994          5
      9995          5

9991 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          |  9991 | 69937 |     7   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_TAB |  9991 | 69937 |     7   (0)| 00:00:01 |
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B"=5)


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        689  consistent gets
          0  physical reads
          0  redo size
     212182  bytes sent via SQL*Net to client
       7849  bytes received via SQL*Net from client
        668  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       9991  rows processed

SQL>
SQL> select * from test_tab where b=3;

         A          B
---------- ----------
         3          3


Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153

--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |     1 |     7 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_TAB     |     1 |     7 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | TEST_TAB_IDX |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("B"=3)


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        589  bytes sent via SQL*Net to client
        523  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

SQL>
SQL> set autotrace off
SQL>

Remember that this example was 'tweaked' to stop dynamic sampling by Oracle; such a feature can provide accurate plans without histograms for simple data (when dynamic sampling was enabled the "b=5" query used a full table scan in the absence of generated statistics and histograms, but also remember that this is fairly simple data and modern production systems generate and query far more complex data sets; dynamic sampling will likely not be enough to generate 'good' execution plans consistently).

Because histograms can be a very important additional step when generating table statistics Oracle has expanded the types available to include, in 12.1.0.2 and later releases, the hybrid histogram. Since I've already written about that construct here I won't repeat myself. What this means is that histograms aren't 'tools of the Evil Empire designed to ensnare your data'; when they are understood, and have representative statistics to back them up they can be very beneficial.

See all articles by David Fitzjarrell



Oracle Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM