Histograms, in Oracle or most any other relational database, may be one of the most misunderstood and as a result, underused features. Tales abound of performance problems that histograms create by changing once-good plans to terribly inefficient and wasteful resource hogs. Honestly when the histogram data accurately reflects the data distribution nothing could be further from the truth. Let’s explore histograms and what they can do for a query.
By default, the Oracle optimizer believes, rightly or wrongly, that your data is evenly distributed across keys. There are occasions when such assumptions are accurate, yet there are also occasions when those assumptions aren’t and I expect that the majority of the cases fall into that second category. Using a simple example let’s show what can happen with and without histograms. In this simple case, a modern feature of Oracle, dynamic sampling, has been turned off to provide a more ‘level’ playing field and to generate plans older versions of Oracle generated for this same example (the example has been adapted from work done with Oracle 9iR2 some years ago; turning off dynamic sampling brings us back, in a limited form, to the behavior of that optimizer version). Let’s ‘dig in’ to how the optimizer looks at your data and how that can be modified using histograms.
As stated earlier the Oracle optimizer bases what it does upon the assumption of your data having a perfectly even distribution across all of the desired keys. Since that is not likely to be an accurate description of the data distribution, the optimizer will get this wrong and may choose a full table scan when an index would be appropriate, or choose an index scan when that actually creates more work for the database. Histograms address this by providing a more accurate image of how the data is distributed. Data ranges are grouped into units called ‘buckets’, and how many of those get generated depends on the data and the type of histogram created. By default, DBMS_STATS computes single-bucket histograms with the lowest endpoint value being the minimum value in the column and the highest endpoint value being the column maximum, so that the rest of the data falls between these two values. Imagine having 10,000 marbles, 9 being unique in color and 9,991 being blue. Now put all of those marbles into one bucket and see how easy it is to find, say, a pink marble. It’s a daunting task. With skewed data such as this a ‘better’ histogram is necessary. The beauty of this example is that even though we have 10,000 marbles we need only 10 ‘buckets’ to accurately describe the data distribution.
Moving our marbles to the database let’s create a table, TEST_TAB, with the following extremely descriptive column names:
Table TEST_TAB
Name Null? Type
------------------------------- ----- ---------
A NUMBER(6)
B NUMBER(6)
Now we insert into the table our skewed data set; Column A will contain distinct values from 1 to 10000, while Column B will contain 10 distinct values: 1, 2, 3, 4, 5, 9996, 9997, 9998, 9999 and 10000. The value 5 (our ‘blue’ marble) will occur 9991 times in the data; all other values will occur only once.
The following queries will be run against this data:
(1) select * from test_tab where b=5;
(2) select * from test_tab where b=3;
Let’s see what plans the optimizer has for us:
SQL>
SQL> set autotrace on linesize 132
SQL>
SQL> select * from test_tab where b=5;
A B
---------- ----------
2430 5
2431 5
2432 5
2433 5
2434 5
2435 5
2436 5
2437 5
2438 5
2439 5
2440 5
...
9993 5
9994 5
9995 5
9991 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 2 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_TAB | 1 | 26 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("B"=5)
Statistics
----------------------------------------------------------
6 recursive calls
0 db block gets
694 consistent gets
0 physical reads
0 redo size
212182 bytes sent via SQL*Net to client
7849 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9991 rows processed
SQL>
SQL> select * from test_tab where b=3;
A B
---------- ----------
3 3
Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 2 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_TAB | 1 | 26 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("B"=3)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
24 consistent gets
0 physical reads
0 redo size
585 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> set autotrace off
Those results should be no surprise as no indexes have been created so the optimizer has no choice but to scan the table. Let’s change that by creating an index on column B:
SQL>
SQL> create index test_tab_idx
2 on test_tab(b)
3 tablespace indx;
Index created.
SQL>
Checking USER_TAB_HISTOGRAMS we see there are no histograms on TEST_TAB, but there are also no statistics on that table, either:
SQL>
SQL> select table_name, column_name, endpoint_number, endpoint_value
2 from user_tab_histograms
3 where table_name = 'TEST_TAB';
no rows selected
SQL>
We have an index and no histograms, let’s see what the optimizer does:
SQL>
SQL>
SQL> set autotrace on linesize 132
SQL>
SQL> select * from test_tab where b=5;
A B
---------- ----------
2430 5
2431 5
2432 5
2433 5
2434 5
2435 5
2436 5
2437 5
2438 5
2439 5
2440 5
...
9993 5
9994 5
9995 5
9991 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST_TAB | 1 | 26 | 1 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TEST_TAB_IDX | 1 | | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"=5)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
1371 consistent gets
20 physical reads
0 redo size
244152 bytes sent via SQL*Net to client
7849 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9991 rows processed
SQL>
SQL> select * from test_tab where b=3;
A B
---------- ----------
3 3
Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST_TAB | 1 | 26 | 1 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TEST_TAB_IDX | 1 | | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"=3)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
589 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> set autotrace off
So now both tables are using the index, and that’s not necessarily good as 99+% of the table data is associated with the value 5 in column B. Let’s finally generate some statistics on TEST_TAB and create a frequency histogram while we’re at it:
SQL>
SQL>
SQL> exec dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'TEST_TAB', method_opt => 'for all indexed columns size auto', cascade=>TRUE);
PL/SQL procedure successfully completed.
SQL>
SQL> select table_name, column_name, endpoint_number, endpoint_value
2 from user_tab_histograms
3 where table_name = 'TEST_TAB';
TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ ----------------------------------- --------------- --------------
TEST_TAB B 1 1
TEST_TAB B 2 2
TEST_TAB B 3 3
TEST_TAB B 4 4
TEST_TAB B 9995 5
TEST_TAB B 9996 9996
TEST_TAB B 9997 9997
TEST_TAB B 9998 9998
TEST_TAB B 9999 9999
TEST_TAB B 10000 10000
10 rows selected.
SQL>
Having a limited number of distinct values in the data set allows us to generate such a histogram and it’s very beneficial to the optimizer as it describes the data far more accurately than the blind assumptions the optimizer made at the outset. Let’s run the queries again and see how Oracle performs:
SQL>
SQL>
SQL> set autotrace on
SQL>
SQL> select * from test_tab where b=5;
A B
---------- ----------
2430 5
2431 5
2432 5
2433 5
2434 5
2435 5
2436 5
2437 5
2438 5
2439 5
2440 5
...
9993 5
9994 5
9995 5
9991 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3962208483
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 9991 | 69937 | 7 (0)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TEST_TAB | 9991 | 69937 | 7 (0)| 00:00:01 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("B"=5)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
689 consistent gets
0 physical reads
0 redo size
212182 bytes sent via SQL*Net to client
7849 bytes received via SQL*Net from client
668 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
9991 rows processed
SQL>
SQL> select * from test_tab where b=3;
A B
---------- ----------
3 3
Execution Plan
----------------------------------------------------------
Plan hash value: 1487393153
--------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| TEST_TAB | 1 | 7 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | TEST_TAB_IDX | 1 | | 1 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("B"=3)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
4 consistent gets
0 physical reads
0 redo size
589 bytes sent via SQL*Net to client
523 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
SQL> set autotrace off
SQL>
Remember that this example was ‘tweaked’ to stop dynamic sampling by Oracle; such a feature can provide accurate plans without histograms for simple data (when dynamic sampling was enabled the “b=5” query used a full table scan in the absence of generated statistics and histograms, but also remember that this is fairly simple data and modern production systems generate and query far more complex data sets; dynamic sampling will likely not be enough to generate ‘good’ execution plans consistently).
Because histograms can be a very important additional step when generating table statistics Oracle has expanded the types available to include, in 12.1.0.2 and later releases, the hybrid histogram. Since I’ve already written about that construct here I won’t repeat myself. What this means is that histograms aren’t ‘tools of the Evil Empire designed to ensnare your data’; when they are understood, and have representative statistics to back them up they can be very beneficial.