Extended statistics can be very helpful to the optimizer by establishing a dependency between two or more columns in a table. Such dependencies can improve cardinality estimates the optimizer uses to determine the best plan of attack for a query. Extended statistics have been available since 11.2.0.x, and now Oracle 12.1.0.2 improves on that by generating extended statistics automatically. Using a feature called Automatic Column Group detection, Oracle can recognize column dependencies and create extended statistics on those recognized groups. Let’s look at how this can be done and what benefit it can provide.
Oracle provides a procedure in the DBMS_STATS package that tells the database to monitor column usage, including recognizing any column groups in use. The procedure is simple to run, requiring a SQL set name, an owner and a time limit, any or all of which can be NULL. If the sql set and owner are NULL then Oracle monitors all column usage. In this example a 5-minute time limit is set:
exec dbms_stats.seed_col_usage(null,null,300)
Let’s create a table with dependent columns; city/state/country set of columns so that state is dependent on country and city is dependent on state:
SQL>
SQL> --
SQL> -- Create tables and sequence
SQL> --
SQL> create sequence cust_seq start with 1 increment by 1 nocycle nomaxvalue nocache;
Sequence created.
SQL>
SQL> create table cust (
2 cust_id number not null,
3 cust_nm varchar2(80),
4 cust_city varchar2(50),
5 cust_state varchar2(2),
6 cust_country varchar2(2));
Table created.
SQL>
SQL> create table country(
2 country_nm varchar2(80),
3 country_cd2 varchar2(2),
4 country_cd3 varchar2(3));
Table created.
SQL>
Load data into the tables; for the COUNTRY table SQL*Loader is used, and for the CUST table a series of 29000+ inserts are executed (the output of the inserts is not shown for space reasons). Additionally primary and foreign key constraints were created to ensure referential integrity:
SQL>
SQL> --
SQL> -- Load country data
SQL> --
SQL> $sqlldr bing/#######@*********** country.ctl
SQL>
SQL> --
SQL> -- Create primary key and foreign key constraints
SQL> --
SQL> -- Ensures no invalid country codes are entered
SQL> --
SQL> alter table country add constraint country_pk primary key(country_cd2);
Table altered.
SQL>
SQL> alter table cust add constraint cust_country_fk foreign key(cust_country) references country(country_cd2);
Table altered.
SQL>
SQL> --
SQL> -- Load the customer data
SQL> --
SQL> @ins_cust_data
...
SQL>
An index was created on the CUST table to speed access to the data:
SQL>
SQL> --
SQL> -- Index CUST table for easier access
SQL> --
SQL> create index cust_idx on cust(cust_state, cust_country);
Index created.
SQL>
Statistics are gathered on the CUST table:
SQL>
SQL> --
SQL> -- Gather initial statistics
SQL> --
SQL> exec dbms_stats.gather_table_stats(user, 'CUST')
PL/SQL procedure successfully completed.
SQL>
Two queries are run that relate the cust_state column with the cust_country column. Plan statistics are gathered on the second run and the plan is displayed:
SQL>
SQL> --
SQL> -- Execute two queries to see how the optimizer
SQL> -- fares with the existing statistics
SQL> --
SQL> select count(*)
2 from cust
3 where cust_state = 'MT' and cust_country = 'US'
4 /
COUNT(*)
----------
315
SQL>
SQL> select /*+ gather_plan_statistics */ count(*)
2 from cust
3 where cust_state = 'MT' and cust_country = 'US'
4 /
COUNT(*)
----------
315
SQL>
SQL> --
SQL> -- Generate the plan to see how far off the estimate is
SQL> -- from the actual number of rows
SQL> --
SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last'));
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 5027bkwuzg1jz, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from cust where
cust_state = 'MT' and cust_country = 'US'
Plan hash value: 2062844691
----------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 3 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 3 |
|* 2 | INDEX RANGE SCAN| CUST_IDX | 1 | 578 | 315 |00:00:00.01 | 3 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("CUST_STATE"='MT' AND "CUST_COUNTRY"='US')
20 rows selected.
SQL>
With the current statistics the optimizer isn’t aware of the column grouping so the cardinality estimate is off by almost a factor of 2. Let’s ‘fix’ that by enabling column usage tracking across the database:
SQL>
SQL> --
SQL> -- Enable the automatic extended stats gathering by
SQL> -- recording column usage
SQL> --
SQL> exec dbms_stats.seed_col_usage(null,null,300)
PL/SQL procedure successfully completed.
SQL>
Run the same set of queries again:
SQL>
SQL> --
SQL> -- Run the same queries again
SQL> --
SQL> select count(*)
2 from cust
3 where cust_state = 'MT' and cust_country = 'US'
4 /
COUNT(*)
----------
315
SQL>
SQL> select /*+ gather_plan_statistics */ count(*)
2 from cust
3 where cust_state = 'MT' and cust_country = 'US'
4 /
COUNT(*)
----------
315
SQL>
Gather statistics again on the CUST table, and this time extended statistics will be generated:
SQL>
SQL> --
SQL> -- Generate current statistics
SQL> --
SQL> -- These will include extended statistics
SQL> -- for the recognized column groups
SQL> --
SQL> exec dbms_stats.gather_table_stats(user, 'CUST')
PL/SQL procedure successfully completed.
SQL>
Run the same set of queries one more time, and report the current plan:
SQL>
SQL> --
SQL> -- Run the queries one more time
SQL> --
SQL> select count(*)
2 from cust
3 where cust_state = 'MT' and cust_country = 'US'
4 /
COUNT(*)
----------
315
SQL>
SQL> select /*+ gather_plan_statistics */ count(*)
2 from cust
3 where cust_state = 'MT' and cust_country = 'US'
4 /
COUNT(*)
----------
315
SQL>
SQL> --
SQL> -- Current plan now shows the row estimate matches
SQL> -- the actual count, courtesy of automatic generation
SQL> -- of extended statistics
SQL> --
SQL> select * from table(dbms_xplan.display_cursor(format=>'allstats last'));
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------
SQL_ID 5027bkwuzg1jz, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ count(*) from cust where
cust_state = 'MT' and cust_country = 'US'
Plan hash value: 2062844691
----------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 3 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.01 | 3 |
|* 2 | INDEX RANGE SCAN| CUST_IDX | 1 | 315 | 315 |00:00:00.01 | 3 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("CUST_STATE"='MT' AND "CUST_COUNTRY"='US')
20 rows selected.
SQL>
The estimated rows and actual row values now match, courtesy of the extended statistics on the column grouping (cust_state, cust_country), improving the performance of the query since Oracle can now understand the relationship between these two columns. A similar benefit could be realized by using the cust_city and cust_state columns, and it could also be extended to the use of all three. Notice that other than ‘triggering’ Oracle to monitor column usage no additional work was required to generate the extended statistics or create column groups.
Automatically generating column groups so extended statistics can be gathered may not be applicable to every situation but it can provide better plans when related columns in a table exist. End users measure performance in time, and the less time spent on returning results makes them happier, I’ve found. And better plans mean less time in the database.