Using Adaptive Thresholds to Monitor Oracle Database Performance

AWR baselines have been available since Oracle 10g; Oracle 11g enhances those baselines with new ways to generate them and new tricks in OEM that use adaptive thresholds to monitor the performance of the database. Starting off, let's look at how generating baselines has changed. The original 10g implementation still exists, with a new wrinkle: date/time values can now be used to identify the bounding snapshots instead of snap ids:

SQL> 
SQL> -- Create a static(fixed) baseline using the 10g and enhanced 11g methods
SQL> 
SQL> SET SERVEROUTPUT ON
SQL> DECLARE
  2    l_return  NUMBER;
  3  BEGIN
  4    -- Using procedures.
  5    DBMS_WORKLOAD_REPOSITORY.create_baseline(
  6  	 start_snap_id => 5000,
  7  	 end_snap_id   => 5001,
  8  	 baseline_name => 'test1_bl',
  9  	 expiration    => 60);
 10  
 11    DBMS_WORKLOAD_REPOSITORY.create_baseline(
 12  	 start_time    => TO_DATE('23-APR-2012 17:00', 'DD-MON-YYYY HH24:MI'),
 13  	 end_time      => TO_DATE('23-APR-2012 18:00', 'DD-MON-YYYY HH24:MI'),
 14  	 baseline_name => 'test2_bl',
 15  	 expiration    => NULL);
 16  
 17    -- Using functions.
 18    l_return := DBMS_WORKLOAD_REPOSITORY.create_baseline(
 19  		     start_snap_id => 5000,
 20  		     end_snap_id   => 5001,
 21  		     baseline_name => 'test3_bl',
 22  		     expiration    => 30);
 23    DBMS_OUTPUT.put_line('Return: ' || l_return);
 24  
 25    l_return := DBMS_WORKLOAD_REPOSITORY.create_baseline(
 26  		     start_time    => TO_DATE('23-APR-2012 17:00', 'DD-MON-YYYY HH24:MI'),
 27  		     end_time	   => TO_DATE('23-APR-2012 18:00', 'DD-MON-YYYY HH24:MI'),
 28  		     baseline_name => 'test4_bl',
 29  		     expiration    => NULL);
 30  
 31    DBMS_OUTPUT.put_line('Return: ' || l_return);
 32  END;
 33  /
Return: 31
Return: 32

PL/SQL procedure successfully completed.

SQL> 

Also available in 11g is the ability to create baseline templates:

SQL> 
SQL> -- Create a baseline template
SQL> 
SQL> BEGIN
  2    DBMS_WORKLOAD_REPOSITORY.create_baseline_template(
  3  	 start_time    => trunc(sysdate),
  4  	 end_time      => sysdate,
  5  	 baseline_name => '23_apr_012_00_05_bl',
  6  	 template_name => '23_apr_012_00_05_tp',
  7  	 expiration    => 100);
  8  END;
  9  /

PL/SQL procedure successfully completed.

SQL> 
SQL> -- Create a repeating baseline template
SQL> 
SQL> BEGIN
  2    DBMS_WORKLOAD_REPOSITORY.create_baseline_template(
  3  	day_of_week	     => 'MONDAY',
  4  	hour_in_day	     => 0,
  5  	duration	     => 5,
  6  	start_time	     => SYSDATE,
  7  	end_time	     => ADD_MONTHS(SYSDATE, 6),
  8  	baseline_name_prefix => 'monday_morning_bl_',
  9  	template_name	     => 'monday_morning_tp',
 10  	expiration	     => NULL);
 11  END;
 12  /

PL/SQL procedure successfully completed.

SQL> 

Creating such templates allows the DBA to prepare to capture baselines for future time periods. Repeating baseline templates also require some ‘scheduling’ information (the day of the week to start the baseline, the hour at which to start, the duration of the baseline window) as well as the overall starting time and ending time for the baseline series.
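
As a quick check that the templates were registered, they can be listed from the DBA_HIST_BASELINE_TEMPLATE view. The query below is only a minimal sketch; the column list assumes the standard 11g definition of the view and should be verified against your release:

-- List single and repeating baseline templates recorded in the AWR
-- (column names assume the standard 11g DBA_HIST_BASELINE_TEMPLATE view)
SELECT template_name,
       template_type,
       baseline_name_prefix,
       TO_CHAR(start_time, 'DD-MON-YYYY HH24:MI') AS start_time,
       TO_CHAR(end_time, 'DD-MON-YYYY HH24:MI')   AS end_time
FROM   dba_hist_baseline_template
ORDER BY template_name;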

Viewing the defined baselines is a simple task using the DBA_HIST_BASELINE view:

SQL> 
SQL> -- View created baselines
SQL> 
SQL> COLUMN baseline_name FORMAT A15
SQL> 
SQL> SELECT baseline_id, baseline_name, START_SNAP_ID,
  2  	    TO_CHAR(start_snap_time, 'DD-MON-YYYY HH24:MI') AS start_snap_time,
  3  	    END_SNAP_ID,
  4  	    TO_CHAR(end_snap_time, 'DD-MON-YYYY HH24:MI') AS end_snap_time
  5  FROM   dba_hist_baseline
  6  WHERE  baseline_type = 'STATIC'
  7  ORDER BY baseline_id;

BASELINE_ID BASELINE_NAME   START_SNAP_ID START_SNAP_TIME            END_SNAP_ID END_SNAP_TIME
----------- --------------- ------------- -------------------------- ----------- --------------------------
         29 test1_bl                 5000 20-APR-2012 01:45                 5001 20-APR-2012 02:00
         30 test2_bl                 5349 23-APR-2012 17:00                 5353 23-APR-2012 18:00
         31 test3_bl                 5000 20-APR-2012 01:45                 5001 20-APR-2012 02:00
         32 test4_bl                 5349 23-APR-2012 17:00                 5353 23-APR-2012 18:00

SQL> 

The real power of the enhancements is the definition of a moving window, allowing Oracle to automatically adjust metrics for the defined window duration and retention period. These settings are changed using two procedures in the DBMS_WORKLOAD_REPOSITORY package, MODIFY_SNAPSHOT_SETTINGS and MODIFY_BASELINE_WINDOW_SIZE; note that the moving window size cannot exceed the AWR retention, which is why the retention is increased first in the example below:

SQL> 
SQL> -- Get the current retention and window size
SQL> 
SQL> SELECT retention FROM dba_hist_wr_control;

RETENTION
---------------------------------------------------------------------------
+00008 00:00:00.0

SQL> 
SQL> SELECT moving_window_size
  2  FROM   dba_hist_baseline
  3  WHERE  baseline_type = 'MOVING_WINDOW';

MOVING_WINDOW_SIZE
------------------
                 8

SQL> 
SQL> -- Set the retention and window size to 30 days
SQL> 
SQL> BEGIN
  2    DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(
  3  	 retention => 43200);  -- Minutes (= 30 Days).
  4  END;
  5  /

PL/SQL procedure successfully completed.

SQL> 
SQL> BEGIN
  2    DBMS_WORKLOAD_REPOSITORY.modify_baseline_window_size(
  3  	 window_size => 30);
  4  END;
  5  /

PL/SQL procedure successfully completed.

SQL> 
SQL> -- Verify settings are changed
SQL> 
SQL> SELECT retention FROM dba_hist_wr_control;

RETENTION
---------------------------------------------------------------------------
+00030 00:00:00.0

SQL> 
SQL> SELECT moving_window_size
  2  FROM   dba_hist_baseline
  3  WHERE  baseline_type = 'MOVING_WINDOW';

MOVING_WINDOW_SIZE
------------------
                30

SQL> 

Oracle recommends a window size of at least 30 days when configuring adaptive thresholds, which is what the example above sets. The moving window metrics can be viewed using the pipelined DBMS_WORKLOAD_REPOSITORY function SELECT_BASELINE_METRIC, passing the baseline name SYSTEM_MOVING_WINDOW.
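
A minimal sketch of such a query follows; METRIC_NAME, MINIMUM, AVERAGE and MAXIMUM are assumed here as the columns of interest returned by the pipelined function and should be confirmed against your release:

-- Summary statistics for each metric captured over the moving window baseline
SELECT metric_name,
       minimum,
       average,
       maximum
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.select_baseline_metric('SYSTEM_MOVING_WINDOW'))
ORDER BY metric_name;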

Using the above query, the window can be monitored with a script if necessary, although the preferred method is to set adaptive thresholds in OEM and let Enterprise Manager do the work. In 11g there are three types of thresholds: fixed value (such as 80% of CPU consumed), percent of maximum (which is recalculated as the window moves) and significance level. The third is probably the least documented, so let's add to the pool of information and try to explain it here.

The basic idea behind using significance level thresholds for alerting is that Oracle is attempting to detect outliers (significant variations) in the distribution of metric values. Using the historical metric data from AWR, Oracle can identify values for the 25th, 50th, 75th, 90th, 95th and 99th percentiles; a curve-fitting algorithm is then employed to extrapolate the 99.9th and 99.99th percentiles from the values observed over time (hence the moving window). This is where dynamic baselines (as I call them) come into play: a dynamic baseline is generated by the moving window, since the base for that window shifts over time and alters the metric averages.
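
To get a feel for what those percentiles look like on a real system, a similar calculation can be approximated by hand from the raw AWR metric history. This is only an illustrative sketch, not what Oracle runs internally; the view, the metric name and the 30-day lookback are assumptions chosen to mirror the moving window configured earlier:

-- Approximate the percentile spread of one metric over the last 30 days
-- (illustration only; adaptive thresholds also curve-fit the extreme tails)
SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS pct25,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY value) AS pct50,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS pct75,
       PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY value) AS pct90,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) AS pct95,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) AS pct99
FROM   dba_hist_sysmetric_history
WHERE  metric_name = 'Host CPU Utilization (%)'
AND    begin_time  > SYSDATE - 30;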

The user will specify the alert level, which equates to one of these percentile values:

  • High: 95th percentile
  • Very High: 99th percentile
  • Severe: 99.9th percentile
  • Extreme: 99.99th percentile

Using the SYSTEM_MOVING_WINDOW baseline, Oracle will automatically determine the threshold level for a metric corresponding to the selected significance level for the current period. If you set a significance level of Extreme, Oracle would generate an alert on values that would only be expected to be seen once in every 10000 observations (approximately once a year for hourly thresholds).

Significance level thresholds are designed to produce alert values for key metrics; values observed above the threshold are statistically unusual (i.e. significant) at the Nth percentile, based on the actual data observed for the metric over the SYSTEM_MOVING_WINDOW baseline. The alert level sets N according to the ‘severity’ chosen (listed above). Systems with relatively stable performance should show stability, statistically speaking, in their core performance metrics; when rare, high-impact events occur, they will be reflected in highly unusual observations. Significance level thresholds allow the DBA to specify how alerts are generated in terms of “how unusual” the deviations are.

Once set, thresholds only change when the DBA chooses to change them; setting such thresholds requires knowledge of the metrics, the data, the application that accesses that data, and an awareness of the normal workload. Of course, expectations and thresholds vary with the application, so an intimate knowledge of these values is essential in knowing where to set the significance. Installations also vary with time (data skew changes, activity increases, sudden bursts of activity occur), and such changes can affect the metrics and, as a result, the thresholds, so careful monitoring of these changes is necessary if the alerts are to remain useful and meaningful. These are the areas adaptive thresholds attempt to address; the methodology is as follows:

A) Thresholds are computed by the system based upon prior observations of the metric. Basically speaking, Oracle takes the guesswork (and manual analysis) out of the picture by automatically recomputing the metric statistics over time. With manually set thresholds, the triggering values do not change without intervention, so a greater frequency of alerts is a good sign the usage of the system has changed, alerting the DBA to consider new settings that once again sit outside the current normal range.

B) Thresholds are periodically recalculated with respect to the SYSTEM_MOVING_WINDOW baseline. The thresholds adapt to slowly evolving workload or demand.

C) Metric statistics for adaptive thresholds are computed using grouping “buckets” over the common workload periods (day/night, weekday/weekend, etc.). Threshold resets can happen as frequently as every hour, depending on the load and demand.

In a nutshell, Oracle 11g determines, and sets, metric alert thresholds automatically based upon actual observations of the system relative to the alert levels chosen by the DBA. With a good knowledge of the system the DBA can set reasonable levels by which to generate alerts.
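
Outside of OEM, a quick way to see which warning and critical thresholds are currently in effect is to query DBA_THRESHOLDS; this is a simple sketch assuming the standard column names:

-- Review the metric thresholds currently configured for the instance
SELECT object_type,
       metrics_name,
       warning_operator,
       warning_value,
       critical_operator,
       critical_value
FROM   dba_thresholds
ORDER BY metrics_name;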

Whether the DBA uses a fixed value methodology, takes a percent-of-maximum tack or chooses to set a significance level, Oracle’s adaptive thresholds can make maintaining the performance of a database system easier by automatically recalculating the metric values over a moving window (which behaves much like a moving average). As the window shifts and demand increases, the baseline also changes, giving a new reference point for the increased performance metrics and keeping the overall picture consistent with the data, making it easier to determine where the performance bottlenecks and issues are.

Adaptive thresholds are, in my mind, a very useful addition to 11g and well worth the time spent learning how to set and use them.

