Oracle 10gR2 Adaptive Thresholds, Part 1: Overview
June 22, 2006
Synopsis. Oracle 10g Release 2 (10gR2) has improved significantly the methodology for tracking performance metrics within the database. This article the first in this series discusses how adaptive thresholds are designed to improve the detection of threshold violations, including the ability to discern a false positive from a true warning.
I first started taking advantage of Oracle Enterprise Manager (OEM) events in Oracle 9iR2. I had a good reason, too: One of my client's applications was constantly malfunctioning and would regularly begin to use up almost all of the database's available Program Global Area (PGA) memory. As a result, other user sessions would have virtually no additional memory to perform sorting operations or, in a few extreme cases, block any further connections to the database.
Our team activated event tracking for per-session PGA memory usage by creating a custom user-defined event that Oracle 9i OEM monitored once every five minutes. We created a special package that contained various functions for monitoring which user sessions were using more than 30MB of PGA memory for longer than two monitoring periods (i.e. 10 minutes). Whenever OEM detected that a single application session was using excessive amounts of memory, it automatically sent an e-mail to our DBA team so that we could determine what the problem was.
This "distant early warning" system helped us immeasurably because we could respond to excessive PGA utilization in a reasonably timely fashion. However, after a few months of OEM monitoring, I noticed two interesting patterns:
The upper thresholds for the warning and critical conditions had to be constantly adjusted. As the load on our database server gradually increased over time, our initial warning target of 30MB for per-session PGA utilization tended to be violated more and more often, and we would constantly receive warning messages. As a result, we had to readjust the warning threshold limit gradually so that we'd only get warned about significant events.
"False positives" tended to be ignored over time. When we turned on notification for the thresholds, we had a natural tendency to pay strict attention to every warning we received. After a while, however, I noticed that human nature kicked in: we grew used to receiving warning messages, and we began to treat them a bit too lackadaisically. You can guess the result: Once in a while, we'd simply put off following up on a warning or critical alert, and by the time we investigated, it was too late to find out the cause of the alert.
Oracle 10g Metrics Collection and Interpretation: A New Performance Model
This story is a prime example of how Oracle releases prior to Oracle 10g handled the detection of potential problems in the database. Since Oracle 10g has greatly expanded the number and types of wait events from about 290 in prior releases to over 800, it makes it that much simpler to isolate the exact causes of poor database performance. In fact, the increased granularity of these wait events is crucial to the success of the automatic advisor features found in Automatic Database Diagnostic Monitor (ADDM) and its related Advisor framework.
Oracle 10g also improved the concept of database thresholds because a majority of the issues that cause Oracle DBAs the most grief are being monitored automatically right "out of the box." For example, Oracle 10g automatically monitors tablespace space utilization and will issue a warning or critical message when the tablespace's size has reached a threshold of 85% or 97%, respectively. Other conditions monitored include the ratio of the time the database is spending in a wait state vs. actually performing work, the average number of seconds it takes to complete a transaction (i.e. response time), and how much redo each DML operation is generating.
Another nice feature of Oracle 10g Release 1 (10gR1) is the ability to create a performance baseline from metrics gathered over the past days, weeks and months. Once I've established and saved such a baseline, I can easily instruct Oracle 10gR1 to build thresholds based upon the highest value for a particular metric (for example, the number of user commits per second). I can also tell Oracle 10gR1 to build thresholds based on an additional percentage beyond a baseline value say, 125% of the number of user commits per second for the baseline time period.
Even with these improved detection methods, however, Oracle 10gR1 still has two slight drawbacks. First, since it relies upon strict arithmetic methods for detecting a threshold violation, Oracle 10gR1 can only notify me when a threshold has been breached. The current value of user commits per second may be hovering very near but just below the warning threshold value for several hours, and I would never know that a potential problem is brewing. Also, when a warning threshold is constantly being breached, I really have only two choices: ignore the warning as a likely "false positive," or continually raise the threshold until I no longer receive the warning.
This conundrum illustrates that I really need to know two different things whenever a threshold is breached:
Oracle 10gR2 solves these issues with a new concept called adaptive thresholds. In a nutshell, an adaptive threshold measures the difference between the current value for a metric and its corresponding threshold as of a particular, yet relative, point in time. Once Oracle determines that a threshold breach has occurred, it also measures the significance of the breach in relation to the threshold as a deflection from a nominal value.
Establishing Metric Baselines
Adaptive thresholds rely upon metric baselines created from representative samples of metrics taken over a time frame. Oracle will only allow the creation of metric baselines if the data collected has enough significance, but this can be resolved by altering the time frames or granularity of the metrics collected. Metric baselines are further divided into two types: moving window baseline periods and static baseline periods.
Moving Window Baseline Periods. As its name implies, a moving window baseline period is based upon a rolling window of time that extends a minimum of seven days into the past. Oracle also allows the creation of moving window baseline periods that extend 7, 21, 35, and even 90 days into the past. These baselines are most appropriate when the system or application whose performance I am measuring have cyclical workloads that are relatively predictable.
For example, a hybrid database typically has well-defined activity cycles. Online transaction processing (OLTP) activity may occur during the morning through the early evening, followed by batch processing during the evening, and then followed by exports and backups in the early morning hours. In this case, the database's performance will follow a regular "heartbeat" pattern that repeats every business day, while the weekday vs. weekend performance pattern may be dramatically different.
Static Baseline Periods. Static baseline periods, on the other hand, consist of a well-defined time period for which I want to gather and preserve statistics for later comparison. I can create a static baseline period based on any two points in time as long as sufficient data for the metrics exist. For example, I may want to capture the performance of my database during the last seven to ten days of the previous calendar quarter as a static baseline so that I can compare that performance during the upcoming quarter-ending period, thus monitoring the database for any significant deflections from the norm.
Time Groups. Regardless of the type of baseline period selected, Oracle 10gR2 permits me to group data within different time periods known as time groups. This insures that thresholds have sufficient significant data for metric baseline creation. Time groups are broken down into several different patterns as illustrated in Table 1. Once metric baselines data have been gathered, they can be distributed within a combination of these daily and/or weekly grouping schemes to insure that there is enough significance for applying adaptive threshold monitoring.
Adaptive Thresholds: It's All About the Deflection
Once I have set up the appropriate baseline periods and defined the appropriate time group to insure sufficient significance, I can set adaptive thresholds based on one of two measurement methodologies: percent of maximum thresholds or significance level thresholds.
Significance Level Thresholds. These are percentile-based and measure the deflection from the nominal value based upon one of four possible levels of deviation from the norm. A significance level threshold therefore helps to eliminate false positives while highlighting those truly significant variances from an expected value. Table 2 lists the available significance levels:
Percent of Maximum Thresholds. This type of threshold setting uses a simple arithmetic calculation -- the percentage of deflection based on the value at the 99th percentile for the selected metric baseline to determine if a threshold has been violated significantly.
Tracked Thresholds. Of the several dozen metrics that Oracle 10g provides across the entire database, it is important to note that Oracle 10gR2 only provides the ability to track adaptive thresholds for 15 of the most meaningful, broken down across three areas of focus:
Table 3 lists the 15 metrics that support adaptive thresholds, the rates and ratios being tracked, and what each metric indicates:
In Part 2 of this series, I'll provide actual examples of how to create metric baselines using both static metrics and rolling time periods and then demonstrate how to configure adaptive thresholds using these metric baselines so that you can elevate your Oracle 10gR2 database's warning and critical thresholds to new heights of sensitivity and meaningfulness.
References and Additional Reading
Even though I have hopefully provided enough technical information in this article to encourage you to explore with these features, I also strongly suggest that you first review the corresponding detailed Oracle documentation before proceeding with any experiments. Actual implementation of these features should commence only after a crystal-clear understanding exists. Please note that I have drawn upon the following Oracle 10gR2 documentation for the deeper technical details of this article:
B14214-01 Oracle 10gR2 Database New Features Guide
B14231-01 Oracle 10gR2 Database Administrator's Guide
B14258-01 Oracle 10gR2 PL/SQL Packages and Types Reference