Synopsis. Oracle 10g Release 2 (10gR2) has
improved significantly the methodology for tracking performance metrics within
the database. This article – the first in this series – discusses how adaptive
thresholds are designed to improve the detection of threshold violations,
including the ability to discern a false positive from a true warning.
I first started taking advantage of Oracle Enterprise
Manager (OEM) events in Oracle 9iR2. I had a good reason, too: One of my
client’s applications was constantly malfunctioning and would regularly begin
to use up almost all of the database’s available Program Global Area (PGA)
memory. As a result, other user sessions would have virtually no additional
memory to perform sorting operations or, in a few extreme cases, block any
further connections to the database.
Our team activated event tracking for per-session PGA memory
usage by creating a custom user-defined event that Oracle 9i OEM
monitored once every five minutes. We created a special package that contained
various functions for monitoring which user sessions were using more than 30MB
of PGA memory for longer than two monitoring periods (i.e. 10 minutes).
Whenever OEM detected that a single application session was using excessive
amounts of memory, it automatically sent an e-mail to our DBA team so that we
could determine what the problem was.
This “distant early warning” system helped us immeasurably
because we could respond to excessive PGA utilization in a reasonably timely
fashion. However, after a few months of OEM monitoring, I noticed two
interesting patterns:
The upper thresholds for the warning and critical
conditions had to be constantly adjusted. As the load on our database
server gradually increased over time, our initial warning target of 30MB for
per-session PGA utilization tended to be violated more and more often, and we
would constantly receive warning messages. As a result, we had to readjust the
warning threshold limit gradually so that we’d only get warned about
significant events.
“False positives” tended to be ignored over time.
When we turned on notification for the thresholds, we had a natural tendency to
pay strict attention to every warning we received. After a while, however, I
noticed that human nature kicked in: we grew used to receiving warning
messages, and we began to treat them a bit too lackadaisically. You can guess
the result: Once in a while, we’d simply put off following up on a warning or
critical alert, and by the time we investigated, it was too late to find out
the cause of the alert.
Oracle 10g Metrics Collection and Interpretation: A New Performance Model
This story is a prime example of how Oracle releases prior
to Oracle 10g handled the detection of potential problems in the database.
Since Oracle 10g has greatly expanded the number and types of wait events
from about 290 in prior releases to over 800, it makes it that much simpler to
isolate the exact causes of poor database performance. In fact, the increased
granularity of these wait events is crucial to the success of the automatic
advisor features found in Automatic Database Diagnostic Monitor (ADDM) and its
related Advisor framework.
Oracle 10g also improved the concept of database
thresholds because a majority of the issues that cause Oracle DBAs the most
grief are being monitored automatically right “out of the box.” For example,
Oracle 10g automatically monitors tablespace space utilization and will issue a
warning or critical message when the tablespace’s size has reached a threshold
of 85% or 97%, respectively. Other conditions monitored include the ratio of
the time the database is spending in a wait state vs. actually performing work,
the average number of seconds it takes to complete a transaction (i.e. response
time), and how much redo each DML operation is generating.
Another nice feature of Oracle 10g Release 1 (10gR1) is the
ability to create a performance baseline from metrics gathered over the
past days, weeks and months. Once I’ve established and saved such a baseline, I
can easily instruct Oracle 10gR1 to build thresholds based upon the highest
value for a particular metric (for example, the number of user commits per
second). I can also tell Oracle 10gR1 to build thresholds based on an
additional percentage beyond a baseline value – say, 125% of the number
of user commits per second for the baseline time period.
Even with these improved detection methods, however, Oracle
10gR1 still has two slight drawbacks. First, since it relies upon strict arithmetic
methods for detecting a threshold violation, Oracle 10gR1 can only notify me
when a threshold has been breached. The current value of user commits per
second may be hovering very near but just below the warning threshold value for
several hours, and I would never know that a potential problem is brewing.
Also, when a warning threshold is constantly being breached, I really have only
two choices: ignore the warning as a likely “false positive,” or
continually raise the threshold until I no longer receive the warning.
This conundrum illustrates that I really need to know two
different things whenever a threshold is breached:
-
If a current metric were compared to a threshold as of a
corresponding point in time, would it even register as a problem? For example,
a value of 300 user commits / second may be entirely normal during my
database’s nightly batch processing time frame between 7 PM and 10 PM, but
highly out of the ordinary between the hours 10 AM and 2 PM, when my online
transaction processing users are performing order entry tasks. To make this
even more complex, I may need to know this variance on an hour-by-hour
comparison for some metrics, while for others I’d like to know the variance
based on a day-by-day (e.g. Monday vs. Tuesday) or weekday vs.
weekend comparison. -
Is the current level of this metric truly significant?
Continuing the same scenario, do I really need (or desire!) to be notified
whenever the number of user commits per second barely exceeds some
relatively arbitrary threshold? And just as important, wouldn’t I want to know
when a metric value is just below that same threshold for a significant
period of time?
Oracle 10gR2 solves these issues with a new concept called adaptive
thresholds. In a nutshell, an adaptive threshold measures the difference
between the current value for a metric and its corresponding
threshold as of a particular, yet relative, point in time. Once
Oracle determines that a threshold breach has occurred, it also measures the significance
of the breach in relation to the threshold as a deflection from a nominal
value.
Establishing Metric Baselines
Adaptive thresholds rely upon metric baselines created from
representative samples of metrics taken over a time frame. Oracle will only
allow the creation of metric baselines if the data collected has enough
significance, but this can be resolved by altering the time frames or
granularity of the metrics collected. Metric baselines are further divided into
two types: moving window baseline periods and static baseline periods.
Moving Window Baseline Periods. As its name implies,
a moving window baseline period is based upon a rolling window of time that
extends a minimum of seven days into the past. Oracle also allows the creation
of moving window baseline periods that extend 7, 21, 35, and even 90 days into
the past. These baselines are most appropriate when the system or application
whose performance I am measuring have cyclical workloads that are relatively
predictable.
For example, a hybrid database typically has well-defined
activity cycles. Online transaction processing (OLTP) activity may occur during
the morning through the early evening, followed by batch processing during the
evening, and then followed by exports and backups in the early morning hours.
In this case, the database’s performance will follow a regular “heartbeat”
pattern that repeats every business day, while the weekday vs. weekend
performance pattern may be dramatically different.
Static Baseline Periods. Static baseline periods, on
the other hand, consist of a well-defined time period for which I want to
gather and preserve statistics for later comparison. I can create a static
baseline period based on any two points in time – as long as sufficient data
for the metrics exist. For example, I may want to capture the performance of my
database during the last seven to ten days of the previous calendar quarter as
a static baseline so that I can compare that performance during the upcoming
quarter-ending period, thus monitoring the database for any significant
deflections from the norm.
Time Groups. Regardless of the type of baseline
period selected, Oracle 10gR2 permits me to group data within different time
periods known as time groups. This insures that thresholds have
sufficient significant data for metric baseline creation. Time groups are
broken down into several different patterns as illustrated in Table 1.
Once metric baselines data have been gathered, they can be distributed within a
combination of these daily and/or weekly grouping schemes to insure that there
is enough significance for applying adaptive threshold monitoring.
Table 1. Time Groups |
|
Daily Groupings Available: |
|
Time Grouping |
Statistics Grouped Within |
Hour of Day |
Metrics are aggregated |
Day and Night |
Metrics are aggregated |
All Hours |
Metrics are aggregated |
Weekly Groupings Available: |
|
Time Group |
Statistics Grouped Within |
Day of Week |
Metrics are aggregated |
Weekdays and Weekends |
Metrics are aggregated |
All Days |
Metrics are aggregated for |
Adaptive Thresholds: It’s All About the Deflection
Once I have set up the appropriate baseline periods and
defined the appropriate time group to insure sufficient significance, I can set
adaptive thresholds based on one of two measurement methodologies: percent
of maximum thresholds or significance level thresholds.
Significance Level Thresholds. These are percentile-based
and measure the deflection from the nominal value based upon one of four
possible levels of deviation from the norm. A significance level threshold
therefore helps to eliminate false positives while highlighting those truly
significant variances from an expected value. Table 2 lists the
available significance levels:
Table 2. Significance Level Thresholds |
||
Significance |
Percentile |
Occurrences |
High |
0.95 |
5 in 100 |
Very High |
0.99 |
1 in 100 |
Severe |
0.999 |
1 in 1,000 |
Extreme |
0.9999 |
1 in 10,000 |
Percent of Maximum Thresholds. This type of threshold
setting uses a simple arithmetic calculation — the percentage of deflection
based on the value at the 99th percentile for the selected metric
baseline – to determine if a threshold has been violated significantly.
Tracked Thresholds. Of the several dozen metrics that
Oracle 10g provides across the entire database, it is important to note that
Oracle 10gR2 only provides the ability to track adaptive thresholds for 15 of
the most meaningful, broken down across three areas of focus:
-
Performance metrics track three indicators of the amount
of effort the database is expending to provide the current level of
performance. -
Workload Volume metrics focus on eight of the most
meaningful indicators of how much work the database is performing. -
Workload Type metrics depend on four indicators that
describe how the workload is distributed within the database’s overall
performance.
Table 3 lists the 15 metrics that support adaptive
thresholds, the rates and ratios being tracked, and what each metric indicates:
Table 3. Adaptive Thresholds Tracked |
||
Performance Metrics |
||
Threshold |
Rate |
Description |
Database Time |
Centiseconds / Second |
The amount of time the |
Response Time |
Per Transaction |
How much time it takes to |
System Response Time |
Per Second |
How well the system is |
Workload Volume Metrics |
||
Threshold |
Rate |
Description |
Executions |
Per Second |
How many statements have |
Redo Generated |
Per Second |
How much redo transaction |
Network Bytes |
Per Second |
How many bytes are being |
Physical Writes |
Per Second |
How many physical block |
Physical Reads |
Per Second |
How many physical block |
Current Logons |
Count |
How many user sessions have |
User Calls |
Per Second |
How many user calls have |
Number of Transactions |
Per Second |
How many transactions have |
Workload Type Metrics |
||
Threshold |
Rate |
Description |
Database Blocks Changed |
Per Transaction |
How many database blocks |
Enqueue Requests |
Centiseconds |
How long a transaction has |
Total Parses |
Per Transaction |
How many times a SQL |
Session Logical Reads |
Per Transaction |
The average number of blocks |
Next Steps
In Part 2 of this series, I’ll provide actual examples of
how to create metric baselines using both static metrics and rolling time
periods and then demonstrate how to configure adaptive thresholds using these
metric baselines so that you can elevate your Oracle 10gR2 database’s warning
and critical thresholds to new heights of sensitivity and meaningfulness.
References and Additional Reading
Even though I have hopefully provided enough technical
information in this article to encourage you to explore with these features, I
also strongly suggest that you first review the corresponding detailed Oracle
documentation before proceeding with any experiments. Actual implementation of
these features should commence only after a crystal-clear understanding exists.
Please note that I have drawn upon the following Oracle 10gR2 documentation for
the deeper technical details of this article:
B14214-01 Oracle 10gR2
Database New Features Guide
B14231-01 Oracle 10gR2
Database Administrator’s Guide
B14258-01 Oracle 10gR2 PL/SQL
Packages and Types Reference