Oracle 10gR2 Adaptive Thresholds, Part 1: Overview

Synopsis. Oracle 10g Release 2 (10gR2) has
improved significantly the methodology for tracking performance metrics within
the database. This article – the first in this series – discusses how adaptive
thresholds are designed to improve the detection of threshold violations,
including the ability to discern a false positive from a true warning.

I first started taking advantage of Oracle Enterprise
Manager (OEM) events in Oracle 9iR2. I had a good reason, too: One of my
client’s applications was constantly malfunctioning and would regularly begin
to use up almost all of the database’s available Program Global Area (PGA)
memory. As a result, other user sessions would have virtually no additional
memory to perform sorting operations or, in a few extreme cases, block any
further connections to the database.

Our team activated event tracking for per-session PGA memory
usage by creating a custom user-defined event that Oracle 9i OEM
monitored once every five minutes. We created a special package that contained
various functions for monitoring which user sessions were using more than 30MB
of PGA memory for longer than two monitoring periods (i.e. 10 minutes).
Whenever OEM detected that a single application session was using excessive
amounts of memory, it automatically sent an e-mail to our DBA team so that we
could determine what the problem was.

This “distant early warning” system helped us immeasurably
because we could respond to excessive PGA utilization in a reasonably timely
fashion. However, after a few months of OEM monitoring, I noticed two
interesting patterns:

The upper thresholds for the warning and critical
conditions had to be constantly adjusted.
As the load on our database
server gradually increased over time, our initial warning target of 30MB for
per-session PGA utilization tended to be violated more and more often, and we
would constantly receive warning messages. As a result, we had to readjust the
warning threshold limit gradually so that we’d only get warned about
significant events.

“False positives” tended to be ignored over time.
When we turned on notification for the thresholds, we had a natural tendency to
pay strict attention to every warning we received. After a while, however, I
noticed that human nature kicked in: we grew used to receiving warning
messages, and we began to treat them a bit too lackadaisically. You can guess
the result: Once in a while, we’d simply put off following up on a warning or
critical alert, and by the time we investigated, it was too late to find out
the cause of the alert.

Oracle 10g Metrics Collection and Interpretation: A New Performance Model

This story is a prime example of how Oracle releases prior
to Oracle 10g handled the detection of potential problems in the database.
Since Oracle 10g has greatly expanded the number and types of wait events
from about 290 in prior releases to over 800, it makes it that much simpler to
isolate the exact causes of poor database performance. In fact, the increased
granularity of these wait events is crucial to the success of the automatic
advisor features found in Automatic Database Diagnostic Monitor (ADDM) and its
related Advisor framework.

Oracle 10g also improved the concept of database
thresholds
because a majority of the issues that cause Oracle DBAs the most
grief are being monitored automatically right “out of the box.” For example,
Oracle 10g automatically monitors tablespace space utilization and will issue a
warning or critical message when the tablespace’s size has reached a threshold
of 85% or 97%, respectively. Other conditions monitored include the ratio of
the time the database is spending in a wait state vs. actually performing work,
the average number of seconds it takes to complete a transaction (i.e. response
time
), and how much redo each DML operation is generating.

Another nice feature of Oracle 10g Release 1 (10gR1) is the
ability to create a performance baseline from metrics gathered over the
past days, weeks and months. Once I’ve established and saved such a baseline, I
can easily instruct Oracle 10gR1 to build thresholds based upon the highest
value for a particular metric (for example, the number of user commits per
second
). I can also tell Oracle 10gR1 to build thresholds based on an
additional percentage beyond a baseline value – say, 125% of the number
of user commits per second for the baseline time period.

Even with these improved detection methods, however, Oracle
10gR1 still has two slight drawbacks. First, since it relies upon strict arithmetic
methods for detecting a threshold violation, Oracle 10gR1 can only notify me
when a threshold has been breached. The current value of user commits per
second may be hovering very near but just below the warning threshold value for
several hours, and I would never know that a potential problem is brewing.
Also, when a warning threshold is constantly being breached, I really have only
two choices: ignore the warning as a likely “false positive,” or
continually raise the threshold until I no longer receive the warning.

This conundrum illustrates that I really need to know two
different things whenever a threshold is breached:

  • If a current metric were compared to a threshold as of a
    corresponding point in time, would it even register as a problem?
    For example,
    a value of 300 user commits / second may be entirely normal during my
    database’s nightly batch processing time frame between 7 PM and 10 PM, but
    highly out of the ordinary between the hours 10 AM and 2 PM, when my online
    transaction processing users are performing order entry tasks. To make this
    even more complex, I may need to know this variance on an hour-by-hour
    comparison for some metrics, while for others I’d like to know the variance
    based on a day-by-day (e.g. Monday vs. Tuesday) or weekday vs.
    weekend
    comparison.

  • Is the current level of this metric truly significant?
    Continuing the same scenario, do I really need (or desire!) to be notified
    whenever the number of user commits per second barely exceeds some
    relatively arbitrary threshold? And just as important, wouldn’t I want to know
    when a metric value is just below that same threshold for a significant
    period of time?

Oracle 10gR2 solves these issues with a new concept called adaptive
thresholds
. In a nutshell, an adaptive threshold measures the difference
between the current value for a metric and its corresponding
threshold
as of a particular, yet relative, point in time. Once
Oracle determines that a threshold breach has occurred, it also measures the significance
of the breach in relation to the threshold as a deflection from a nominal
value.

Establishing Metric Baselines

Adaptive thresholds rely upon metric baselines created from
representative samples of metrics taken over a time frame. Oracle will only
allow the creation of metric baselines if the data collected has enough
significance, but this can be resolved by altering the time frames or
granularity of the metrics collected. Metric baselines are further divided into
two types: moving window baseline periods and static baseline periods.

Moving Window Baseline Periods. As its name implies,
a moving window baseline period is based upon a rolling window of time that
extends a minimum of seven days into the past. Oracle also allows the creation
of moving window baseline periods that extend 7, 21, 35, and even 90 days into
the past. These baselines are most appropriate when the system or application
whose performance I am measuring have cyclical workloads that are relatively
predictable.

For example, a hybrid database typically has well-defined
activity cycles. Online transaction processing (OLTP) activity may occur during
the morning through the early evening, followed by batch processing during the
evening, and then followed by exports and backups in the early morning hours.
In this case, the database’s performance will follow a regular “heartbeat”
pattern that repeats every business day, while the weekday vs. weekend
performance pattern may be dramatically different.

Static Baseline Periods. Static baseline periods, on
the other hand, consist of a well-defined time period for which I want to
gather and preserve statistics for later comparison. I can create a static
baseline period based on any two points in time – as long as sufficient data
for the metrics exist. For example, I may want to capture the performance of my
database during the last seven to ten days of the previous calendar quarter as
a static baseline so that I can compare that performance during the upcoming
quarter-ending period, thus monitoring the database for any significant
deflections from the norm.

Time Groups. Regardless of the type of baseline
period selected, Oracle 10gR2 permits me to group data within different time
periods known as time groups. This insures that thresholds have
sufficient significant data for metric baseline creation. Time groups are
broken down into several different patterns as illustrated in Table 1.
Once metric baselines data have been gathered, they can be distributed within a
combination of these daily and/or weekly grouping schemes to insure that there
is enough significance for applying adaptive threshold monitoring.

Table 1. Time Groups

Daily Groupings Available:

Time Grouping

Statistics Grouped Within

Hour of Day

Metrics are aggregated
within every hour of the day. Most useful when there are significant
variations to be tracked within individual hours of every day of the week

Day and Night

Metrics are aggregated
within two time periods: daytime (0700 – 1900) and nighttime (1900-0700).
Most useful when obvious differences in system performance need to be
monitored between day and night

All Hours

Metrics are aggregated
within each day of the week. Best used when no discernable pattern is obvious
between daytime and nighttime performance

Weekly Groupings Available:

Time Group

Statistics Grouped Within

Day of Week

Metrics are aggregated
within every day of the week. Most useful when there are significant
variations to be tracked within individual days of the week

Weekdays and Weekends

Metrics are aggregated
within two ranges: Monday thru Friday and Saturday thru Sunday. Most useful
in differentiating between normal business or work days vs. weekend periods

All Days

Metrics are aggregated for
the week itself. Best used when no discernable pattern is obvious between
individual days, or weekends vs. weekdays

Adaptive Thresholds: It’s All About the Deflection

Once I have set up the appropriate baseline periods and
defined the appropriate time group to insure sufficient significance, I can set
adaptive thresholds based on one of two measurement methodologies: percent
of maximum
thresholds or significance level thresholds.

Significance Level Thresholds. These are percentile-based
and measure the deflection from the nominal value based upon one of four
possible levels of deviation from the norm. A significance level threshold
therefore helps to eliminate false positives while highlighting those truly
significant variances from an expected value. Table 2 lists the
available significance levels:

Table 2. Significance Level Thresholds

Significance

Percentile

Occurrences

High

0.95

5 in 100
occurrences

Very High

0.99

1 in 100
occurrences

Severe

0.999

1 in 1,000
occurrences

Extreme

0.9999

1 in 10,000
occurrences

Percent of Maximum Thresholds. This type of threshold
setting uses a simple arithmetic calculation — the percentage of deflection
based on the value at the 99th percentile for the selected metric
baseline – to determine if a threshold has been violated significantly.

Tracked Thresholds. Of the several dozen metrics that
Oracle 10g provides across the entire database, it is important to note that
Oracle 10gR2 only provides the ability to track adaptive thresholds for 15 of
the most meaningful, broken down across three areas of focus:

  • Performance metrics track three indicators of the amount
    of effort
    the database is expending to provide the current level of
    performance.

  • Workload Volume metrics focus on eight of the most
    meaningful indicators of how much work the database is performing.

  • Workload Type metrics depend on four indicators that
    describe how the workload is distributed within the database’s overall
    performance.

Table 3 lists the 15 metrics that support adaptive
thresholds, the rates and ratios being tracked, and what each metric indicates:

Table 3. Adaptive Thresholds Tracked

Performance Metrics

Threshold

Rate

Description

Database Time

Centiseconds / Second

The amount of time the
database is spending doing actual work, i.e., not in idle status or spent
“spinning”

Response Time

Per Transaction

How much time it takes to
complete per each logical transaction

System Response Time

Per Second

How well the system is
responding to requests

Workload Volume Metrics

Threshold

Rate

Description

Executions

Per Second

How many statements have
been executed

Redo Generated

Per Second

How much redo transaction
volume has been generated

Network Bytes

Per Second

How many bytes are being
transmitted across the network

Physical Writes

Per Second

How many physical block
writes have occurred

Physical Reads

Per Second

How many physical block
reads have occurred

Current Logons

Count

How many user sessions have
logged into the system

User Calls

Per Second

How many user calls have
been processed

Number of Transactions

Per Second

How many transactions have
been processed

Workload Type Metrics

Threshold

Rate

Description

Database Blocks Changed

Per Transaction

How many database blocks
have been modified. A good indicator of how much DML vs. queries have
occurred

Enqueue Requests

Centiseconds

How long a transaction has
been waiting (e.g. for a row lock to clear)

Total Parses

Per Transaction

How many times a SQL
statement has had to be parsed, either soft (already existing in the Library
Cache) or hard (not found in the Library Cache)

Session Logical Reads

Per Transaction

The average number of blocks
that have been read from the database buffer cache – a good indicator of how
many blocks are being read per each logical unit of work

Next Steps

In Part 2 of this series, I’ll provide actual examples of
how to create metric baselines using both static metrics and rolling time
periods and then demonstrate how to configure adaptive thresholds using these
metric baselines so that you can elevate your Oracle 10gR2 database’s warning
and critical thresholds to new heights of sensitivity and meaningfulness.

References and Additional Reading

Even though I have hopefully provided enough technical
information in this article to encourage you to explore with these features, I
also strongly suggest that you first review the corresponding detailed Oracle
documentation before proceeding with any experiments. Actual implementation of
these features should commence only after a crystal-clear understanding exists.
Please note that I have drawn upon the following Oracle 10gR2 documentation for
the deeper technical details of this article:

B14214-01 Oracle 10gR2
Database New Features Guide

B14231-01 Oracle 10gR2
Database Administrator’s Guide

B14258-01 Oracle 10gR2 PL/SQL
Packages and Types Reference

»


See All Articles by Columnist
Jim Czuprynski

Jim Czuprynski
Jim Czuprynski
Jim Czuprynski has accumulated over 30 years of experience during his information technology career. He has filled diverse roles at several Fortune 1000 companies in those three decades - mainframe programmer, applications developer, business analyst, and project manager - before becoming an Oracle database administrator in 2001. He currently holds OCP certification for Oracle 9i, 10g and 11g. Jim teaches the core Oracle University database administration courses on behalf of Oracle and its Education Partners throughout the United States and Canada, instructing several hundred Oracle DBAs since 2005. He was selected as Oracle Education Partner Instructor of the Year in 2009. Jim resides in Bartlett, Illinois, USA with his wife Ruth, whose career as a project manager and software quality assurance manager for a multinational insurance company makes for interesting marital discussions. He enjoys cross-country skiing, biking, bird watching, and writing about his life experiences in the field of information technology.

Latest Articles