SHARE

Oracle 10gR2 Adaptive Thresholds, Part 1: Overview

Written By

Jun 22, 2006

10 minute read

Synopsis. Oracle 10g Release 2 (10gR2) has
improved significantly the methodology for tracking performance metrics within
the database. This article – the first in this series – discusses how adaptive
thresholds are designed to improve the detection of threshold violations,
including the ability to discern a false positive from a true warning.

I first started taking advantage of Oracle Enterprise
Manager (OEM) events in Oracle 9iR2. I had a good reason, too: One of my
client’s applications was constantly malfunctioning and would regularly begin
to use up almost all of the database’s available Program Global Area (PGA)
memory. As a result, other user sessions would have virtually no additional
memory to perform sorting operations or, in a few extreme cases, block any
further connections to the database.

Our team activated event tracking for per-session PGA memory
usage by creating a custom user-defined event that Oracle 9i OEM
monitored once every five minutes. We created a special package that contained
various functions for monitoring which user sessions were using more than 30MB
of PGA memory for longer than two monitoring periods (i.e. 10 minutes).
Whenever OEM detected that a single application session was using excessive
amounts of memory, it automatically sent an e-mail to our DBA team so that we
could determine what the problem was.

This “distant early warning” system helped us immeasurably
because we could respond to excessive PGA utilization in a reasonably timely
fashion. However, after a few months of OEM monitoring, I noticed two
interesting patterns:

The upper thresholds for the warning and critical
conditions had to be constantly adjusted. As the load on our database
server gradually increased over time, our initial warning target of 30MB for
per-session PGA utilization tended to be violated more and more often, and we
would constantly receive warning messages. As a result, we had to readjust the
warning threshold limit gradually so that we’d only get warned about
significant events.

“False positives” tended to be ignored over time.
When we turned on notification for the thresholds, we had a natural tendency to
pay strict attention to every warning we received. After a while, however, I
noticed that human nature kicked in: we grew used to receiving warning
messages, and we began to treat them a bit too lackadaisically. You can guess
the result: Once in a while, we’d simply put off following up on a warning or
critical alert, and by the time we investigated, it was too late to find out
the cause of the alert.

Oracle 10g Metrics Collection and Interpretation: A New Performance Model

This story is a prime example of how Oracle releases prior
to Oracle 10g handled the detection of potential problems in the database.
Since Oracle 10g has greatly expanded the number and types of wait events
from about 290 in prior releases to over 800, it makes it that much simpler to
isolate the exact causes of poor database performance. In fact, the increased
granularity of these wait events is crucial to the success of the automatic
advisor features found in Automatic Database Diagnostic Monitor (ADDM) and its
related Advisor framework.

Oracle 10g also improved the concept of database
thresholds because a majority of the issues that cause Oracle DBAs the most
grief are being monitored automatically right “out of the box.” For example,
Oracle 10g automatically monitors tablespace space utilization and will issue a
warning or critical message when the tablespace’s size has reached a threshold
of 85% or 97%, respectively. Other conditions monitored include the ratio of
the time the database is spending in a wait state vs. actually performing work,
the average number of seconds it takes to complete a transaction (i.e. response
time), and how much redo each DML operation is generating.

Another nice feature of Oracle 10g Release 1 (10gR1) is the
ability to create a performance baseline from metrics gathered over the
past days, weeks and months. Once I’ve established and saved such a baseline, I
can easily instruct Oracle 10gR1 to build thresholds based upon the highest
value for a particular metric (for example, the number of user commits per
second). I can also tell Oracle 10gR1 to build thresholds based on an
additional percentage beyond a baseline value – say, 125% of the number
of user commits per second for the baseline time period.

Even with these improved detection methods, however, Oracle
10gR1 still has two slight drawbacks. First, since it relies upon strict arithmetic
methods for detecting a threshold violation, Oracle 10gR1 can only notify me
when a threshold has been breached. The current value of user commits per
second may be hovering very near but just below the warning threshold value for
several hours, and I would never know that a potential problem is brewing.
Also, when a warning threshold is constantly being breached, I really have only
two choices: ignore the warning as a likely “false positive,” or
continually raise the threshold until I no longer receive the warning.

This conundrum illustrates that I really need to know two
different things whenever a threshold is breached:

If a current metric were compared to a threshold as of a
corresponding point in time, would it even register as a problem? For example,
a value of 300 user commits / second may be entirely normal during my
database’s nightly batch processing time frame between 7 PM and 10 PM, but
highly out of the ordinary between the hours 10 AM and 2 PM, when my online
transaction processing users are performing order entry tasks. To make this
even more complex, I may need to know this variance on an hour-by-hour
comparison for some metrics, while for others I’d like to know the variance
based on a day-by-day (e.g. Monday vs. Tuesday) or weekday vs.
weekend comparison.
Is the current level of this metric truly significant?
Continuing the same scenario, do I really need (or desire!) to be notified
whenever the number of user commits per second barely exceeds some
relatively arbitrary threshold? And just as important, wouldn’t I want to know
when a metric value is just below that same threshold for a significant
period of time?

Oracle 10gR2 solves these issues with a new concept called adaptive
thresholds. In a nutshell, an adaptive threshold measures the difference
between the current value for a metric and its corresponding
threshold as of a particular, yet relative, point in time. Once
Oracle determines that a threshold breach has occurred, it also measures the significance
of the breach in relation to the threshold as a deflection from a nominal
value.

Establishing Metric Baselines

Adaptive thresholds rely upon metric baselines created from
representative samples of metrics taken over a time frame. Oracle will only
allow the creation of metric baselines if the data collected has enough
significance, but this can be resolved by altering the time frames or
granularity of the metrics collected. Metric baselines are further divided into
two types: moving window baseline periods and static baseline periods.

Moving Window Baseline Periods. As its name implies,
a moving window baseline period is based upon a rolling window of time that
extends a minimum of seven days into the past. Oracle also allows the creation
of moving window baseline periods that extend 7, 21, 35, and even 90 days into
the past. These baselines are most appropriate when the system or application
whose performance I am measuring have cyclical workloads that are relatively
predictable.

For example, a hybrid database typically has well-defined
activity cycles. Online transaction processing (OLTP) activity may occur during
the morning through the early evening, followed by batch processing during the
evening, and then followed by exports and backups in the early morning hours.
In this case, the database’s performance will follow a regular “heartbeat”
pattern that repeats every business day, while the weekday vs. weekend
performance pattern may be dramatically different.

Static Baseline Periods. Static baseline periods, on
the other hand, consist of a well-defined time period for which I want to
gather and preserve statistics for later comparison. I can create a static
baseline period based on any two points in time – as long as sufficient data
for the metrics exist. For example, I may want to capture the performance of my
database during the last seven to ten days of the previous calendar quarter as
a static baseline so that I can compare that performance during the upcoming
quarter-ending period, thus monitoring the database for any significant
deflections from the norm.

Time Groups. Regardless of the type of baseline
period selected, Oracle 10gR2 permits me to group data within different time
periods known as time groups. This insures that thresholds have
sufficient significant data for metric baseline creation. Time groups are
broken down into several different patterns as illustrated in Table 1.
Once metric baselines data have been gathered, they can be distributed within a
combination of these daily and/or weekly grouping schemes to insure that there
is enough significance for applying adaptive threshold monitoring.

Table 1. Time Groups
Daily Groupings Available:
Time Grouping	Statistics Grouped Within
Hour of Day	Metrics are aggregated within every hour of the day. Most useful when there are significant variations to be tracked within individual hours of every day of the week
Day and Night	Metrics are aggregated within two time periods: daytime (0700 – 1900) and nighttime (1900-0700). Most useful when obvious differences in system performance need to be monitored between day and night
All Hours	Metrics are aggregated within each day of the week. Best used when no discernable pattern is obvious between daytime and nighttime performance
Weekly Groupings Available:
Time Group	Statistics Grouped Within
Day of Week	Metrics are aggregated within every day of the week. Most useful when there are significant variations to be tracked within individual days of the week
Weekdays and Weekends	Metrics are aggregated within two ranges: Monday thru Friday and Saturday thru Sunday. Most useful in differentiating between normal business or work days vs. weekend periods
All Days	Metrics are aggregated for the week itself. Best used when no discernable pattern is obvious between individual days, or weekends vs. weekdays

Adaptive Thresholds: It’s All About the Deflection

Once I have set up the appropriate baseline periods and
defined the appropriate time group to insure sufficient significance, I can set
adaptive thresholds based on one of two measurement methodologies: percent
of maximum thresholds or significance level thresholds.

Significance Level Thresholds. These are percentile-based
and measure the deflection from the nominal value based upon one of four
possible levels of deviation from the norm. A significance level threshold
therefore helps to eliminate false positives while highlighting those truly
significant variances from an expected value. Table 2 lists the
available significance levels:

Table 2. Significance Level Thresholds
Significance	Percentile	Occurrences
High	0.95	5 in 100 occurrences
Very High	0.99	1 in 100 occurrences
Severe	0.999	1 in 1,000 occurrences
Extreme	0.9999	1 in 10,000 occurrences

Percent of Maximum Thresholds. This type of threshold
setting uses a simple arithmetic calculation — the percentage of deflection
based on the value at the 99^th percentile for the selected metric
baseline – to determine if a threshold has been violated significantly.

Tracked Thresholds. Of the several dozen metrics that
Oracle 10g provides across the entire database, it is important to note that
Oracle 10gR2 only provides the ability to track adaptive thresholds for 15 of
the most meaningful, broken down across three areas of focus:

Performance metrics track three indicators of the amount
of effort the database is expending to provide the current level of
performance.
Workload Volume metrics focus on eight of the most
meaningful indicators of how much work the database is performing.
Workload Type metrics depend on four indicators that
describe how the workload is distributed within the database’s overall
performance.

Table 3 lists the 15 metrics that support adaptive
thresholds, the rates and ratios being tracked, and what each metric indicates:

Table 3. Adaptive Thresholds Tracked
Performance Metrics
Threshold	Rate	Description
Database Time	Centiseconds / Second	The amount of time the database is spending doing actual work, i.e., not in idle status or spent “spinning”
Response Time	Per Transaction	How much time it takes to complete per each logical transaction
System Response Time	Per Second	How well the system is responding to requests
Workload Volume Metrics
Threshold	Rate	Description
Executions	Per Second	How many statements have been executed
Redo Generated	Per Second	How much redo transaction volume has been generated
Network Bytes	Per Second	How many bytes are being transmitted across the network
Physical Writes	Per Second	How many physical block writes have occurred
Physical Reads	Per Second	How many physical block reads have occurred
Current Logons	Count	How many user sessions have logged into the system
User Calls	Per Second	How many user calls have been processed
Number of Transactions	Per Second	How many transactions have been processed
Workload Type Metrics
Threshold	Rate	Description
Database Blocks Changed	Per Transaction	How many database blocks have been modified. A good indicator of how much DML vs. queries have occurred
Enqueue Requests	Centiseconds	How long a transaction has been waiting (e.g. for a row lock to clear)
Total Parses	Per Transaction	How many times a SQL statement has had to be parsed, either soft (already existing in the Library Cache) or hard (not found in the Library Cache)
Session Logical Reads	Per Transaction	The average number of blocks that have been read from the database buffer cache – a good indicator of how many blocks are being read per each logical unit of work

Next Steps

In Part 2 of this series, I’ll provide actual examples of
how to create metric baselines using both static metrics and rolling time
periods and then demonstrate how to configure adaptive thresholds using these
metric baselines so that you can elevate your Oracle 10gR2 database’s warning
and critical thresholds to new heights of sensitivity and meaningfulness.

References and Additional Reading

Even though I have hopefully provided enough technical
information in this article to encourage you to explore with these features, I
also strongly suggest that you first review the corresponding detailed Oracle
documentation before proceeding with any experiments. Actual implementation of
these features should commence only after a crystal-clear understanding exists.
Please note that I have drawn upon the following Oracle 10gR2 documentation for
the deeper technical details of this article:

B14214-01 Oracle 10gR2
Database New Features Guide

B14231-01 Oracle 10gR2
Database Administrator’s Guide

B14258-01 Oracle 10gR2 PL/SQL
Packages and Types Reference

»

See All Articles by Columnist Jim Czuprynski

Jim Czuprynski

Jim Czuprynski has accumulated over 30 years of experience during his information technology career. He has filled diverse roles at several Fortune 1000 companies in those three decades - mainframe programmer, applications developer, business analyst, and project manager - before becoming an Oracle database administrator in 2001. He currently holds OCP certification for Oracle 9i, 10g and 11g. Jim teaches the core Oracle University database administration courses on behalf of Oracle and its Education Partners throughout the United States and Canada, instructing several hundred Oracle DBAs since 2005. He was selected as Oracle Education Partner Instructor of the Year in 2009. Jim resides in Bartlett, Illinois, USA with his wife Ruth, whose career as a project manager and software quality assurance manager for a multinational insurance company makes for interesting marital discussions. He enjoys cross-country skiing, biking, bird watching, and writing about his life experiences in the field of information technology.

Oracle 10gR2 Adaptive Thresholds, Part 1: Overview

Oracle 10g Metrics Collection and Interpretation: A New Performance Model

Establishing Metric Baselines

Table 1. Time Groups

Daily Groupings Available:

Weekly Groupings Available:

Adaptive Thresholds: It’s All About the Deflection

Table 2. Significance Level Thresholds

Table 3. Adaptive Thresholds Tracked

Performance Metrics

Workload Volume Metrics

Workload Type Metrics

Next Steps

References and Additional Reading

Jim Czuprynski

Company

Categories