Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted Jun 22, 2006

Oracle 10gR2 Adaptive Thresholds, Part 1: Overview

By Jim Czuprynski

Synopsis. Oracle 10g Release 2 (10gR2) has improved significantly the methodology for tracking performance metrics within the database. This article – the first in this series – discusses how adaptive thresholds are designed to improve the detection of threshold violations, including the ability to discern a false positive from a true warning.

I first started taking advantage of Oracle Enterprise Manager (OEM) events in Oracle 9iR2. I had a good reason, too: One of my client's applications was constantly malfunctioning and would regularly begin to use up almost all of the database's available Program Global Area (PGA) memory. As a result, other user sessions would have virtually no additional memory to perform sorting operations or, in a few extreme cases, block any further connections to the database.

Our team activated event tracking for per-session PGA memory usage by creating a custom user-defined event that Oracle 9i OEM monitored once every five minutes. We created a special package that contained various functions for monitoring which user sessions were using more than 30MB of PGA memory for longer than two monitoring periods (i.e. 10 minutes). Whenever OEM detected that a single application session was using excessive amounts of memory, it automatically sent an e-mail to our DBA team so that we could determine what the problem was.

This "distant early warning" system helped us immeasurably because we could respond to excessive PGA utilization in a reasonably timely fashion. However, after a few months of OEM monitoring, I noticed two interesting patterns:

The upper thresholds for the warning and critical conditions had to be constantly adjusted. As the load on our database server gradually increased over time, our initial warning target of 30MB for per-session PGA utilization tended to be violated more and more often, and we would constantly receive warning messages. As a result, we had to readjust the warning threshold limit gradually so that we'd only get warned about significant events.

"False positives" tended to be ignored over time. When we turned on notification for the thresholds, we had a natural tendency to pay strict attention to every warning we received. After a while, however, I noticed that human nature kicked in: we grew used to receiving warning messages, and we began to treat them a bit too lackadaisically. You can guess the result: Once in a while, we'd simply put off following up on a warning or critical alert, and by the time we investigated, it was too late to find out the cause of the alert.

Oracle 10g Metrics Collection and Interpretation: A New Performance Model

This story is a prime example of how Oracle releases prior to Oracle 10g handled the detection of potential problems in the database. Since Oracle 10g has greatly expanded the number and types of wait events from about 290 in prior releases to over 800, it makes it that much simpler to isolate the exact causes of poor database performance. In fact, the increased granularity of these wait events is crucial to the success of the automatic advisor features found in Automatic Database Diagnostic Monitor (ADDM) and its related Advisor framework.

Oracle 10g also improved the concept of database thresholds because a majority of the issues that cause Oracle DBAs the most grief are being monitored automatically right "out of the box." For example, Oracle 10g automatically monitors tablespace space utilization and will issue a warning or critical message when the tablespace's size has reached a threshold of 85% or 97%, respectively. Other conditions monitored include the ratio of the time the database is spending in a wait state vs. actually performing work, the average number of seconds it takes to complete a transaction (i.e. response time), and how much redo each DML operation is generating.

Another nice feature of Oracle 10g Release 1 (10gR1) is the ability to create a performance baseline from metrics gathered over the past days, weeks and months. Once I've established and saved such a baseline, I can easily instruct Oracle 10gR1 to build thresholds based upon the highest value for a particular metric (for example, the number of user commits per second). I can also tell Oracle 10gR1 to build thresholds based on an additional percentage beyond a baseline value – say, 125% of the number of user commits per second for the baseline time period.

Even with these improved detection methods, however, Oracle 10gR1 still has two slight drawbacks. First, since it relies upon strict arithmetic methods for detecting a threshold violation, Oracle 10gR1 can only notify me when a threshold has been breached. The current value of user commits per second may be hovering very near but just below the warning threshold value for several hours, and I would never know that a potential problem is brewing. Also, when a warning threshold is constantly being breached, I really have only two choices: ignore the warning as a likely "false positive," or continually raise the threshold until I no longer receive the warning.

This conundrum illustrates that I really need to know two different things whenever a threshold is breached:

  • If a current metric were compared to a threshold as of a corresponding point in time, would it even register as a problem? For example, a value of 300 user commits / second may be entirely normal during my database's nightly batch processing time frame between 7 PM and 10 PM, but highly out of the ordinary between the hours 10 AM and 2 PM, when my online transaction processing users are performing order entry tasks. To make this even more complex, I may need to know this variance on an hour-by-hour comparison for some metrics, while for others I'd like to know the variance based on a day-by-day (e.g. Monday vs. Tuesday) or weekday vs. weekend comparison.
  • Is the current level of this metric truly significant? Continuing the same scenario, do I really need (or desire!) to be notified whenever the number of user commits per second barely exceeds some relatively arbitrary threshold? And just as important, wouldn't I want to know when a metric value is just below that same threshold for a significant period of time?

Oracle 10gR2 solves these issues with a new concept called adaptive thresholds. In a nutshell, an adaptive threshold measures the difference between the current value for a metric and its corresponding threshold as of a particular, yet relative, point in time. Once Oracle determines that a threshold breach has occurred, it also measures the significance of the breach in relation to the threshold as a deflection from a nominal value.

Establishing Metric Baselines

Adaptive thresholds rely upon metric baselines created from representative samples of metrics taken over a time frame. Oracle will only allow the creation of metric baselines if the data collected has enough significance, but this can be resolved by altering the time frames or granularity of the metrics collected. Metric baselines are further divided into two types: moving window baseline periods and static baseline periods.

Moving Window Baseline Periods. As its name implies, a moving window baseline period is based upon a rolling window of time that extends a minimum of seven days into the past. Oracle also allows the creation of moving window baseline periods that extend 7, 21, 35, and even 90 days into the past. These baselines are most appropriate when the system or application whose performance I am measuring have cyclical workloads that are relatively predictable.

For example, a hybrid database typically has well-defined activity cycles. Online transaction processing (OLTP) activity may occur during the morning through the early evening, followed by batch processing during the evening, and then followed by exports and backups in the early morning hours. In this case, the database's performance will follow a regular "heartbeat" pattern that repeats every business day, while the weekday vs. weekend performance pattern may be dramatically different.

Static Baseline Periods. Static baseline periods, on the other hand, consist of a well-defined time period for which I want to gather and preserve statistics for later comparison. I can create a static baseline period based on any two points in time – as long as sufficient data for the metrics exist. For example, I may want to capture the performance of my database during the last seven to ten days of the previous calendar quarter as a static baseline so that I can compare that performance during the upcoming quarter-ending period, thus monitoring the database for any significant deflections from the norm.

Time Groups. Regardless of the type of baseline period selected, Oracle 10gR2 permits me to group data within different time periods known as time groups. This insures that thresholds have sufficient significant data for metric baseline creation. Time groups are broken down into several different patterns as illustrated in Table 1. Once metric baselines data have been gathered, they can be distributed within a combination of these daily and/or weekly grouping schemes to insure that there is enough significance for applying adaptive threshold monitoring.

Table 1. Time Groups

Daily Groupings Available:

Time Grouping

Statistics Grouped Within

Hour of Day

Metrics are aggregated within every hour of the day. Most useful when there are significant variations to be tracked within individual hours of every day of the week

Day and Night

Metrics are aggregated within two time periods: daytime (0700 – 1900) and nighttime (1900-0700). Most useful when obvious differences in system performance need to be monitored between day and night

All Hours

Metrics are aggregated within each day of the week. Best used when no discernable pattern is obvious between daytime and nighttime performance

Weekly Groupings Available:

Time Group

Statistics Grouped Within

Day of Week

Metrics are aggregated within every day of the week. Most useful when there are significant variations to be tracked within individual days of the week

Weekdays and Weekends

Metrics are aggregated within two ranges: Monday thru Friday and Saturday thru Sunday. Most useful in differentiating between normal business or work days vs. weekend periods

All Days

Metrics are aggregated for the week itself. Best used when no discernable pattern is obvious between individual days, or weekends vs. weekdays

Adaptive Thresholds: It's All About the Deflection

Once I have set up the appropriate baseline periods and defined the appropriate time group to insure sufficient significance, I can set adaptive thresholds based on one of two measurement methodologies: percent of maximum thresholds or significance level thresholds.

Significance Level Thresholds. These are percentile-based and measure the deflection from the nominal value based upon one of four possible levels of deviation from the norm. A significance level threshold therefore helps to eliminate false positives while highlighting those truly significant variances from an expected value. Table 2 lists the available significance levels:

Table 2. Significance Level Thresholds

Significance

Percentile

Occurrences

High

0.95

5 in 100 occurrences

Very High

0.99

1 in 100 occurrences

Severe

0.999

1 in 1,000 occurrences

Extreme

0.9999

1 in 10,000 occurrences

Percent of Maximum Thresholds. This type of threshold setting uses a simple arithmetic calculation -- the percentage of deflection based on the value at the 99th percentile for the selected metric baseline – to determine if a threshold has been violated significantly.

Tracked Thresholds. Of the several dozen metrics that Oracle 10g provides across the entire database, it is important to note that Oracle 10gR2 only provides the ability to track adaptive thresholds for 15 of the most meaningful, broken down across three areas of focus:

  • Performance metrics track three indicators of the amount of effort the database is expending to provide the current level of performance.
  • Workload Volume metrics focus on eight of the most meaningful indicators of how much work the database is performing.
  • Workload Type metrics depend on four indicators that describe how the workload is distributed within the database's overall performance.

Table 3 lists the 15 metrics that support adaptive thresholds, the rates and ratios being tracked, and what each metric indicates:

Table 3. Adaptive Thresholds Tracked

Performance Metrics

Threshold

Rate

Description

Database Time

Centiseconds / Second

The amount of time the database is spending doing actual work, i.e., not in idle status or spent "spinning"

Response Time

Per Transaction

How much time it takes to complete per each logical transaction

System Response Time

Per Second

How well the system is responding to requests

Workload Volume Metrics

Threshold

Rate

Description

Executions

Per Second

How many statements have been executed

Redo Generated

Per Second

How much redo transaction volume has been generated

Network Bytes

Per Second

How many bytes are being transmitted across the network

Physical Writes

Per Second

How many physical block writes have occurred

Physical Reads

Per Second

How many physical block reads have occurred

Current Logons

Count

How many user sessions have logged into the system

User Calls

Per Second

How many user calls have been processed

Number of Transactions

Per Second

How many transactions have been processed

Workload Type Metrics

Threshold

Rate

Description

Database Blocks Changed

Per Transaction

How many database blocks have been modified. A good indicator of how much DML vs. queries have occurred

Enqueue Requests

Centiseconds

How long a transaction has been waiting (e.g. for a row lock to clear)

Total Parses

Per Transaction

How many times a SQL statement has had to be parsed, either soft (already existing in the Library Cache) or hard (not found in the Library Cache)

Session Logical Reads

Per Transaction

The average number of blocks that have been read from the database buffer cache - a good indicator of how many blocks are being read per each logical unit of work

Next Steps

In Part 2 of this series, I'll provide actual examples of how to create metric baselines using both static metrics and rolling time periods and then demonstrate how to configure adaptive thresholds using these metric baselines so that you can elevate your Oracle 10gR2 database's warning and critical thresholds to new heights of sensitivity and meaningfulness.

References and Additional Reading

Even though I have hopefully provided enough technical information in this article to encourage you to explore with these features, I also strongly suggest that you first review the corresponding detailed Oracle documentation before proceeding with any experiments. Actual implementation of these features should commence only after a crystal-clear understanding exists. Please note that I have drawn upon the following Oracle 10gR2 documentation for the deeper technical details of this article:

B14214-01 Oracle 10gR2 Database New Features Guide

B14231-01 Oracle 10gR2 Database Administrator's Guide

B14258-01 Oracle 10gR2 PL/SQL Packages and Types Reference

» See All Articles by Columnist Jim Czuprynski



Oracle Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM


















Thanks for your registration, follow us on our social networks to keep up-to-date