Serving up Server Alerts
May 23, 2007
How and when do alerts or informational messages about what's taking place inside your database make their way out to you, the DBA par excellence? There are several ways, some free and some hand-crafted, to expose alerts and messages. Let's face it: as the Oracle RDBMS engine becomes more and more Skynet-Terminator-Judgment Day-aware, keeping track of what's taking place inside an instance has become easier and harder at the same time. The easier aspect is evidenced by more sophisticated monitoring tools and interfaces; the harder part is borne out by the sheer number of metrics that are available to monitor.
Let's start off with a simple peek-inside-the-database option.
Tail the Alert Log
A commonly used quick-and-dirty monitoring tool in UNIX-based environments (AIX, HP-UX, Solaris, and Linux) is a simple script that tails X number of lines out of the alert log and then searches (greps) the extract for whatever is of interest to you. You can search for specific ORA-xxxxx errors or, to make things even simpler, base the search on any ORA- error at all. If an ORA- error appears, then an email is fired off via a mail transfer agent (MTA) to one or more addresses.
The steps can be summarized by the shell script pseudo code below:
#! /usr/bin/ksh
# Extract the tail end of the alert log and scan it for ORA- errors
tail -100 $ORACLE_HOME/bdump/alert_<SID>.log > alert.log
COUNT=`grep ORA- alert.log | wc -l`
if [ $COUNT -gt 0 ]
then
  mail -s "Check alert log" email@example.com < alert.log
fi
Several pieces need to be in place for this scheme. First, whoever is running the script (a person, or a machine user such as oracle) needs to have appropriate file system permissions to read the alert log under $ORACLE_HOME and to write the extract file wherever it lands.
Second, your MTA can be as simple as mail (or mailx, depending on your flavor/version of UNIX). Chances are your UNIX admin already has UNIX mail working as no doubt much of his or her watchfulness is notification after the fact as opposed to scanning logs all day long (which is pretty much what this is for you as well).
Third, you need something to read mail yourself, which implies something along the lines of Outlook/Exchange Server in your company's office. Assuming you have been assimilated by the Borg, oops, I mean Microsoft, the email address shown in the example would stand out to those familiar with aliases or mail groups. Otherwise, have the script cat a file of email addresses and loop through them.
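The loop-through-a-file-of-addresses idea can be sketched as follows. The file name and addresses are assumptions for illustration, and the real mail command is shown commented out so the sketch prints what it would do:

```shell
#!/bin/sh
# Build a sample recipient file; in practice this file is maintained
# by hand (the path and addresses here are made up for this sketch).
ADDR_FILE=dba_emails.txt
printf 'dba1@example.com\ndba2@example.com\n' > $ADDR_FILE

for ADDR in `cat $ADDR_FILE`
do
  # The real script would do:
  #   mail -s "Check alert log" $ADDR < alert.log
  echo "would mail: $ADDR"
done
```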
Fourth, you need something to execute the tail job on a periodic basis, since you are pulling the alert log information as opposed to having it pushed to you, and what better than a cron job to manage this aspect of the process. The cron job can run every ten minutes (as an example) all week long. While cron is very reliable, the job cannot guarantee that it will catch an ORA- error. One way to help ensure that your tail of 100 lines does not miss the ORA- error at the 101st line (i.e., you missed it by one line) is to grab enough lines to increase the likelihood that the extract contains at least the last ten minutes of alert log activity. Better to grab too much of the alert log than not enough.
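A crontab entry for the ten-minute schedule might look like the fragment below. The script path is an assumption, and the explicit minute list is used because some older UNIX cron implementations do not support the */10 shorthand:

```
# Run the alert log check every ten minutes, all week long
0,10,20,30,40,50 * * * * /home/oracle/scripts/check_alert.ksh
```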
As a variation on what is emailed to you, don't include the entire alert log extract in the DBA alert email. You only need a subject line telling you to inspect the alert log, as opposed to sending (and waiting for) multiple KB worth of text, especially if you're receiving email on a PDA while on call.
Check for Required Processes
A variation on (or complement to) the alert log scan is an existence check for required processes. At a minimum, does the script need to check for PMON, SMON, DBWn, LGWR, and CKPT? Not really: checking for PMON by itself, as an example, is sufficient in and of itself. No PMON means no instance, which in turn means no running database (assuming a single instance/single database pairing).
Between an alert log scan and an "is my database up" instance check, the instance check is more of a superset of the alert log scan. Here is why: is an alert log even going to be written to if the instance is no longer running?
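A minimal sketch of such an existence check appears below. The SID value is an assumption, and the ps parsing is factored into a small function purely for illustration:

```shell
#!/bin/sh
# Existence check for PMON; no PMON means no instance.
# The SID value is an assumption for this sketch.
SID=db10

# Count PMON processes for the instance in a ps listing passed as $1.
pmon_count() {
  echo "$1" | grep "ora_pmon_${SID}" | grep -cv grep
}

PS_OUT=`ps -ef`
COUNT=`pmon_count "$PS_OUT"`
if [ "$COUNT" -eq 0 ]
then
  # Fire off a notification here, e.g.:
  #   mail -s "Instance ${SID} is down" dba@example.com < /dev/null
  echo "Instance ${SID} appears to be down"
fi
```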
Or, looked at another way, can an instance still be viable if it encounters or detects an ORA- error? Yes it can, and a deadlock is an excellent example of this scenario. Deadlock detected, trace file info written to the alert log, one session's transaction essentially cancelled, and life goes on, because absolutely nothing is wrong with the database. Remember, Oracle's philosophy on deadlocks is that when they do occur, it is because of something you caused via code, not a shortcoming or error within Oracle.
Knocking down an instance by killing a required process typically generates alert log information, and this can be easily demonstrated. On Windows, use the orakill utility to kill a SPID associated with a SID (the counterpart of kill -9 PID in UNIX). Use a query like the one below to obtain a SPID.
select c.name, b.spid, a.sid
from v$session a, v$process b, v$bgprocess c
where c.paddr <> '00'
and c.paddr = b.addr
and b.addr = a.paddr;

NAME  SPID         SID
----- ------------ ----------
PMON  288          170
MMAN  536          168
DBW0  2596         167
LGWR  3936         166
CKPT  3252         165
SMON  3400         164
RECO  2432         163
We'll use 288 (the SPID for PMON) as one of the parameters for orakill.
The alert log then records information about the instance failure, and you can see the ripple effect among the trace files related to other processes (not all alert entries are shown).
Tue May 22 01:46:24 2007
LGWR: terminating instance due to error 472
Tue May 22 01:46:25 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_ckpt_3252.trc:
ORA-00472: PMON process terminated with error
Tue May 22 01:46:26 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_dbw0_2596.trc:
ORA-00472: PMON process terminated with error
Tue May 22 01:46:31 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_reco_2432.trc:
ORA-00472: PMON process terminated with error
Tue May 22 01:46:31 2007
Errors in file c:\oracle\product\10.2.0\admin\db10\bdump\db10_smon_3400.trc:
ORA-00472: PMON process terminated with error
Instance terminated by LGWR, pid = 3936
Going beyond alert logs and background processes
We can get much more information about what's going on inside a database with the DBMS_SERVER_ALERT built-in PL/SQL package. In fact, more than 140 metrics are available, and the alert threshold values for many of them can be adjusted to suit your particular needs.
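As a quick sanity check on what is available, you can browse the metric names themselves. This sketch assumes 10g's V$METRICNAME view:

```
-- How many distinct metrics are registered?
select count(distinct metric_name) from v$metricname;

-- Sample a few metric names and their units
select metric_name, metric_unit
  from v$metricname
 where rownum <= 5;
```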
One alert or metric you may find useful involves the detection of blocking, the silent showstopper of Oracle. Blocking can go on for hours and hours with no discernible or externally noticeable signs that it is taking place. It is usually detected when users start to complain about hung sessions, followed by calls about not being able to log in, and when scripted jobs fail to complete (noticed by you or others). Aside from manually detecting blocking, wouldn't it be nice to be alerted when Oracle detects a blocking situation? In Oracle 10g, we can do exactly that.
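Setting a threshold on the blocked sessions metric via DBMS_SERVER_ALERT might look like the sketch below. The warning/critical values, observation period, consecutive occurrences, and instance name are assumptions to adapt to your environment:

```
begin
  dbms_server_alert.set_threshold(
    metrics_id              => dbms_server_alert.blocked_users,
    warning_operator        => dbms_server_alert.operator_ge,
    warning_value           => '1',   -- warn at 1 blocked session
    critical_operator       => dbms_server_alert.operator_ge,
    critical_value          => '3',   -- critical at 3 blocked sessions
    observation_period      => 1,     -- evaluate every minute
    consecutive_occurrences => 2,     -- must persist across 2 periods
    instance_name           => 'db10',
    object_type             => dbms_server_alert.object_type_session,
    object_name             => NULL);
end;
/
```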
One of the configurable metrics is for blocked user sessions, and it comes with its own graph. The Metric Value picture below is the result of the competing update statements shown in the SQL*Plus session windows (with an output of the blocking information below that).
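Outside of Database Control, triggered alerts can also be seen in the data dictionary by querying the DBA_OUTSTANDING_ALERTS view (alerts move to DBA_ALERT_HISTORY once cleared):

```
select reason, metric_value, creation_time
  from dba_outstanding_alerts;
```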
Blocking is really quite insidious, and blocked user sessions in an OLTP database can stack up in no time at all. From a customer service perspective, you can be certain your company would hate to have customers dissatisfied with the Web site that manages their personal account information, mailing/shipping preferences, and any number of service-oriented functions. With server-managed alerts, you can be one of the first to know about this situation, as opposed to being practically the last to know.
In the next article about serving up server alerts, we'll go into detail about two ways to configure and manage server alert/metric settings: using the DBMS_SERVER_ALERT package and its GUI counterpart in Database Control.