June 10, 2009
It seems there is no end of tools or ways to monitor a database. Monitoring tools range from the expensive, high-end see-all be-all variety to homegrown, simplistic shell scripted ones. Before pursuing one or more approaches (and before spending money), identify what it is you want to accomplish. What is it you want to monitor, how that information should be surfaced, and by what means do you want to capture events are some of the questions for which you should have answers.
Categorize the severity
Starting with the most severe errors, pretty much anything that prevents users from connecting and completing transactions is a showstopper, and is something you should be notified about immediately. On the other hand, the low end may include events of interest, which are just above the trivial. The stuff in between? Maybe they are events you want to know about, but are not of such urgency that direct human interaction is required right then and there.
To formalize these categories, a set of tiers can be used.
Tier 1 events are characterized by anything having an immediate impact on the business and typically involve production systems. Instance and media failure are examples of where the system (the database proper) is unavailable. The instance is not running nor is the database open. Whatever the reason, users and processes cannot connect.
Tier 2 events have a limited impact, or are of a nature where no immediate response may result in an escalation to Tier 1. A check disk space or tablespace full type monitoring process can send a warning when a threshold value is reached. You want to know about an event in this category, and when you do become aware of it, you tend to have ample time to rectify the situation.
Tier 3 events have none to little impact on the business as a whole, and may be events related to after the fact. These are events you can wait to find out about when you get to work in the morning. Its not that these are inherently trivial, but theres practically nothing you could have done to prevent them in the first place, and when they do occur, life generally goes on. An example is the snapshot too old error. It could have been a one-time thing because of a users bad query being run late at night.
However, there are many other situations where non-database killing events qualify for Tier 1. Many of them are related to space issues. Other events can span tiers, depending on what is taking place. A failed database job is an example of this crossover. A once a month job that fails, but can take place later (such as the next day) is Tier 2, but if the job was supposed to run a critical process and it failed or ran late, youd want to know right away.
Speaking of time, the checks for Tier 1 events needs to be frequent. Further, if the monitoring system fails and that is your only check mechanism, then you are blinded to extinction level events within the database. Put another way, what monitoring system do you use to check the monitoring system? For the homegrown approach, its hard to beat the reliability of multiserver crosscheck. For example, to check if database A is available on server A, a 5-minute between checks interval on server A can be scheduled via a cron job. From server B however, you can use a 15-minute check interval against database A.
Checks in the other tiers can be performed less frequently. Once a day, or at the end of the event are two epochs. An example of something I use, which is based on these epochs, is a check on the completion of RMAN backups. A (Veritas) scheduled backup of datafiles for a particular database runs during a window of 0700 to 1000 every day. A simple script to scrape the last line of the spooled log file, testing for certain keywords, can trigger an alert in the form of an email if the job did not finish or failed. At the end of the day, a last check of every databases RMAN log files (datafile backup and archive log sweep) can be consolidated and sent via email.
What are some of the show stopping events, where something other than instance or media failure is preventing normal operations to take place?
Max segments reached in the dark ages, max extents started at 505, and now the typical limit is unlimited. If you migrated from dictionary managed to locally managed tablespaces, the max extents stayed at whatever was being used in the DMT. You can still get bit by max extent errors depending upon how you migrated and upgraded.
Out of space running out of space in an archived redo log destination is an attention getter as the database hangs until the space issue is resolved. This lasts forever until rectified. Other out of space events may cause operations to hang, but are not fatal (as they used to be) if using resumable space allocation and you allow enough time to address the problem.
Invalid objects if an object goes invalid, its because a user (to include a DBA) did something to invalidate it, but it could also be due to an Oracle-generated reason. A statistics gathering job, that is, something normal taking place within Oracle, may generate a self-inflicted event. However, if something else went horribly awry, you would want to know why there are scores of invalid objects all of a sudden.
Distributed transactions a distributed transaction failure could represent something serious or something inconvenient. In the serious case, connection to a remote server may have been lost, so what happens on your server? If it was based on a users ad hoc query, well, try again later--but if it involved DML, something critical could have failed. This one is kind of a wild card due to what was taking place at the time of the error.
Alert log checks probably one of the more classic checks is to run a difference between tail end copies of the alert log, and check for differences, especially where the new version has ORA in it. This follows that ORA errors raised (written) to the alert log are of relative importance to the database. Whats different here is that the alert log in and of itself is obviously not a show stopping event, but is the collector of serious errors.
The People Side of Monitoring
In a multiple DBA shop where the on-call responsibility rotates, alerts about Tier 1 events are directed to the on-call DBA. As to whether or not you want all DBAs to get a text/SMS alert sent to cell phones is up to you, but consider what happens if the primary on-call person cannot be reached. Tier 2 events can be emailed to the group, and notification about Tier 3 events may go to the principal owners of a database. It is a hugely self-defeating system where an on-call DBA receives an alert, but has to call and wake up the owner of the problem database when a routine operation needs to be performed especially when it is something that could have been taken care of during business hours the day before, or was entirely avoidable in the first place. Likewise, what is the point of responding to a transient invalid object check that resolves itself, but your monitoring system sends out an alert about the event, and it is an alert that requires human interaction such as rolling the alert log? Are you really doing your job, or are you building in things to make you look busy or important?
Monitor what is important to you and keep the amount of nice to know but not essential notifications to a minimum. When you are paged in the middle of the night, it should be for something serious (and fun to fix) as opposed to you having to react to mundane car alarms. Sometimes, the simplest solution is the best solution. If building in monitoring via crond scripts on a server, consider keeping a running log file of the alert. The benefit of using a robust monitoring system is its ability to maintain a history of notification events. Documentation by email is a poor way to maintain a central repository about the history of an event.