Don't Just Do Something, Stand There! Avoiding Junior DBA Mistakes
April 15, 2004
Synopsis. When an Oracle database suddenly becomes unavailable, the immediate reaction is to do something -- right now! -- to solve the problem, which sometimes makes the problem worse. This article discusses some examples of when it's really best to stop and think before proceeding.
Military veterans I have spoken with have often described their combat experiences as days and weeks of monotony and boredom punctuated by minutes and seconds of sheer terror. While I have hardly ever been in physical danger as an Oracle DBA, I have certainly experienced some white-knuckled terror when a database has suddenly become unavailable - usually at a most inopportune time for my clients.
As Oracle DBAs, our immediate reaction is to do something - right now! - to solve the problem. Moreover, it usually does not help when your supervisor is standing behind you asking what's wrong, and how long will it be until the database is available again. While doing something immediately is certainly a noble goal, in many cases it is just the opposite of the proper action. Here are a few examples:
The Slow Reboot
I was called out of a meeting when a fellow IT employee reported that many of our users were having difficulty accessing our company's primary order entry application. Like any good DBA, I made a beeline for the server room to check on the status of the database server. What I found sent my heart racing: the database server had rebooted itself for some reason from a Windows 2000 "blue screen of death" (BSOD) error, and in fact was still rebooting when I had arrived on the scene.
Since I knew I had configured the Oracle database service and instance to restart itself automatically in these circumstances, I headed back to my desk to start checking on the status of the myriad applications that depend on the database. However, after a long five minutes or so - more than enough time for the server to reboot - we were still getting reports of inaccessibility from our user community.
Back to the server room! I immediately checked the database alert log and found an error message indicating that one of the disks storing the database's datafiles could not be located. Now my heart is really pounding, and I am already thinking about which Recovery Manager backups would have to be applied to restore the datafile, whether all the archived redo logs I would need to recover were present on the server... and then I recalled a similar incident when I had been testing the same "cold reboot" disaster recovery scenarios earlier in the year.
Therefore, I just shut down the database instance, restarted the service and attempted to restart the database. Almost magically, everything was fine. Oracle found the datafile on the expected drive, the database restarted just fine, and SMON performed a perfect instance recovery. The database was back online in a few moments, and my heartbeat returned to normal.
It turned out that during the server reboot, the Windows 2000 operating system had just fallen a bit behind in recognizing the "missing" drive, and when it was time to mount the database, Windows told Oracle the drive "wasn't there." Of course, by the time I performed a shutdown and restart, Windows had finished making the disk drive available, and voila! everything was once again copasetic. Our team eventually found the root cause of the BSOD with some help from Microsoft. We traced it down to a Windows driver conflict between the OS and SNMP-based software that is supposed to monitor the server for hardware failures, including (ironically) disk drive failures.
But what if I hadn't resisted my urge to do something right away to fix the problem? I could have made the situation much worse by performing a time-consuming restoration and recovery in the midst of our company's busiest operational period, and possibly corrupted a datafile that was not even damaged.
The Unredundant Redundant Power Supply
Our system administration team found out that our hardware supplier had shipped all media storage cabinets without redundant power supplies - including the four cabinets for our primary production database server. They therefore decided to install redundant power supplies for all our media cabinets. When they received the hardware, they began testing the hot-swap capabilities of the power supplies, and reported that we should be able to simply shut down our databases, swap in the redundant power supplies, and power them up - we did not even need to power down the media cabinets themselves. We scheduled our installation for late on a Friday evening to limit disruption to our customers.
The Friday arrived, the database was shut down and we began swapping in the new power supplies. As we powered up the last array, we noticed the cooling fan for that array's original power supply was running noisily - never a good sign - so the system administrator pulled that power supply to investigate. What we found knocked us off our feet: Three of the four retention bolts for the fan had been sheared away during normal operation! Thankfully, we had ordered spare cooling fans, too, so the system administrator ran to the spares room to get one.
Throughout the process, I had been watching the arrays intently for any signs of problems. While we were waiting for the spare fan to arrive, the fourth array suddenly shut itself down even though the "redundant" power supply was still operating. (We later found out this was a safety feature built into the media cabinet to prevent potential data loss.)
Since this media cabinet contained the mirrored drives for its companion above it, the OS suddenly got very upset, and began to claim that several disk drives were now unavailable to the server - drives that contained half of the database's datafiles. When the second power supply was reinstalled, the drives went into immediate RAID recovery mode, and it appeared that they would be available again for use within four hours.
This was in my early days as an Oracle DBA, so I had never experienced a potential failure like this before. My immediate concern was that the database would be in an inconsistent state or completely shredded, and my immediate reaction was to activate our Oracle 9i DataGuard standby server and pick up the pieces the next morning. However, my system administrator stopped me just in time and explained that the drives would be just fine once they had completed synchronizing themselves with their mirrored "twins." Since we had no pressing need for the database until Saturday morning, I decided to wait out the media recovery period.
Once again, everything was fine. The Windows OS recognized the drives, I was able to restart the database, and still get home in time for a late dinner. However, if I had gone with my original plan to activate the DataGuard standby database, I would have wasted an enormous amount of effort for what was a trivial issue in retrospect.