Don’t Just Do Something, Stand There! Avoiding Junior DBA Mistakes

Synopsis. When an Oracle database suddenly
becomes unavailable, the immediate reaction is to do something — right now! —
to solve the problem, which sometimes makes the problem worse. This article
discusses some examples of when it’s really best to stop and think before
proceeding.

Military veterans I have spoken with have often described
their combat experiences as days and weeks of monotony and boredom punctuated
by minutes and seconds of sheer terror. While I have hardly ever been in physical
danger as an Oracle DBA, I have certainly experienced some white-knuckled
terror when a database has suddenly become unavailable – usually at a most
inopportune time for my clients.

As Oracle DBAs, our immediate reaction is to do something
– right now! – to solve the problem. Moreover, it usually does not help
when your supervisor is standing behind you asking what’s wrong, and how long
will it be until the database is available again. While doing something
immediately is certainly a noble goal, in many cases it is just the opposite
of the proper action. Here are a few examples:

The Slow Reboot

I was called out of a meeting when a fellow IT employee
reported that many of our users were having difficulty accessing our company’s
primary order entry application. Like any good DBA, I made a beeline for the
server room to check on the status of the database server. What I found sent my
heart racing: the database server had rebooted itself for some reason after a
Windows 2000 “blue screen of death” (BSOD) error, and in fact was still
rebooting when I arrived on the scene.

Since I knew I had configured the Oracle database service
and instance to restart itself automatically in these circumstances, I headed
back to my desk to start checking on the status of the myriad applications that
depend on the database. However, after a long five minutes or so – more than
enough time for the server to reboot – we were still getting reports of
inaccessibility from our user community.

Back to the server room! I immediately checked the
database alert log and found an error message indicating that one of the disks
storing the database’s datafiles could not be located. Now my heart was really
pounding, and I was already thinking about which Recovery Manager backups would
have to be applied to restore the datafile, and whether all the archived redo logs
I would need to recover were present on the server… and then I recalled a
similar incident when I had been testing the same “cold reboot” disaster
recovery scenarios earlier in the year.

Therefore, I simply shut down the database instance,
restarted the service and attempted to restart the database. Almost magically,
everything worked. Oracle found the datafile on the expected drive, the
database restarted cleanly, and SMON performed a perfect instance recovery.
The database was back online in a few moments, and my heartbeat returned to
normal.

It turned out that during the server reboot, the Windows
2000 operating system had just fallen a bit behind in recognizing the “missing”
drive, and when it was time to mount the database, Windows told Oracle the
drive “wasn’t there.” Of course, by the time I performed a shutdown and
restart, Windows had finished making the disk drive available, and voila!
everything was once again copasetic. Our team eventually found the root cause
of the BSOD with some help from Microsoft. We traced it down to a Windows
driver conflict between the OS and SNMP-based software that is supposed to
monitor the server for hardware failures, including (ironically) disk drive
failures.
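
The timing problem described above, where the database tried to mount before Windows had finished presenting the drive, can be guarded against with a short pre-start check. The following is a minimal sketch in Python, not what was actually done on this server: the function name wait_for_paths, its default timeout and polling interval, and the idea of calling it from a startup script are all assumptions made for illustration.

```python
import os
import time
from typing import Callable, Iterable, List


def wait_for_paths(
    paths: Iterable[str],
    timeout_s: float = 300.0,   # assumed limit; tune to how long the OS needs
    interval_s: float = 5.0,    # assumed polling interval
    exists: Callable[[str], bool] = os.path.exists,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Poll until every path exists or the timeout expires.

    Returns the paths that are still missing; an empty list means
    everything the instance needs is visible to the operating system.
    """
    pending = list(paths)
    deadline = clock() + timeout_s
    while True:
        # Keep only the paths the OS still cannot see.
        pending = [p for p in pending if not exists(p)]
        if not pending or clock() >= deadline:
            return pending
        sleep(interval_s)
```

A startup script could call this with the directories or datafiles the instance depends on and start the Oracle service only when the result is empty. A non-empty result means something is still missing and deserves a closer look before anything is restored.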

But what if I hadn’t resisted my urge to do something
right away to fix the problem? I could have made the situation much
worse by performing a time-consuming restoration and recovery in the midst of
our company’s busiest operational period, and possibly corrupted a datafile
that was not even damaged.
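
One way to make "stop and think" concrete is to check whether the file the database complained about is actually visible now, before reaching for backups. The sketch below is hypothetical: the article does not say which error appeared in the alert log, so it assumes the common ORA-01110 message format that names the affected datafile, and the function names and status strings are invented for illustration. A file being visible only shows that the operating system can see it; whether it is consistent is still for Oracle to decide at startup.

```python
import os
import re
from typing import Callable, Dict, Iterable

# Assumed message format. A hypothetical example line:
#   ORA-01110: data file 7: 'E:\ORADATA\PROD\USERS01.DBF'
ORA_01110 = re.compile(r"ORA-01110: data file (\d+): '([^']+)'")


def datafiles_reported(alert_lines: Iterable[str]) -> Dict[int, str]:
    """Map datafile number to path for every ORA-01110 line found."""
    found: Dict[int, str] = {}
    for line in alert_lines:
        match = ORA_01110.search(line)
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


def triage(
    alert_lines: Iterable[str],
    exists: Callable[[str], bool] = os.path.exists,
) -> Dict[str, str]:
    """Report, per datafile path, whether the OS can see the file right now."""
    return {
        path: "visible" if exists(path) else "missing"
        for path in datafiles_reported(alert_lines).values()
    }
```

If every reported file comes back as "visible", a clean shutdown and restart is a cheap first step; "missing" points toward a storage problem to investigate before any restore is considered.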

The Unredundant Redundant Power Supply

Our system administration team found out that our
hardware supplier had shipped all media storage cabinets without
redundant power supplies – including the four cabinets for our primary
production database server. They therefore decided to install redundant power supplies
for all our media cabinets. When they received the hardware, they began testing
the hot-swap capabilities of the power supplies, and reported that we should be
able to simply shut down our databases, swap in the redundant power supplies,
and power them up – we did not even need to power down the media cabinets
themselves. We scheduled our installation for late on a Friday evening to limit
disruption to our customers.

The Friday arrived, the database was shut down and we
began swapping in the new power supplies. As we powered up the last array, we
noticed the cooling fan for that array’s original power supply was running
noisily – never a good sign – so the system administrator pulled that power
supply to investigate. What we found knocked us off our feet: Three of the four
retention bolts for the fan had been sheared away during normal operation! Thankfully,
we had ordered spare cooling fans, too, so the system administrator ran to the
spares room to get one.

Throughout the process, I had been watching the arrays
intently for any signs of problems. While we were waiting for the spare fan to
arrive, the fourth array suddenly shut itself down even though the
“redundant” power supply was still operating. (We later found out this was
a safety feature built into the media cabinet to prevent potential data loss.)

Since this media cabinet contained the mirrored drives
for its companion above it, the OS suddenly got very upset, and began to claim
that several disk drives were now unavailable to the server – drives that
contained half of the database’s datafiles. When the second power supply
was reinstalled, the drives went into immediate RAID recovery mode, and it
appeared that they would be available again for use within four hours.

This was in my early days as an Oracle DBA, so I had
never experienced a potential failure like this before. My immediate concern
was that the database would be in an inconsistent state or completely shredded,
and my immediate reaction was to activate our Oracle 9i Data Guard standby
server and pick up the pieces the next morning. However, my system
administrator stopped me just in time and explained that the drives would be
just fine once they had completed synchronizing themselves with their mirrored
“twins.” Since we had no pressing need for the database until Saturday morning,
I decided to wait out the RAID recovery period.

Once again, everything was fine. The Windows OS
recognized the drives, I was able to restart the database, and still get home
in time for a late dinner. However, if I had gone with my original plan to
activate the Data Guard standby database, I would have wasted an enormous amount
of effort for what was a trivial issue in retrospect.

Jim Czuprynski
Jim Czuprynski has accumulated over 30 years of experience during his information technology career. He has filled diverse roles at several Fortune 1000 companies in those three decades - mainframe programmer, applications developer, business analyst, and project manager - before becoming an Oracle database administrator in 2001. He currently holds OCP certification for Oracle 9i, 10g and 11g. Jim teaches the core Oracle University database administration courses on behalf of Oracle and its Education Partners throughout the United States and Canada, instructing several hundred Oracle DBAs since 2005. He was selected as Oracle Education Partner Instructor of the Year in 2009. Jim resides in Bartlett, Illinois, USA with his wife Ruth, whose career as a project manager and software quality assurance manager for a multinational insurance company makes for interesting marital discussions. He enjoys cross-country skiing, biking, bird watching, and writing about his life experiences in the field of information technology.
