Becoming the Master of Disaster

A few Saturdays ago, I performed a planned viability test of my Oracle 9iR2 hot
standby database. I terminated the transmission of archived redo logs from the
primary site, activated the standby database, and compared results between
primary and standby sites. As expected, row counts, dollar totals, and a few
other measures matched up perfectly. Satisfied, I kicked off the process to
copy RMAN backups from the primary site in preparation for restoring the
standby site to its standby role, and went home until Monday, leaving the newly
activated database running over the weekend.
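The comparison step of a viability test like this one lends itself to scripting. The following is only a minimal sketch in Python, not necessarily the procedure used in the test described above. It assumes the row counts and dollar totals have already been gathered from each site into dictionaries. The function name compare_measures, the measure names, and the figures are all hypothetical.

```python
from decimal import Decimal


def compare_measures(primary, standby):
    """Return (name, primary_value, standby_value) for every measure that differs.

    A measure that is missing on one side is reported with None for that side.
    """
    mismatches = []
    for name in sorted(set(primary) | set(standby)):
        p = primary.get(name)
        s = standby.get(name)
        if p != s:
            mismatches.append((name, p, s))
    return mismatches


# Hypothetical measures collected from each site after activating the standby.
primary_site = {
    "orders.row_count": 125000,
    "orders.total_amount": Decimal("98765.43"),
}
standby_site = {
    "orders.row_count": 125000,
    "orders.total_amount": Decimal("98765.43"),
}

differences = compare_measures(primary_site, standby_site)
if differences:
    for name, p, s in differences:
        print(f"MISMATCH {name}: primary={p} standby={s}")
else:
    print("All measures match")
```

Dollar totals are held as Decimal values rather than floats so that an exact equality check is meaningful.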

When I arrived early on Monday to run a few more tests of our applications against the
standby site, I was surprised to discover the instance had crashed. After
investigation, I found out that all of the drives had failed on one of
the standby site’s two disk drive arrays. Since that array held drives that
contained datafiles for the system rollback segments, the rollback segment tablespace
was corrupted almost immediately. Further investigation revealed that the disk
array had failed because the array had only one power supply, even though a
second redundant power supply module could have been installed.

Even though this was a rather unexpected and reasonably unlikely failure, it could not have
come at a better time. It caused me to review our entire disaster recovery plan
for both the secondary and primary servers. I found out that none of the
production servers had been outfitted with redundant power supplies for the
disk arrays. Further reevaluation of my disaster recovery scenarios
showed that the loss of one of the arrays would have caused the loss of UNDO
segments on the production database – because it turns out those datafiles
weren’t mirrored properly either.

Both the primary and standby sites have been repaired now, of course, and everything is
copasetic. However, my cautionary tale underlines how critical a robust disaster
recovery plan can be in preventing and surviving a potential disaster.

Developing disaster recovery scenarios.

A good disaster recovery planner isn’t afraid to “think
about the unthinkable.” This entails developing the common disaster recovery
scenarios that could happen to your database and server.

Based on my experiences over the past several years as an Oracle DBA, the most serious of
these is media failure. A typical example of preventable media failure
involves under-utilization of RAID-0+1 or RAID-1 redundancy for critical data
files, log groups, and control files. Moreover, as I described in my earlier
tale of woe, it is a good idea to remember those pesky and often-overlooked
UNDO or rollback segments – it may be impossible to restart the database when
those tablespaces are damaged or corrupted due to media failure.
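One way to look for this kind of exposure is to list where every copy of each critical component lives and flag anything that sits on a single disk array. The sketch below is a minimal illustration in Python under stated assumptions: the inventory is entered by hand as (component, array) pairs, and the function name single_points_of_failure, the component labels, and the array names are hypothetical rather than anything taken from Oracle.

```python
from collections import defaultdict


def single_points_of_failure(copies):
    """Return the components whose copies all reside on fewer than two arrays.

    copies is an iterable of (component, array) pairs, one pair per
    physical copy of that component.
    """
    arrays_by_component = defaultdict(set)
    for component, array in copies:
        arrays_by_component[component].add(array)
    return sorted(
        component
        for component, arrays in arrays_by_component.items()
        if len(arrays) < 2
    )


# Hypothetical inventory: one entry per mirrored or multiplexed copy.
inventory = [
    ("control files", "array_1"),
    ("control files", "array_2"),
    ("redo log group 1", "array_1"),
    ("redo log group 1", "array_2"),
    ("undo tablespace", "array_1"),  # only one copy, on one array
    ("users tablespace", "array_2"),
    ("users tablespace", "array_2"),  # two copies, but on the same array
]

for component in single_points_of_failure(inventory):
    print(f"At risk if its array fails: {component}")
```

Counting distinct arrays rather than copies is deliberate: two copies on the same array are still lost together if that array goes down.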

Another set of disaster recovery scenarios with serious implications involves the partial
or complete loss of the database server itself. This might include damage
to the software needed to run the Oracle instance – for example, the loss of
critical operating system files – as well as physical damage, such as a failed
power supply, memory, or CPU module.

Hardware disasters can be more difficult to predict, and can be even harder to test,
since realistically a “test to destruction” of the hardware might have to be
performed to simulate some of the failures. However, even with robust modern
service agreements available from major hardware suppliers, it could be hours
or even days before the damaged server is repaired and ready to take the load
of a production database again, so these scenarios should not be ignored.

Once you’ve uncovered potential single points of failure and have painted some grim
pictures as to what might happen if those failures occurred, it’s time to turn
attention to the methods, practices, and hardware configurations that help
prevent a disaster.

Alternate production server.

If you are using Oracle’s Data Guard facilities to create and maintain
either a logical or physical “hot standby” server site, then you’ve already got
this angle covered. However, if you do not have an alternate server to which
you could quickly restore your production database, the ability to recover from
a serious hardware disaster will be much more in doubt.

One less robust alternative to a standby site is a quality-assurance (QA) database
server. This server should ideally be a close match to the hardware for the
production site to allow evaluation of the next set of application or database
changes about to be released to production. On one occasion before getting our
hot standby server in working order, I was forced to transfer our production
database over to our QA site because we had noticed some “flaky” performance of
the production server. As it turned out, we had guessed right – the production
server’s motherboard was facing an imminent failure, and failed shortly after
the transfer of responsibility. Though the QA server had only half the memory
and CPU power of the production site, having a QA server in my “back pocket”
saved the day.

Jim Czuprynski
Jim Czuprynski has accumulated over 30 years of experience during his information technology career. He has filled diverse roles at several Fortune 1000 companies in those three decades - mainframe programmer, applications developer, business analyst, and project manager - before becoming an Oracle database administrator in 2001. He currently holds OCP certification for Oracle 9i, 10g and 11g. Jim teaches the core Oracle University database administration courses on behalf of Oracle and its Education Partners throughout the United States and Canada, instructing several hundred Oracle DBAs since 2005. He was selected as Oracle Education Partner Instructor of the Year in 2009. Jim resides in Bartlett, Illinois, USA with his wife Ruth, whose career as a project manager and software quality assurance manager for a multinational insurance company makes for interesting marital discussions. He enjoys cross-country skiing, biking, bird watching, and writing about his life experiences in the field of information technology.
