Becoming the Master of Disaster
April 22, 2003
As in most shops, we have designed our production backup scheme to run overnight during off-peak hours. We have the luxury of a relatively small production database (330 GB) at about 20% utilization, so nightly incremental level 0 RMAN backups consume only about 45-50 GB of disk space and complete in approximately four hours. This schedule also gives me considerable flexibility in rolling forward from a potential disaster, including point-in-time (incomplete) recovery via RMAN.
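A minimal sketch of what such a nightly level 0 job might contain, written as a Python helper that generates the RMAN command text. The channel name `d1`, the backup directory, and the use of `PLUS ARCHIVELOG` (which assumes the database runs in ARCHIVELOG mode) are illustrative assumptions, not the author's actual configuration.

```python
def level0_backup_script(backup_dir: str) -> str:
    """Return RMAN commands for a nightly incremental level 0 backup.

    Hypothetical example: channel name and directory are placeholders.
    """
    # %U asks RMAN to generate a unique name for every backup piece.
    fmt = f"{backup_dir.rstrip('/')}/%U"
    return "\n".join([
        "RUN {",
        f"  ALLOCATE CHANNEL d1 DEVICE TYPE DISK FORMAT '{fmt}';",
        # Level 0 copies every used block and is the base for later
        # level 1 incrementals; PLUS ARCHIVELOG also backs up the
        # archived redo needed to roll forward (assumes ARCHIVELOG mode).
        "  BACKUP INCREMENTAL LEVEL 0 DATABASE PLUS ARCHIVELOG;",
        "  RELEASE CHANNEL d1;",
        "}",
    ])
```

A scheduler such as cron could write this text to a command file and invoke RMAN with something like `rman target / cmdfile=nightly_l0.rcv log=nightly_l0.log`; check the exact command-line syntax for your release before relying on it.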
to RMAN backups.
media storage of backup files.
Another word about alternate media backups: Offsite storage is strongly recommended for at least some of the backup tapes. We currently send a complete set of backups to a remote site once a week for vaulted storage, with a guaranteed one-hour turnaround on any individual tape (for a small fee, of course).
If you're having a hard time imagining why you'd ever need offsite storage for backups, here's a classic Oracle "urban legend" I heard at a recent seminar. A panicked DBA called Oracle for help because his production server had been destroyed when a truck backed through his company's loading dock, which shared a wall with the server room. Part of the collapsed wall crashed down directly on top of the production server, destroying it. The DBA had an alternate server available and had been backing up his database to tape.
Unfortunately, the backup tapes were stored - you guessed it - on top of the production server.
the disaster recovery plan.
After my experiences a few Saturdays ago, I reviewed all the media failure possibilities, including the loss of one or more datafiles containing SYSTEM, UNDO/rollback, index, and data segments. Then I constructed scenarios under which each might fail and outlined my expected course of action for each one. Finally, I devised methods to simulate each failure.
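One way to keep such a scenario list organized is a small catalog. This is a hypothetical sketch: the class and field names are invented, and the expected effects and required database states reflect common Oracle guidance (SYSTEM and active undo files are usually recovered with the database mounted; index and data files can usually be recovered while the rest of the database stays open). Treat each entry as an assumption to confirm by actually running the simulation on your own release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureScenario:
    lost_segments: str    # contents of the lost datafile
    expected_effect: str  # what we expect to observe (an assumption to test)
    recover_state: str    # "MOUNT": database closed; "OPEN": rest stays online


# Illustrative entries only; verify each one through simulation.
SCENARIOS = {
    "SYSTEM": FailureScenario(
        "SYSTEM", "instance usually cannot continue", "MOUNT"),
    "UNDO": FailureScenario(
        "UNDO/rollback (active)", "transactions needing that undo fail", "MOUNT"),
    "INDEX": FailureScenario(
        "index", "statements using indexes in that file fail", "OPEN"),
    "DATA": FailureScenario(
        "data", "access to tables in that tablespace fails", "OPEN"),
}


def closed_database_scenarios() -> list[str]:
    """Keys of scenarios whose recovery requires the database to be closed."""
    return sorted(k for k, s in SCENARIOS.items() if s.recover_state == "MOUNT")
```

With these entries, `closed_database_scenarios()` returns `['SYSTEM', 'UNDO']`, which makes it easy to see which rehearsals require scheduling a full outage of the test database.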
To simulate media failures of the various segment types, I configured a RAID-0 drive on one of our development servers and then restored copies of a test database so that the appropriate datafiles resided on that drive. While our QA manager generated activity against that datafile by running application code that accessed its tablespace, I simply pulled the drive out of the disk array. I compared the actual results of the simulated failure against my expectations, and then attempted to restore and recover the damaged datafile using the appropriate RMAN scripts.
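A sketch of the two basic shapes such a restore-and-recover script can take, again generated from Python. Datafile numbers are placeholders, the function name is hypothetical, and the sketch assumes automatic channels were already set up with RMAN's `CONFIGURE` command, so no `ALLOCATE CHANNEL` lines are emitted. Whether the database can stay open is supplied by the caller rather than detected.

```python
def datafile_recovery_script(file_no: int, database_open: bool) -> str:
    """Return RMAN commands that restore and recover one datafile.

    Hypothetical example; assumes automatic channels are configured.
    """
    n = file_no
    if database_open:
        # Index or data file: take only the damaged file offline while
        # the rest of the database stays available.
        lines = [
            f"SQL 'ALTER DATABASE DATAFILE {n} OFFLINE';",
            f"RESTORE DATAFILE {n};",
            # Roll forward using incremental backups and archived redo.
            f"RECOVER DATAFILE {n};",
            f"SQL 'ALTER DATABASE DATAFILE {n} ONLINE';",
        ]
    else:
        # SYSTEM or active undo file: database mounted but not open.
        lines = [
            "STARTUP FORCE MOUNT;",  # abort any surviving instance, then mount
            f"RESTORE DATAFILE {n};",
            f"RECOVER DATAFILE {n};",
            "ALTER DATABASE OPEN;",
        ]
    return "\n".join(lines)
```

Keeping the open and mounted variants side by side makes it obvious during a rehearsal which scenarios require an outage and which can be handled while users keep working.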
I ran into some unexpected challenges with my initial RMAN recovery scripts, because some of the commands to rename and switch datafiles during restoration are slightly different from those used when restoring from "hot" or "cold" backups of datafiles and tablespaces. However, the lessons I learned while evaluating these scenarios have proven invaluable, since I now have working RMAN scripts for each specific scenario.
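As a plausible illustration of that difference, suppose the datafile must be restored to a different disk, as it would be after its drive was pulled. In RMAN, `SET NEWNAME` inside a `RUN` block redirects the restore, and `SWITCH DATAFILE` then records the new location in the control file. With user-managed hot or cold backups, you copy the file yourself and issue `ALTER DATABASE RENAME FILE`. The function names, file number, and paths below are hypothetical, and this is not the author's script.

```python
def relocate_and_recover(file_no: int, new_path: str) -> str:
    """RMAN: restore a datafile to a new location, then recover it."""
    n = file_no
    return "\n".join([
        "RUN {",
        f"  SQL 'ALTER DATABASE DATAFILE {n} OFFLINE';",
        # Redirect the restore to a surviving disk.
        f"  SET NEWNAME FOR DATAFILE {n} TO '{new_path}';",
        f"  RESTORE DATAFILE {n};",
        # Make the control file point at the restored copy's new name.
        f"  SWITCH DATAFILE {n};",
        f"  RECOVER DATAFILE {n};",
        f"  SQL 'ALTER DATABASE DATAFILE {n} ONLINE';",
        "}",
    ])


def user_managed_rename(old_path: str, new_path: str) -> str:
    """User-managed backups: after copying the file yourself, rename it in SQL."""
    return f"ALTER DATABASE RENAME FILE '{old_path}' TO '{new_path}';"
```

The ordering matters: the `SWITCH` must come after the `RESTORE` has written the file to its new location and before `RECOVER` applies redo to it.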
The result? I am now fully confident that, in the worst case of a partial or complete media failure of my production databases, I can restore and recover the appropriate datafiles from an RMAN backup set without having to work it out for the first time under the gun, with one hand on the manual and the other on the keyboard!
Jim Czuprynski is an Oracle DBA for a telecommunications company in Schaumburg, IL. He can be contacted at firstname.lastname@example.org.