When we were implementing RMAN backups for our databases, we found few
articles or first-hand accounts of how someone had survived a real disaster.
This article walks through a disaster we recently had at our company: the
scenario, how we recovered our databases using RMAN backups, and the lessons
we learned in the process.
Disaster Scenario: We have over 100 Oracle databases, ranging from
version 8.1.7 to 9.2.0, spread across 45 HP-UX boxes. The Oracle software and
the database files on all these boxes resided on a SAN. Our SAN was
configured as 7+1 RAID. We lost eight physical drives in the SAN, which
translated to 104 logical volumes. Our first step was to find out which boxes
and databases were affected.
Understanding and Assessing the Disaster: The first symptom that we saw was
that the
load on the UNIX server had started to climb very rapidly. We saw loads around
40 when we did a top on the server. As DBAs we always relied on the alert log
to pick up any block error when the application accessed the corrupt data
block. However, for some reason, the alert logs were not reporting any error.
We started getting calls that the application was getting block corruption
errors and that’s when we realized that some databases were affected. Therefore,
the only way to check which databases were affected was to run a dbverify on
all of the database files. The dbverify was a CPU intensive process and it ran
for quite a while checking the database files. In addition to that, we also
rebooted our whole SAN infrastructure to make sure we caught all the disk
errors. After running the dbverify, we found that six databases needed
recovery, as we had lost one full file system on the box. On quite a few boxes
we had also lost the Oracle software, which we decided to restore from
our backups.
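The dbverify pass described above produces one report per data file, and with many files it helps to scan those reports mechanically. The following is a minimal sketch, assuming the report ends with the usual "Total Pages ..." summary lines; the exact wording may differ between Oracle releases, and the function names are hypothetical.

```python
import re

# Summary counters printed at the end of a dbv report. Any non-zero value
# means the file deserves a closer look. The label wording is an assumption
# and should be checked against the output of the release in use.
_COUNTER = re.compile(
    r"^Total Pages (Failing\s+\((?:Data|Index)\)|Marked Corrupt)\s*:\s*(\d+)",
    re.MULTILINE,
)

def dbv_problem_pages(report: str) -> int:
    """Sum the failing and corrupt page counters found in one dbv report."""
    matches = _COUNTER.findall(report)
    if not matches:
        # No summary at all usually means dbv could not read the file.
        raise ValueError("no dbv summary found in report")
    return sum(int(count) for _label, count in matches)

def needs_recovery(report: str) -> bool:
    """True when any failing or corrupt page was reported."""
    return dbv_problem_pages(report) > 0
```

A report without any summary is treated as an error rather than as a clean file, since a missing summary usually means the file could not be opened at all.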
RMAN Backups: We run a weekly full RMAN backup, either hot or
cold, of all our databases. We use Tivoli Storage Manager (TSM) as our
enterprise backup solution. The RMAN backups go directly to tape using Tivoli
Data Protection for Oracle. Each RMAN backup is identified by its own
tag name. The Oracle software is backed up as a regular TSM backup.
Recovery using RMAN backups: We had to go through different types of recovery
to recover these six databases.
Control file and Data File Recovery: The first database had lost three index data files and all of its control files. We restored the control file from the latest backup, then restored the data files that were reported as corrupt. The last step was to recover the database. The recover database command looked for the archive logs on disk and, if they were not available there, automatically restored the ones it needed from backup. We had developed our own recovery scripts and tested them against the different scenarios, so all we had to do was run those scripts, sit back and monitor the log files.
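The sequence above (control file, then the damaged data files, then recovery) can be sketched as a small generator for the RMAN command text. This is not the script we actually ran; it is an illustration that assumes the instance is started NOMOUNT, RMAN is connected to the recovery catalog, and channels are configured elsewhere. The final open with resetlogs is included because recovery through a restored control file requires it. The function name and data file numbers are hypothetical.

```python
def controlfile_and_datafile_restore(datafile_ids):
    """Render an RMAN run block: restore control file, then selected data files."""
    if not datafile_ids:
        raise ValueError("at least one data file number is required")
    files = ", ".join(str(int(n)) for n in datafile_ids)
    return "\n".join([
        "run {",
        "  restore controlfile;",             # from the latest backup
        "  alter database mount;",
        f"  restore datafile {files};",       # only the files reported corrupt
        "  recover database;",                # pulls archive logs from backup if needed
        "  alter database open resetlogs;",   # needed after using a backup control file
        "}",
    ])

print(controlfile_and_datafile_restore([7, 8, 9]))
```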
Log Based Recovery: The second database had lost one data file. We
thought it would be nice to try time-based recovery. However, for some reason,
time-based recovery would not pick up the backup that we wanted it to.
Therefore, we ended up doing a log sequence based recovery. We took from the
alert log the last log sequence number written before the database became
corrupt and specified it in the recovery script. It worked like a charm and
we were able to recover the database.
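A log sequence based recovery of this kind can be sketched as follows. One detail worth showing is that the until clause is exclusive: recovery stops before the named sequence, so the script asks for one past the last good log. The keyword is a parameter because, as far as we know, older RMAN releases spell it `logseq` while later ones use `sequence`; treat that, the function name and the sequence number in the example as assumptions to verify against your release.

```python
def until_sequence_recovery(last_good_sequence, thread=1, keyword="sequence"):
    """Render an RMAN run block that recovers through last_good_sequence."""
    if keyword not in ("sequence", "logseq"):
        raise ValueError("keyword must be 'sequence' or 'logseq'")
    # The until clause is exclusive, so stop at the sequence after the last good log.
    stop_before = int(last_good_sequence) + 1
    return "\n".join([
        "run {",
        f"  set until {keyword} {stop_before} thread {int(thread)};",
        "  restore database;",
        "  recover database;",
        "  alter database open resetlogs;",   # incomplete recovery needs resetlogs
        "}",
    ])

# Hypothetical example: the alert log shows 1234 as the last good sequence.
print(until_sequence_recovery(1234))
```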
Tag Name Based Recovery: The third scenario was an interesting one. The
application needed the database restored from the backup prior to our latest
one. We thought the easiest option was a time-based recovery, and we gave a
time immediately after that backup had completed. However, for some reason
RMAN would restore only the latest backup. The other option was to recover
until a log sequence number. This did not work either: there had been very few
log switches, and the restore would pick up either the backup before the one
we needed or the one after it. The only remaining option was to restore using
the tag name. This took us to the exact backup, as we had a standard tag name
for all our backups.
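Restoring by tag can be sketched in two steps: pick the tag of the backup you actually want, then name it explicitly in the restore. The tag names, dates and function names below are made up for illustration; our real naming standard is not shown here.

```python
from datetime import datetime

def pick_tag(backups, needed_before):
    """Return the tag of the newest backup completed at or before needed_before.

    backups is a list of (tag, completion_time) pairs.
    """
    candidates = [(done, tag) for tag, done in backups if done <= needed_before]
    if not candidates:
        raise LookupError("no backup completed before the requested time")
    return max(candidates)[1]

def restore_from_tag(tag):
    """Render an RMAN run block that restores the database from one tagged backup."""
    if not tag or "'" in tag:
        raise ValueError("tag must be non-empty and must not contain a quote")
    return "\n".join([
        "run {",
        f"  restore database from tag '{tag}';",
        "}",
    ])

# Hypothetical weekly backups; we want the one before the latest.
backups = [
    ("FULL_WK01", datetime(2004, 1, 4, 2, 0)),
    ("FULL_WK02", datetime(2004, 1, 11, 2, 0)),
]
print(restore_from_tag(pick_tag(backups, datetime(2004, 1, 10))))
```

Naming the tag removes any guesswork about which backup RMAN selects. The recovery step that follows the restore depends on how far forward the application needs the data to be rolled.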
Full Database Recovery: The fourth scenario was a full restore of a 40 GB
database. Our initial approach was to divide this into three separate
restores: the system datafiles, the data datafiles, and the index datafiles.
We started the three restores in parallel, hoping that the restore would
complete quickly. After they had been running for three hours, we realized
that a restore goes through every backup piece to restore a particular data
file. We had about seven backup pieces of 2 GB each. While one session was
reading a backup piece, another session sat in a media wait because it needed
the same piece. Finally, we canceled the three sessions and simply restored
the full database. In this second attempt, we increased the number of channels
for the restore from two to four. As a result of these changes, we were able
to restore the 40 GB database and recover it by applying the archive logs in
one hour.
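The second attempt, a single restore with more channels, can be sketched as one run block that allocates all the tape channels itself, so RMAN can spread the backup pieces across them instead of having separate sessions compete for the same piece. The channel names and the optional parms string are placeholders, and the function name is hypothetical; the media manager settings for a given site would go in parms.

```python
def full_restore_script(channels=4, parms=None):
    """Render one RMAN run block that restores the whole database over N tape channels."""
    if channels < 1:
        raise ValueError("at least one channel is required")
    lines = ["run {"]
    for i in range(1, channels + 1):
        line = f"  allocate channel t{i} type 'sbt_tape'"
        if parms:
            # Site-specific media manager settings, passed through unchanged.
            line += f" parms '{parms}'"
        lines.append(line + ";")
    lines += [
        "  restore database;",
        "  recover database;",   # applies the archive logs after the restore
        "}",
    ]
    return "\n".join(lines)

print(full_restore_script(channels=4))
```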
Previous Incarnation Recovery: The fifth recovery scenario was another
challenge.
After restoring and opening the database with resetlogs, we realized that we
still had some corrupt datafiles in the database. The solution was to go back
to a previous incarnation and restore the database. We followed the
documentation by setting RMAN to the previous incarnation and started the
restore. To our surprise, it worked without any issues. It went to the previous
incarnation, restored the datafiles and recovered the database.
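Going back to a previous incarnation can be sketched as below. The incarnation key comes from `list incarnation of database;` in RMAN. The sketch assumes the instance is started NOMOUNT and RMAN is connected to the recovery catalog; the key, the optional SCN and the function name are hypothetical, and the exact procedure should be checked against the documentation for the release in use.

```python
def previous_incarnation_restore(incarnation_key, until_scn=None):
    """Render RMAN commands to reset to an earlier incarnation and restore from it."""
    lines = [
        f"reset database to incarnation {int(incarnation_key)};",
        "run {",
    ]
    if until_scn is not None:
        # Optional stopping point inside the old incarnation.
        lines.append(f"  set until scn {int(until_scn)};")
    lines += [
        "  restore controlfile;",
        "  alter database mount;",
        "  restore database;",
        "  recover database;",
        "  alter database open resetlogs;",
        "}",
    ]
    return "\n".join(lines)

# Hypothetical incarnation key taken from the list incarnation output.
print(previous_incarnation_restore(2))
```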
Cataloging Archive Logs: Last but not least, we ran into a scenario where the
RMAN catalog did not know about archive logs that had been created after the
RMAN backup ran. We had to catalog all of those archive logs manually. As
there were only a few of them, we were able to catalog them quickly and
proceed with the recovery of the database.
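With more than a handful of archive logs, typing the catalog commands by hand gets tedious. A minimal sketch that renders one `catalog archivelog` command per file follows; the paths and the function name are hypothetical.

```python
def catalog_archivelogs(paths):
    """Render one RMAN catalog command per archive log file, in sorted order."""
    commands = []
    for path in sorted(paths):
        if "'" in path:
            raise ValueError(f"path must not contain a quote: {path}")
        commands.append(f"catalog archivelog '{path}';")
    return "\n".join(commands)

# Hypothetical archive log destination and file names.
print(catalog_archivelogs([
    "/u01/arch/PROD/arch_1_1236.arc",
    "/u01/arch/PROD/arch_1_1235.arc",
]))
```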
Lessons Learned: Run a periodic resync of the target database with
the catalog database, preferably twice a day, so that all newly created
archive logs are tracked in the catalog. This applies to databases that
run in archive log mode and generate a large number of archive logs.
Time-based recovery did not work as we expected: it always restored the latest
control file backup instead of the control file that was backed up around the
time we were restoring to. The next lesson concerned Tivoli Data Protection
for Oracle. The version we were using, TDP 2.2.1, would not restore backup
pieces that were 2 GB in size. The solution was to upgrade to TDP 5.2. We were
lucky to have tested this one week prior to the disaster.
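The twice-daily resync from the first lesson is easy to script. The sketch below builds the RMAN command file and the command line to run it; the connect string, file names and cron schedule are placeholders, and the function names are hypothetical. Note that a password on the command line is visible to other users on the box, so a real job would need a safer way to supply it.

```python
def resync_command_file():
    """Contents of the RMAN command file; RMAN exits when the file ends."""
    return "resync catalog;\n"

def rman_argv(catalog_connect, cmdfile, logfile):
    """Build the argument list for running the command file against the catalog."""
    return [
        "rman", "target", "/",
        "catalog", catalog_connect,
        f"cmdfile={cmdfile}",
        f"log={logfile}",
    ]

# Hypothetical schedule for cron: 06:00 and 18:00 every day.
CRON_SCHEDULE = "0 6,18 * * *"

print(" ".join(rman_argv("rman_user@catdb", "resync.rcv", "resync.log")))
```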
We wished we had: OEM can be used as a GUI tool to restore RMAN
backups from disk. However, if the backups go to tape, as they did in our
case, there is no GUI tool available to restore the database. Everything has
to be run through scripts or from the command line. The second item
on the wish list concerns the dbverify program. We had to run dbverify for all
six databases on the box to determine which datafiles had been affected. It is
a very CPU-intensive process, and we wished it had a parallel option so that
it could check several files at once.
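Lacking a built-in parallel option, one workaround is to start several dbv processes at once, one per data file, from a small driver. This is a sketch rather than something we ran during the disaster: the worker count, block size and function names are assumptions, and since dbv is CPU intensive the worker count should stay below the number of CPUs on the box.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def dbv_argv(datafile, blocksize=8192):
    """Build the dbv command line for one data file."""
    return ["dbv", f"file={datafile}", f"blocksize={blocksize}"]

def verify_in_parallel(datafiles, workers=4, blocksize=8192, runner=subprocess.run):
    """Run dbv on several files at once; returns (path, exit code, report) per file."""
    def check(path):
        done = runner(dbv_argv(path, blocksize), capture_output=True, text=True)
        # Keep both streams so the report is captured wherever dbv writes it.
        return path, done.returncode, done.stdout + done.stderr

    # Threads are enough here because the real work happens in the dbv processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check, datafiles))
```

Results come back in the same order as the input list, so they can be matched against the list of data files for each database.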
Conclusion: We always make sure we have regular backups of our databases.
However, the real success of our backups depends on how effectively we are
able to restore the databases from them. Develop recovery scripts, and make
sure to test and practice the different recovery scenarios. We even wrote a
restore cookbook covering all the different recovery scenarios. After a new
incarnation of a database is created, make sure to run a full backup of the
database. Also, make sure to run a full export once the database is recovered.
It reads through the table data in the database and is a good way to check
whether corrupt blocks remain.
It was a very tough decision for us to let go of our regular
operating system backups and switch to RMAN. Now that we have survived a
disaster, we are confident that we made the right decision. Hats off to RMAN
for working the way it is supposed to.