Hands-on Oracle: Backup & Recovery Games
March 10, 2008
Games in Oracle? No, the games reference here is like the games in Olympic Games. The athletes competing in games at that level are serious about what they do, and you can pretty much bet the farm they are passionate about their sport or event as well. As an Oracle DBA, how serious and passionate are you about your job? Is job even the right word here? For serious practitioners, profession is probably a better choice. The games or events were going to play, more specifically, what we learn from them, in this article are practical and necessary skills a DBA should have at his or her disposal.
Loss of one or more special files
Special files in our world are the following: control files, online redo logs, archived redo logs, and datafiles. Lets say you have several control files, identified (of course) in the initialization parameter file (Ill use a pfile for this article for quicker editing). Youve read about it a million times: if you lose a control file, the recovery or fix consists of copying the (or a) good one into the errant files location, or remove the reference to the file from the parameter file. After a control file is lost (for whatever reason), what happens to the database?
1. Operations can still take place, having other control files online means you can survive the loss of one or more control files
2. You (the DBA) have to shutdown, restore the file (or remove it from the init.ora file) and startup
3. Oracle shuts down/the instance is terminated (i.e., you have no control over that; its happening whether or not you like it)
The reason I ask is this: what is the difference between control files and online redo logs in terms of multiplexing them? For instance, what happens to the database (or instance) if a member of the active group is lost? Does Oracle keep going, or do things come to a screeching halt? In the following experiment, Ill place control03.ctl onto a flash drive, bounce the instance, and then pull the flash drive out of the USB port to simulate media failure.
Confirmation of the new control file:
SQL> startup ORACLE instance started. Total System Global Area 289406976 bytes Fixed Size 1290184 bytes Variable Size 104857656 bytes Database Buffers 176160768 bytes Redo Buffers 7098368 bytes Database mounted. Database opened. SQL> col name for a52 SQL> select name from v$controlfile; NAME ---------------------------------------------------- D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\CONTROL01.CTL D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\CONTROL02.CTL F:\ORACLE\CONTROL03.CTL
Now Im ready to disconnect the flash drive. With no direct warning (nothing raised to the user interface until an operation is attempted), the instance crashes.
SQL> select name from v$controlfile; select name from v$controlfile * ERROR at line 1: ORA-03113: end-of-file on communication channel
The alert log shows the following (plus much more; this is the relevant part):
Thu Mar 06 12:41:15 2008 Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_ckpt_2756.trc: ORA-00206: error in writing (block 3, # blocks 1) of control file ORA-00202: control file: 'F:\ORACLE\CONTROL03.CTL' ORA-27072: File I/O error OSD-04008: WriteFile() failure, unable to write to file O/S-Error: (OS 1006) The volume for a file has been externally altered so that the opened file is no longer valid. some more messages Thu Mar 06 12:41:29 2008 Instance terminated by CKPT, pid = 2756
The fix part is easy, well known, and well documented elsewhere, but were you absolutely certain what actually happens when a control file is lost?
Coming back to online redo log files, is the loss of a member survivable, that is, does the instance crash like it did for loss of a control file? The answer is: it depends. A group can have one or more members. Lets look at #members = 1. Like the previous example, Ill create a group with a member on a flash drive, and make that group the active group (what is the meaning of STATUS for the values of ACTIVE, CURRENT, INACTIVE and UNUSED?). Once the group is active, Ill pull the plug. More than one way to do this, but what is the current redo log group? From the alert log:
Just like the loss of a control file, the instance is terminated.
SQL> select * from v$logfile; GROUP# STATUS TYPE MEMBER ---------- ------- ------- ------------------------------------------------ 7 ONLINE D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO07.LOG 8 ONLINE D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO08.LOG 10 ONLINE F:\ORACLE\REDO10.LOG 9 ONLINE D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO09.LOG SQL> select * from v$logfile; select * from v$logfile * ERROR at line 1: ORA-03113: end-of-file on communication channel
In this case, it was LGWR that was responsible for terminating the instance. Note that it was CKPT in the other example. What gets recorded in the control file? The SCN. Which process stamps the control file with the SCN? The checkpoint process.
If trying this at home, running the database is noarchivelog mode, and using only one member in the current group, what will it take to be able to open the database again?
Had that group been multiplexed, and the same file was pulled from the system, what happens? Life goes on, and the alert log is written to with a message about the missing member. Lets add a second member to group 10, placing it on a different (i.e., permanent) drive, and then pull the flash drive again (that should be a hint about what it takes to open the database).
Here is what the alert log shows:
Thu Mar 06 14:51:45 2008 Thread 1 advanced to log sequence 216 Current log# 10 seq# 216 mem# 0: F:\ORACLE\REDO10.LOG Current log# 10 seq# 216 mem# 1: D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO11.LOG Thu Mar 06 14:53:36 2008 Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2224.trc: ORA-00345: redo log write error block 23 count 2 ORA-00312: online log 10 thread 1: 'F:\ORACLE\REDO10.LOG' ORA-27072: File I/O error OSD-04008: WriteFile() failure, unable to write to file O/S-Error: (OS 1006) The volume for a file has been externally altered so that the opened file is no longer valid. Thu Mar 06 14:53:36 2008 Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2224.trc: ORA-00343: too many errors, log member closed ORA-00346: log member marked as STALE ORA-00312: online log 10 thread 1: 'F:\ORACLE\REDO10.LOG'
The best part about the messages is the lack of two words: Instance terminated. After a shutdown and startup, the alert log will still complain about the missing file, but the bottom line is that the database will be open.
Thu Mar 06 14:56:51 2008 ALTER DATABASE OPEN Thu Mar 06 14:56:52 2008 Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2288.trc: ORA-00313: open failed for members of log group 10 of thread 1 Thu Mar 06 14:56:52 2008 Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2288.trc: ORA-00313: open failed for members of log group 10 of thread 1 Thu Mar 06 14:56:52 2008 Thread 1 opened at log sequence 216 Current log# 10 seq# 216 mem# 1: D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO11.LOG Successful open of redo thread 1 some other messages Thu Mar 06 14:56:59 2008 Completed: ALTER DATABASE OPEN
Loss of datafiles
So far, recovering from loss of a control file or redo log file is pretty close to losing no data. The single member redo log group example is a case where data loss can take place. With respect to datafiles, there are two kinds: the ones that matter to you (your data) and the ones that matter to Oracle (its data). To clarify this categorization, here are two questions:
1. What happens if one your tablespaces loses a datafile?
2. What happens if the SYSTEM tablespace, as an example of an owned-by-Oracle tablespace, loses a datafile?
If you answered those questions as is, what was your frame of reference with respect to archiving? If you didnt consider archiving when formulating your answers, then you stand a 50-50% chance of having been wrong.
If operating in NOARCHIVELOG mode, and you lose a datafile any datafile its game over, instance terminates. If operating in ARCHIVELOG mode, then the answers to the two questions above are different: loss of your datafile means the instance keeps going, loss of an Oracle specific datafile means the instance is terminated. The section titled Recovering After the Loss of Datafiles: Scenarios in the Backup and Recovery Advanced Users Guide spells this out quite nicely.
Pretty much all of the backup and recovery scenarios related to loss of special files can be replicated on a PC using a very inexpensive flash drive. Whether on an AIX 5300L gazillion CPU machine with a gazillion megabytes of RAM or on your personal desktop/laptop that just barely runs Oracle, the principles are virtually the same. Give it a try and see for yourself. These are drills best practiced without the pressure of having to do these things when your instance is crashed and time is money.