Games in Oracle? No, the
games reference here is like the “games” in Olympic Games. The athletes
competing in games at that level are serious about what they do, and you can
pretty much bet the farm they are passionate about their sport or event as
well. As an Oracle DBA, how serious – and passionate – are you about your job?
Is job even the right word here? For serious practitioners, profession is
probably a better choice. The “games” or events we’re going to play, more
specifically, what we learn from them, in this article are practical and
necessary skills a DBA should have at his or her disposal.
Loss of one or more special files
Special files in our world
are the following: control files, online redo logs, archived redo logs, and
datafiles. Let’s say you have several control files, identified (of course) in
the initialization parameter file (I’ll use a pfile for this article for
quicker editing). You’ve read about it a million times: if you lose a control
file, the recovery or fix consists of copying the (or “a”) good one into the
errant file’s location, or remove the reference to the file from the parameter
file. After a control file is lost (for whatever reason), what happens to the database?
1.
Operations can still take place,
having other control files “online” means you can survive the loss of one or
more control files
2.
You (the DBA) have to shutdown,
restore the file (or remove it from the init.ora file) and startup
3.
Oracle shuts down/the instance is
terminated (i.e., you have no control over that; it’s happening whether or not
you like it)
The reason I ask is this:
what is the difference between control files and online redo logs in terms of
multiplexing them? For instance, what happens to the database (or instance) if
a member of the active group is lost? Does Oracle keep going, or do things come
to a screeching halt? In the following experiment, I’ll place control03.ctl
onto a flash drive, bounce the instance, and then pull the flash drive out of
the USB port to simulate media failure.
Current listing of control |
|
New listing of control |
|
Confirmation of the new
control file:
SQL> startup
ORACLE instance started.Total System Global Area 289406976 bytes
Fixed Size 1290184 bytes
Variable Size 104857656 bytes
Database Buffers 176160768 bytes
Redo Buffers 7098368 bytes
Database mounted.
Database opened.
SQL> col name for a52
SQL> select name from v$controlfile;NAME
—————————————————-
D:ORACLEPRODUCT10.2.0ORADATADB10CONTROL01.CTL
D:ORACLEPRODUCT10.2.0ORADATADB10CONTROL02.CTL
F:ORACLECONTROL03.CTL
Now I’m ready to disconnect
the flash drive. With no direct warning (nothing raised to the user interface
until an operation is attempted), the instance crashes.
SQL> select name from v$controlfile;
select name from v$controlfile
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
The alert log shows the
following (plus much more; this is the relevant part):
Thu Mar 06 12:41:15 2008
Errors in file d:oracleproduct10.2.0admindb10bdumpdb10_ckpt_2756.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: ‘F:ORACLECONTROL03.CTL’
ORA-27072: File I/O error
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 1006) The volume for a file has been externally altered so that the opened file is no longer valid.
…some more messages…
Thu Mar 06 12:41:29 2008
Instance terminated by CKPT, pid = 2756
The fix part is easy, well
known, and well documented elsewhere, but were you absolutely certain what
actually happens when a control file is lost?
Fix the problem, reconnect |
|
Coming back to online redo
log files, is the loss of a member survivable, that is, does the instance crash
like it did for loss of a control file? The answer is: it depends. A group can
have one or more members. Let’s look at #members = 1. Like the previous
example, I’ll create a group with a member on a flash drive, and make that
group the active group (what is the meaning of STATUS for the values of ACTIVE,
CURRENT, INACTIVE and UNUSED?). Once the group is active, I’ll pull the plug.
More than one way to do this, but what is the current redo log group? From the
alert log:
Extract from the alert log |
|
Just like the loss of a
control file, the instance is terminated.
SQL> select * from v$logfile;
GROUP# STATUS TYPE MEMBER
———- ——- ——- ————————————————
7 ONLINE D:ORACLEPRODUCT10.2.0ORADATADB10REDO07.LOG
8 ONLINE D:ORACLEPRODUCT10.2.0ORADATADB10REDO08.LOG
10 ONLINE F:ORACLEREDO10.LOG
9 ONLINE D:ORACLEPRODUCT10.2.0ORADATADB10REDO09.LOG
SQL> select * from v$logfile;
select * from v$logfile
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
In this case, it was LGWR
that was responsible for terminating the instance. Note that it was CKPT in the
other example. What gets recorded in the control file? The SCN. Which process
stamps the control file with the SCN? The checkpoint process.
If trying this at home,
running the database is noarchivelog mode, and using only one member in the current
group, what will it take to be able to open the database again?
Had that group been
multiplexed, and the same file was “pulled” from the system, what happens? Life
goes on, and the alert log is written to with a message about the missing
member. Let’s add a second member to group 10, placing it on a different (i.e.,
permanent) drive, and then pull the flash drive again (that should be a hint
about what it takes to open the database).
Here is what the alert log
shows:
Thu Mar 06 14:51:45 2008
Thread 1 advanced to log sequence 216
Current log# 10 seq# 216 mem# 0: F:ORACLEREDO10.LOG
Current log# 10 seq# 216 mem# 1: D:ORACLEPRODUCT10.2.0ORADATADB10REDO11.LOG
Thu Mar 06 14:53:36 2008
Errors in file d:oracleproduct10.2.0admindb10bdumpdb10_lgwr_2224.trc:
ORA-00345: redo log write error block 23 count 2
ORA-00312: online log 10 thread 1: ‘F:ORACLEREDO10.LOG’
ORA-27072: File I/O error
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 1006) The volume for a file has been externally altered so that the opened file is no longer valid.
Thu Mar 06 14:53:36 2008
Errors in file d:oracleproduct10.2.0admindb10bdumpdb10_lgwr_2224.trc:
ORA-00343: too many errors, log member closed
ORA-00346: log member marked as STALE
ORA-00312: online log 10 thread 1: ‘F:ORACLEREDO10.LOG’
The best part about the
messages is the lack of two words: Instance terminated. After a shutdown and
startup, the alert log will still “complain” about the missing file, but the
bottom line is that the database will be open.
Thu Mar 06 14:56:51 2008
ALTER DATABASE OPEN
Thu Mar 06 14:56:52 2008
Errors in file d:oracleproduct10.2.0admindb10bdumpdb10_lgwr_2288.trc:
ORA-00313: open failed for members of log group 10 of thread 1
Thu Mar 06 14:56:52 2008
Errors in file d:oracleproduct10.2.0admindb10bdumpdb10_lgwr_2288.trc:
ORA-00313: open failed for members of log group 10 of thread 1
Thu Mar 06 14:56:52 2008
Thread 1 opened at log sequence 216
Current log# 10 seq# 216 mem# 1: D:ORACLEPRODUCT10.2.0ORADATADB10REDO11.LOG
Successful open of redo thread 1
…some other messages…
Thu Mar 06 14:56:59 2008
Completed: ALTER DATABASE OPEN
Loss of datafiles
So far, recovering from loss
of a control file or redo log file is pretty close to losing no data. The
single member redo log group example is a case where data loss can take place.
With respect to datafiles, there are two kinds: the ones that matter to you (your
data) and the ones that matter to Oracle (its data). To clarify this
categorization, here are two questions:
1.
What happens if one your
tablespaces loses a datafile?
2.
What happens if the SYSTEM
tablespace, as an example of an owned-by-Oracle tablespace, loses a datafile?
If you answered those
questions as is, what was your frame of reference with respect to archiving? If
you didn’t consider archiving when formulating your answers, then you stand a
50-50% chance of having been wrong.
If operating in NOARCHIVELOG
mode, and you lose a datafile – any datafile – it’s game over, instance
terminates. If operating in ARCHIVELOG mode, then the answers to the two
questions above are different: loss of your datafile means the instance keeps
going, loss of an Oracle specific datafile means the instance is terminated.
The section titled “Recovering After the Loss of Datafiles: Scenarios” in the Backup
and Recovery Advanced User’s Guide spells this out quite nicely.
In Closing
Pretty much all of the
backup and recovery scenarios related to loss of special files can be
replicated on a PC using a very inexpensive flash drive. Whether on an AIX
5300L gazillion CPU machine with a gazillion megabytes of RAM or on your
personal desktop/laptop that just barely runs Oracle, the principles are
virtually the same. Give it a try and see for yourself. These are drills best
practiced without the pressure of having to do these things when your instance
is crashed and time is money.