Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

Oracle

Posted Mar 10, 2008

Hands-on Oracle: Backup & Recovery Games

By Steve Callan

Games in Oracle? No, the games reference here is like the “games” in Olympic Games. The athletes competing in games at that level are serious about what they do, and you can pretty much bet the farm they are passionate about their sport or event as well. As an Oracle DBA, how serious – and passionate – are you about your job? Is job even the right word here? For serious practitioners, profession is probably a better choice. The “games” or events we’re going to play, more specifically, what we learn from them, in this article are practical and necessary skills a DBA should have at his or her disposal.

Loss of one or more special files

Special files in our world are the following: control files, online redo logs, archived redo logs, and datafiles. Let’s say you have several control files, identified (of course) in the initialization parameter file (I’ll use a pfile for this article for quicker editing). You’ve read about it a million times: if you lose a control file, the recovery or fix consists of copying the (or “a”) good one into the errant file’s location, or remove the reference to the file from the parameter file. After a control file is lost (for whatever reason), what happens to the database?

1.  Operations can still take place, having other control files “online” means you can survive the loss of one or more control files

2.  You (the DBA) have to shutdown, restore the file (or remove it from the init.ora file) and startup

3.  Oracle shuts down/the instance is terminated (i.e., you have no control over that; it’s happening whether or not you like it)

The reason I ask is this: what is the difference between control files and online redo logs in terms of multiplexing them? For instance, what happens to the database (or instance) if a member of the active group is lost? Does Oracle keep going, or do things come to a screeching halt? In the following experiment, I’ll place control03.ctl onto a flash drive, bounce the instance, and then pull the flash drive out of the USB port to simulate media failure.

Current listing of control files

 
*.control_files='D:\oracle\product\10.2.0/oradata/db10/\control01.ctl',
                'D:\oracle\product\10.2.0/oradata/db10/\control02.ctl',
                'D:\oracle\product\10.2.0/oradata/db10/\control03.ctl'

New listing of control files

 
*.control_files='D:\oracle\product\10.2.0/oradata/db10/\control01.ctl',
                'D:\oracle\product\10.2.0/oradata/db10/\control02.ctl',
                'F:\oracle\control03.ctl'

Confirmation of the new control file:

SQL> startup
ORACLE instance started.
 
Total System Global Area  289406976 bytes
Fixed Size                  1290184 bytes
Variable Size             104857656 bytes
Database Buffers          176160768 bytes
Redo Buffers                7098368 bytes
Database mounted.
Database opened.
SQL> col name for a52
SQL> select name from v$controlfile;
 
NAME
----------------------------------------------------
D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\CONTROL01.CTL
D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\CONTROL02.CTL
F:\ORACLE\CONTROL03.CTL

Now I’m ready to disconnect the flash drive. With no direct warning (nothing raised to the user interface until an operation is attempted), the instance crashes.

SQL> select name from v$controlfile;
select name from v$controlfile
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel

The alert log shows the following (plus much more; this is the relevant part):

Thu Mar 06 12:41:15 2008
Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_ckpt_2756.trc:
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: 'F:\ORACLE\CONTROL03.CTL'
ORA-27072: File I/O error
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 1006) The volume for a file has been externally altered so that the opened file is no longer valid.
…some more messages…
Thu Mar 06 12:41:29 2008
Instance terminated by CKPT, pid = 2756

The fix part is easy, well known, and well documented elsewhere, but were you absolutely certain what actually happens when a control file is lost?

Fix the problem, reconnect and startup

 
SQL> prompt Remove the reference to control03
SQL> startup
ORA-24324: service handle not initialized
ORA-01041: internal error. hostdef extension doesn't exist
SQL> conn sys/oracle as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area  289406976 bytes
Fixed Size                  1290184 bytes
Variable Size             104857656 bytes
Database Buffers          176160768 bytes
Redo Buffers                7098368 bytes
Database mounted.
Database opened.
SQL>

Coming back to online redo log files, is the loss of a member survivable, that is, does the instance crash like it did for loss of a control file? The answer is: it depends. A group can have one or more members. Let’s look at #members = 1. Like the previous example, I’ll create a group with a member on a flash drive, and make that group the active group (what is the meaning of STATUS for the values of ACTIVE, CURRENT, INACTIVE and UNUSED?). Once the group is active, I’ll pull the plug. More than one way to do this, but what is the current redo log group? From the alert log:

Extract from the alert log

 
Thu Mar 06 13:22:11 2008
Thread 1 advanced to log sequence 211
  Current log# 10 seq# 211 mem# 0: F:\ORACLE\REDO10.LOG

Just like the loss of a control file, the instance is terminated.

SQL> select * from v$logfile;
    GROUP# STATUS  TYPE    MEMBER                                               
---------- ------- ------- ------------------------------------------------
         7         ONLINE  D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO07.LOG
         8         ONLINE  D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO08.LOG
        10         ONLINE  F:\ORACLE\REDO10.LOG                                 
         9         ONLINE  D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO09.LOG
SQL> select * from v$logfile;
select * from v$logfile
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel

In this case, it was LGWR that was responsible for terminating the instance. Note that it was CKPT in the other example. What gets recorded in the control file? The SCN. Which process stamps the control file with the SCN? The checkpoint process.

If trying this at home, running the database is noarchivelog mode, and using only one member in the current group, what will it take to be able to open the database again?

Had that group been multiplexed, and the same file was “pulled” from the system, what happens? Life goes on, and the alert log is written to with a message about the missing member. Let’s add a second member to group 10, placing it on a different (i.e., permanent) drive, and then pull the flash drive again (that should be a hint about what it takes to open the database).

Here is what the alert log shows:

Thu Mar 06 14:51:45 2008
Thread 1 advanced to log sequence 216
  Current log# 10 seq# 216 mem# 0: F:\ORACLE\REDO10.LOG
  Current log# 10 seq# 216 mem# 1: D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO11.LOG
Thu Mar 06 14:53:36 2008
Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2224.trc:
ORA-00345: redo log write error block 23 count 2
ORA-00312: online log 10 thread 1: 'F:\ORACLE\REDO10.LOG'
ORA-27072: File I/O error
OSD-04008: WriteFile() failure, unable to write to file
O/S-Error: (OS 1006) The volume for a file has been externally altered so that the opened file is no longer valid.
Thu Mar 06 14:53:36 2008
Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2224.trc:
ORA-00343: too many errors, log member closed
ORA-00346: log member marked as STALE
ORA-00312: online log 10 thread 1: 'F:\ORACLE\REDO10.LOG'

The best part about the messages is the lack of two words: Instance terminated. After a shutdown and startup, the alert log will still “complain” about the missing file, but the bottom line is that the database will be open.

Thu Mar 06 14:56:51 2008
ALTER DATABASE OPEN
Thu Mar 06 14:56:52 2008
Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2288.trc:
ORA-00313: open failed for members of log group 10 of thread 1
Thu Mar 06 14:56:52 2008
Errors in file d:\oracle\product\10.2.0\admin\db10\bdump\db10_lgwr_2288.trc:
ORA-00313: open failed for members of log group 10 of thread 1
Thu Mar 06 14:56:52 2008
Thread 1 opened at log sequence 216
  Current log# 10 seq# 216 mem# 1: D:\ORACLE\PRODUCT\10.2.0\ORADATA\DB10\REDO11.LOG
Successful open of redo thread 1
…some other messages…
Thu Mar 06 14:56:59 2008
Completed: ALTER DATABASE OPEN

Loss of datafiles

So far, recovering from loss of a control file or redo log file is pretty close to losing no data. The single member redo log group example is a case where data loss can take place. With respect to datafiles, there are two kinds: the ones that matter to you (your data) and the ones that matter to Oracle (its data). To clarify this categorization, here are two questions:

1.  What happens if one your tablespaces loses a datafile?

2.  What happens if the SYSTEM tablespace, as an example of an owned-by-Oracle tablespace, loses a datafile?

If you answered those questions as is, what was your frame of reference with respect to archiving? If you didn’t consider archiving when formulating your answers, then you stand a 50-50% chance of having been wrong.

If operating in NOARCHIVELOG mode, and you lose a datafile – any datafile – it’s game over, instance terminates. If operating in ARCHIVELOG mode, then the answers to the two questions above are different: loss of your datafile means the instance keeps going, loss of an Oracle specific datafile means the instance is terminated. The section titled “Recovering After the Loss of Datafiles: Scenarios” in the Backup and Recovery Advanced User’s Guide spells this out quite nicely.

In Closing

Pretty much all of the backup and recovery scenarios related to loss of special files can be replicated on a PC using a very inexpensive flash drive. Whether on an AIX 5300L gazillion CPU machine with a gazillion megabytes of RAM or on your personal desktop/laptop that just barely runs Oracle, the principles are virtually the same. Give it a try and see for yourself. These are drills best practiced without the pressure of having to do these things when your instance is crashed and time is money.

» See All Articles by Columnist Steve Callan



Oracle Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
Oracle Forum
Topic By Replies Updated
Oracle Data Mining: Classification jan.hasller 0 July 5th, 07:19 AM
Find duplicates - Unique IDs Lava 5 July 2nd, 08:30 AM
no matching unique or primary key rcanter 1 April 25th, 12:32 PM
Update values of one table based on condition of values in other table using Trigger Gladiator 3 February 29th, 06:01 PM