Synopsis.
Data Guards capability to recover a mission-critical database upon the loss of
the entire production site is at the heart of Oracle 11gs disaster recovery
features. This article the sixth in this ongoing series explores how to
manually fail over a production database to its corresponding physical standby
database, as well as reinstate a failed primary database to a physical
standby.
The prior
article in this series demonstrated:
-
How to leverage Oracle 11g Data Guard Real Time
Query features
-
How Oracle 11g
Data Guard snapshot standby database
features simultaneously provide disaster recovery and quality assurance testing
This
article will show how to:
-
Manually
initiate a failover operation when the primary database is no longer
accessible
-
Reinstate
a failed primary database by transforming it back into a physical standby
database
Pondering the Unthinkable: Failover
I
dont think Ive ever met an Oracle DBA whos been thrilled to discuss the
possible loss of her complete production database site, even if she had
simulated that loss and practiced the required steps to full database recovery.
But lets face it: Even if a DBA follows Oracles recommended best practices
precisely to guard against data loss, theres always a miniscule chance that
the production database site could become untenable through no fault of our
own. Thankfully, once an Oracle 11g
Data Guard standby environment has been enabled, Oracle 11g offers excellent survivability of a
production database while simultaneously insuring that any data loss is limited
depending on the data protection mode selected.
It Cant Hurt To Ask... Though it might
seem to be inappropriate at the time, the first question that Ill typically
ask in an apparent failover situation is Do we really need to perform a
failover in this situation? There may be a way around the apparent loss of the
primary database, or its possible that the apparent disaster is really not
that disastrous. For example, if the production database is unavailable for
only a few moments, or maybe even an hour, during off-peak business hours, is
it really a problem, or just a
matter of perception and extra paperwork? The reason Im not afraid to challenge
the necessity of failover is that there are some potentially serious
consequences of completing the failover operation:
-
The original
production database is toast. The primary database essentially
becomes just a bunch of servers and disk drives at this point, and it may not
be restorable to physical standby state until a full backup can be obtained
using the new primary database as its source.
-
The physical
standby becomes the primary. Unless there is a second physical
standby already configured, I now have a single point of failure in my
production system, and my top priority must be re-establishment of the
primary-standby relationship.
-
The physical
standby may be underpowered for peak business hours. Lets be
honest: Many organizations look upon a physical standby environment as a
necessary evil in terms of hardware and licensing costs, and my new primary database
may soon become overwhelmed by peak-time transaction volumes.
Detecting a Failover Situation. Once Ive
established that its truly necessary to continue the failover operation, Ill
use Data Guard Broker (DGB) to
assist in the transition. To illustrate how DGB detects a failover situation,
Ill start with the ORCL_PRIMARY and ORCL_STDBY1 databases both
fulfilling their initial intended respective roles of primary database and
counterpart physical standby database. The DGMGRL utility reflects this as
well, and it shows that everything is copasetic:
[oracle@11gStdby ~]$ dgmgrl
DGMGRL for Linux: Version 11.1.0.6.0 - Production
Copyright (c) 2000, 2005, Oracle. All rights reserved.
Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys/oracle
Connected.
DGMGRL> show configuration
Configuration
Name: MAA_orcl
Enabled: YES
Protection Mode: MaxPerformance
Databases:
orcl_primary - Primary database
orcl_stdby1 - Physical standby database
Fast-Start Failover: DISABLED
Current status for "MAA_orcl":
SUCCESS
To
simulate the necessity for a failover to the physical standby database, Ill
simply issue the kill -9 <pid> command
against the Server Monitor (SMON)
background process for the primary database. The primary databases alert log
detects this almost immediately, of course:
. . .
Tue Aug 25 18:54:10 2009
Errors in file /u01/app/oracle/diag/rdbms/orcl_primary/orcl_primary/trace/orcl_primary_pmon_6166.trc:
ORA-00474: SMON process terminated with error
PMON (ospid: 6166): terminating the instance due to error 474
Instance terminated by PMON, pid = 6166
. . .
The
corresponding physical standby databases Remote
File Server (RFS) background process also
detects the loss of connectivity to the primary database, as reflected in that
databases alert log:
. . .
Tue Aug 25 18:54:49 2009
RFS[2]: Possible network disconnect with primary database
Tue Aug 25 18:54:49 2009
RFS[1]: Possible network disconnect with primary database
Tue Aug 25 18:55:49 2009
. . .
To
ascertain whats wrong here, Ill immediately verify the Data Guard
configuration by connecting via DGMGRL and run the SHOW CONFIGURATION
and SHOW
DATABASE commands:
DGMGRL> show configuration verbose;
Configuration
Name: MAA_orcl
Enabled: YES
Protection Mode: MaxPerformance
Databases:
orcl_primary - Primary database
orcl_stdby1 - Physical standby database
Fast-Start Failover: DISABLED
Current status for "MAA_orcl":
Error: ORA-16625: cannot reach the database
Failing Over to the Physical Standby. Its
obvious that something is seriously wrong here! Once Ive asked my usual
probing questions and determined that the failover is indeed necessary, Ill
start the failover process with one simple DGB command, FAILOVER TO
<standby database>:
DGMGRL> failover to orcl_stdby1;
Performing failover NOW, please wait...
Failover succeeded, new primary is "orcl_stdby1"
DGMGRL> show configuration
Configuration
Name: MAA_orcl
Enabled: YES
Protection Mode: MaxPerformance
Databases:
orcl_stdby1 - Primary database
orcl_primary - Physical standby database (disabled)
Fast-Start Failover: DISABLED
Current status for "MAA_orcl":
SUCCESS
DGMGRL> show database orcl_stdby1
Database
Name: orcl_stdby1
Role: PRIMARY
Enabled: YES
Intended State: TRANSPORT-ON
Instance(s):
orcl_stdby1
Current status for "orcl_stdby1":
SUCCESS
During
the failover to the physical standby database, the Oracle 11g DGB performs the
following steps:
-
First, it validates that the target standby
database is ready to accept the primary role.
If its not, DGB will not allow the failover to continue until the DBA has
manually resolved any discrepancies.
-
It then halts
the Redo Apply process (MRP0) on the target standby
database once all unapplied redo has been applied successfully.
-
It completes the
failover to the physical standby database by opening the standby in
READ WRITE MODE.
-
Finally, DGB determines if there are any other
physical standby databases that need to be re-enabled; it then restarts redo
transport from the new primary database to any extant physical standby
databases.
The
corresponding alert log entries from the original standby database (now the
primary database) are displayed in Listing 6.1.
Have Any Data Been Lost? As with many
questions were asked as DBAs, the answer to this one is It depends. In this
case, its going to depend on which data
protection mode either Maximum
Performance, Maximum Availability, or Maximum Protection - that Ive chosen for the Data Guard
configuration. (See my prior
article on this topic for a comparison on these data protection options and
their corresponding effect on data loss.)
However,
theres one other factor to consider in a failover situation. The Data Guard FAILOVER TO
<standby database> command also offers an optional
directive, IMMEDIATE,
which instructs DGB to an immediate
failover instead of a complete
failover, and here are the implications:
-
During a COMPLETE failover (the default),
Oracle 11g will attempt to
recover the maximum amount of redo data based on the DG configuration's
specified protection mode. After a complete failover, DGB avoids disabling any standby databases
that are not the failover target. For obvious reasons, Oracle recommends using
complete failover.
-
During an IMMEDIATE failover, on the other
hand, Oracle 11g applies no
additional redo data to the standby database after
the failover is invoked. While it's the fastest type of failover, it
does require that the original primary database as well as all standby
databases that aren't targets of the failover must
be re-enabled.
Failing Over To a Logical Standby Database: Considerations.
Though its definitely possible, Id like to avoid using a logical standby
database as a failover target:
-
A logical standby doesnt use Redo Apply to apply
changes to data; instead, it uses SQL Apply
just as Oracle Streams does during basic database replication. There are
intrinsic limitations on which data types can accept DML for changed data, and thus
not all the data in the primary database may
be present in the logical standby database.
-
SQL Apply can be configured to accept only selected subsets of logical
change records, so even if all of a tables data types are completely
supported, not all of the changes on the
primary database may have been applied on the logical standby database.
-
Perhaps most importantly, once a failover to a logical
standby database is complete, all other standby
databases will be permanently
disabled and must be recreated manually.
Ill explore these limitations (but
also highlight the features) of logical standby databases in an upcoming
article in this ongoing series.
Reinstatement: Post-Failover Stress Relief
Prior
to Oracle Database Release 10g,
once the primary database had failed, it was impossible to bring the original
primary database back into the Data Guard configuration as a new physical
standby database without having to rebuild it completely. This was usually
accomplished by using the new primary database as the target database and the
original primary as the auxiliary database within the bounds of the DUPLICATE DATABASE
Recovery Manager (RMAN) command set.
As
long as certain prerequisites are in place, however, Oracle 10g offered the ability to reinstate the
original primary database as a physical standby database:
-
Flashback
Logging must have been enabled on the original primary database.
(Since this is a prerequisite for activating a physical standby database as a snapshot standby database in Oracle 11g, this
requirement is usually already in place if Im following Oracle 11gs best practice recommendations.)
-
Sufficient Flashback Logs must be present on the
original primary database server to allow Flashback
Database to rewind the database to a state prior to its failure.
-
Finally, the original primary database must be able
to communicate over the network to the new primary database and Data Guard
Broker so that it can be restored
properly, Redo Apply can be re-established,
and it can be resynchronized with
the Data Guard configuration.
Reinstating a Failed Primary Database. To
reinstate my failed ORCL_PRIMARY database as a new physical standby
database, the original primary database has to be shut down and then brought
back to MOUNT
state. Otherwise, the initial attempt at reinstatement will fail as shown below
in this DGMGRL
session:
DGMGRL> reinstate database orcl_primary
Reinstating database "orcl_primary", please wait...
Error: ORA-16653: failed to reinstate database
Failed.
Reinstatement of database "orcl_primary" failed
After
connecting to the ORCL_PRIMARY database via
SQL*Plus and issuing the STARTUP MOUNT command, Ill again attempt to
reinstate the database:
DGMGRL> reinstate database orcl_primary
Reinstating database "orcl_primary", please wait...
Operation requires shutdown of instance "orcl_primary" on database "orcl_primary"
Shutting down instance "orcl_primary"...
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
Operation requires startup of instance "orcl_primary" on database "orcl_primary"
Starting instance "orcl_primary"...
Reinstating database "orcl_primary", please wait...
Reinstatement of database "orcl_primary" succeeded
The
successful reinstatement of ORCL_PRIMARY as a physical
standby database is reflected in the Data Guard configuration as well:
DGMGRL> show configuration
Configuration
Name: MAA_orcl
Enabled: YES
Protection Mode: MaxPerformance
Databases:
orcl_stdby1 - Primary database
orcl_primary - Physical standby database
Fast-Start Failover: DISABLED
Current status for "MAA_orcl":
SUCCESS
DGMGRL> show database orcl_primary
Database
Name: orcl_primary
Role: PHYSICAL STANDBY
Enabled: YES
Intended State: APPLY-ON
Instance(s):
orcl_primary
Current status for "orcl_primary":
SUCCESS
The
corresponding alert log entries from the successful reinstatement process for
the ORCL_PRIMARY database are shown in Listing
6.2.
Reverting to the Original Data Guard Configuration.
Now that Ive successfully reinstated my original primary database to a
physical standby database, Ill simply perform a switchover operation via DGB
to return the original primary and physical standby databases to their initial
roles in the Data Guard configuration:
DGMGRL> switchover to orcl_primary;
Performing switchover NOW, please wait...
New primary database "orcl_primary" is opening...
Operation requires shutdown of instance "orcl_stdby1" on database "orcl_stdby1"
Shutting down instance "orcl_stdby1"...
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
Operation requires startup of instance "orcl_stdby1" on database "orcl_stdby1"
Starting instance "orcl_stdby1"...
ORACLE instance started.
Database mounted.
Switchover succeeded, new primary is "orcl_primary"
DGMGRL>
Next Steps
In
the next article in this series, Ill explore
-
Configure
the primary database and one physical standby database for Fast-Start Failover (FSF)
-
Activate,
monitor, and relocate the Fast-Start Failover Observer (FSFO)
-
Insure against the loss of a single FSFO via Enterprise Manager Grid Control
References and Additional Reading
While
Im hopeful that Ive given you a thorough grounding in the technical aspects
of the features Ive discussed in this article, Im also sure that there may be
better documentation available since its been published. I therefore strongly
suggest that you take a close look at the corresponding Oracle documentation on
these features to obtain crystal-clear understanding before attempting to
implement them in a production environment. Please note that Ive drawn upon
the following Oracle Database 11g
documentation for the deeper technical details of this article:
B28279-02 Oracle Database 11g New Features Guide
B28294-03 Oracle Database 11g Data Guard Concepts and
Administration
B28295-03 Oracle Database 11g Data Guard Broker
B28320-01 Oracle Database 11g Reference Guide
B28419-02 Oracle Database 11g PL/SQL Packages and Types
Reference
»
See All Articles by Columnist Jim Czuprynski