Synopsis.
While Oracle 11gs Data Guard definitely protects a database when the entire
production site is lost via its failover capabilities, its still necessary for
an Oracle DBA to intervene to complete the failover process. This article the
seventh in this ongoing series shows how to set up and control automatic
failover using the Fast-Start Failover Observer so that DBA intervention is no
longer required during a disaster recovery scenario.
The prior
article in this series demonstrated how to:
-
Manually
initiate a failover operation when the primary database is no longer
accessible
-
Reinstate
a failed primary database by transforming it back into a physical standby
database
This
article will show how to:
-
Configure
the primary database and one physical standby database for Fast-Start Failover (FSF)
-
Activate,
monitor, and relocate the Fast-Start Failover Observer (FSFO)
-
Insure against the loss of a single FSFO via Enterprise Manager Grid Control
Fast-Start Failover: Concepts
Oracle
introduced the Fast-Start Failover
(FSF) feature set in Release 10gR2,
but its been enhanced significantly in Oracle 11g to permit much finer-grained control over the conditions
under which a FSF would be initiated. Simply put, FSF ensures that under the
appropriate circumstances some of which are mandatory,
and some of which are optionally-configured
- a failover to the chosen standby target database will occur without DBA
intervention. The following occurrences will initiate FSF:
|
Table 7-1. Fast-Start
Failover: Triggering Events
|
|
Event Class
|
Triggering Event
|
|
Connectivity Loss
|
Network
connectivity is lost simultaneously between the primary database and:
-
the FSFO itself;
and
-
the standby database
designated as the FSF target
and
the connectivity time lost exceeds the FSF
threshold
|
|
Database Health Check Failure
|
A
database health check detects any of the following optionally-configured
failures:
- Any datafile has gone offline due to a write error
- A critical database object has dictionary
corruption
- A control file is permanently destroyed because a
disk has failed
- Log Writer (LGWR) cannot write to any member of a
log group due to an I/O failure
- Archiver (ARCn) cannot archive a redo log because the
destination is full or unavailable
|
|
Instance Crash (Single Instance)
|
The
primary databases instance has crashed
|
|
Instances Crash (RAC)
|
All
instances for a RAC primary database have crashed
|
|
Shutdown Abort on Primary
|
The
primary database is shut down with the SHUTDOWN ABORT command
|
Fast-Start
Failover detects one of these failover situations through the Fast-Start Failover Observer (FSFO). The
FSFO leverages the Oracle Call Interface
(OCI) architecture to decide when
a failover is necessary, which
physical standby database should be the target of the failover, and how long to wait until it declares a
failover is absolutely necessary. The good news is that I have quite a bit of
control over these three directives, all of which are completely configurable
using the Data Guard Broker Manager
utility (DGMGRL).
Fast-Start Failover: Installation. The FSFO
application software is automatically loaded as part of the standard installation
when an Oracle 11g database home
is created. Either DGMGRL or Oracle EM Grid Control can be used to control the
FSFO when a complete database home installation is present. Alternatively, FSFO
may be installed by downloading the Oracle 11g Client installation software
from otn.oracle.com and then
installing just the Oracle Client Administrator on the desired server; however,
its important to note that when it has been installed on a separate server,
the FSFO can only be managed via the DGMGRL utility.
Fast-Start Failover: Basic Configuration. Since
its certainly possible that more than one physical standby database could
exist in a Data Guard configuration, the first thing that Ill need to
establish is which physical standby database should be paired with the primary
database in case a fast-start failover is initiated. Ill do that by setting a
value for the FastStartFailoverTarget
parameter via the DGMGRL utility. Note that Ive chosen the primary
database as the fast-start failover target for the selected physical standby
database as well:
DGMGRL> EDIT DATABASE orcl_primary SET PROPERTY FastStartFailoverTarget = 'orcl_stdby1';
DGMGRL> EDIT DATABASE orcl_stdby1 SET PROPERTY FastStartFailoverTarget = 'orcl_primary';
Next,
Ill establish how long the Fast-Start Failover Observer should wait until it
decides that the primary database is unreachable by setting a value of 180
seconds for the FastStartFailoverThreshold parameter:
EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = '180';
Now
that the basic fast-start failover configuration is completed, I can confirm
its status with the SHOW FAST_START FAILOVER command:
DGMGRL> show fast_start failover
Fast-Start Failover: DISABLED
Threshold: 90 seconds
Target: (none)
Observer: orcl_stdby1
Lag Limit: 30 seconds
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Configurable Failover Conditions
Health Conditions:
Corrupted Controlfile YES
Corrupted Dictionary YES
Inaccessible Logfile NO
Stuck Archiver NO
Datafile Offline YES
Oracle Error Conditions:
(none)
DGMGRL> show database verbose orcl_primary;
Database
Name: orcl_primary
Role: PRIMARY
Enabled: YES
Intended State: TRANSPORT-ON
Instance(s):
orcl_primary
Properties:
DGConnectIdentifier = 'orcl_primary'
ObserverConnectIdentifier = ''
LogXptMode = 'ASYNC'
DelayMins = '0'
Binding = 'OPTIONAL'
MaxFailure = '0'
MaxConnections = '1'
ReopenSecs = '300'
NetTimeout = '30'
RedoCompression = 'DISABLE'
LogShipping = 'ON'
PreferredApplyInstance = ''
ApplyInstanceTimeout = '0'
ApplyParallel = 'AUTO'
StandbyFileManagement = 'AUTO'
ArchiveLagTarget = '0'
LogArchiveMaxProcesses = '4'
LogArchiveMinSucceedDest = '1'
DbFileNameConvert = ''
LogFileNameConvert = ''
FastStartFailoverTarget = 'orcl_stdby1'
StatusReport = '(monitor)'
InconsistentProperties = '(monitor)'
InconsistentLogXptProps = '(monitor)'
SendQEntries = '(monitor)'
LogXptStatus = '(monitor)'
RecvQEntries = '(monitor)'
HostName = '11gPrimary'
SidName = 'orcl_primary'
StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/ORCL/'
AlternateLocation = ''
LogArchiveTrace = '0'
LogArchiveFormat = 'log_%s_%t_%r.arc'
LatestLog = '(monitor)'
TopWaitEvents = '(monitor)'
Current status for "orcl_primary":
SUCCESS
DGMGRL> show database verbose orcl_stdby1
Database
Name: orcl_stdby1
Role: PHYSICAL STANDBY
Enabled: YES
Intended State: APPLY-ON
Instance(s):
orcl_stdby1
Properties:
DGConnectIdentifier = 'orcl_stdby1'
ObserverConnectIdentifier = ''
LogXptMode = 'ASYNC'
DelayMins = '0'
Binding = 'OPTIONAL'
MaxFailure = '0'
MaxConnections = '1'
ReopenSecs = '300'
NetTimeout = '30'
RedoCompression = 'DISABLE'
LogShipping = 'ON'
PreferredApplyInstance = ''
ApplyInstanceTimeout = '0'
ApplyParallel = 'AUTO'
StandbyFileManagement = 'AUTO'
ArchiveLagTarget = '0'
LogArchiveMaxProcesses = '4'
LogArchiveMinSucceedDest = '1'
DbFileNameConvert = ''
LogFileNameConvert = '/u01/app/oracle/oradata/orcl/, /u01/app/oracle/oradata/stdby/'
FastStartFailoverTarget = 'orcl_primary'
StatusReport = '(monitor)'
InconsistentProperties = '(monitor)'
InconsistentLogXptProps = '(monitor)'
SendQEntries = '(monitor)'
LogXptStatus = '(monitor)'
RecvQEntries = '(monitor)'
HostName = '11gStdby'
SidName = 'orcl_stdby1'
StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/STDBY/'
AlternateLocation = ''
LogArchiveTrace = '0'
LogArchiveFormat = 'log_%s_%t_%r.arc'
LatestLog = '(monitor)'
TopWaitEvents = '(monitor)'
Current status for "orcl_stdby1":
SUCCESS
Configuring Detection of Advanced Failover Conditions
Which
physical standby database is the target for Fast-Start Failover and how long to
wait until initiating Fast-Start Failover are obviously the most important
reactions I want to control in a failover situation. However, Oracle 11g Data
Guard Broker also offers the ability to adjust just how much lost redo data is permitted before declaring a
failover and what it should do with the primary database after the failover operation. Heres a
complete list of the configurable thresholds, triggering events, and
post-failover behaviors in a Fast-Start Failover situation:
|
Table 7-2. Fast-Start
Failover: Configurable Reactions
|
|
FSFO Parameter
|
Default Value
|
Triggering Event
|
|
FastStartFailoverTarget
|
NULL
|
Indicates
which database the FSFO will choose as its failover target should a
Fast-Start Failover occur
|
|
FastStartFailoverThreshold
|
90
|
Determines
how long the FSFO will wait
until it triggers a Fast-Start Failover. Note that both the target physical standby and the FSFO must have lost network
connectivity to the primary database for this time limit (in seconds) before
FSF is triggered
|
|
FastStartFailoverLagLimit
|
30
|
When
the primary database is operating in Maximum
Performance mode, this sets a limit for the amount of lost data in
seconds before the FSFO should trigger a Fast-Start Failover
|
|
FastStartFailoverPmyShutdown
|
TRUE
|
Determines
whether the FSFO will automatically shut
down the primary database after a Fast-Start Failover occurs
|
|
FastStartFailoverAutoReinstate
|
TRUE
|
Determines
whether the FSFO will automatically attempt to reinstantiate the designated primary database as the physical standby as soon as the FSFO
detects the primary site is once again available
|
Advanced Failover Options. In addition to
these conditional triggering events and controls, FSFO also offers exquisitely
detailed control over which failures of the primary databases infrastructure
would trigger a Fast-Start Failover:
|
Table 7-3. Fast-Start
Failover: Advanced Failover Triggers
|
|
Trigger
|
Active By Default?
|
Explanation
|
|
Datafile
Offline
|
Yes
|
Tells
FSFO to perform a failover whenever a
datafile is offline on the primary database
|
|
Corrupted
Controlfile
|
Yes
|
Tells
FSFO to perform a failover whenever a
corrupted control file is detected on the primary database
|
|
Corrupted
Dictionary
|
Yes
|
Requests
FSFO to perform a failover whenever corruption
is detected within the primary databases data dictionary
|
|
Inaccessible
Logfile
|
No
|
Forces
FSFO to perform a failover whenever any
one online redo log member cannot be accessed on the primary
database
|
|
Stuck
Archiver
|
No
|
Tells
FSFO to perform a failover whenever archived redo logging cannot proceed on
the primary database because there is no
more space to write out archived redo logs on that server
|
These
two sets of Fast-Start Failover conditions offer me extremely fine-grained
control over when FSFO should trigger a failover automatically. And even if
these controls are insufficient for my situation, theres one other option: I
can configure FSFO to trigger a Fast-Start Failover upon receipt of a specific
Oracle 11g error message code. For example, if I wanted to force a failover
whenever a critical lack of space in a Flashback Data Archive occurred, I could
configure FSFO to initiate automatic failover whenever the ORA-55623 error is detected on the primary database
with the following command:
DGMGRL> ENABLE FAST_START FAILOVER CONDITION 55623;