Synopsis. While Oracle 11g’s Data Guard definitely protects a database when the entire production site is lost via its failover capabilities, it’s still necessary for an Oracle DBA to intervene to complete the failover process. This article – the seventh in this ongoing series – shows how to set up and control automatic failover using the Fast-Start Failover Observer so that DBA intervention is no longer required during a disaster recovery scenario.
The prior article in this series demonstrated how to:
- Manually initiate a failover operation when the primary database is no longer accessible
- Reinstate a failed primary database by transforming it back into a physical standby database
This article will show how to:
- Configure the primary database and one physical standby database for Fast-Start Failover (FSF)
- Activate, monitor, and relocate the Fast-Start Failover Observer (FSFO)
- Insure against the loss of a single FSFO via Enterprise Manager Grid Control
Fast-Start Failover: Concepts
Oracle introduced the Fast-Start Failover (FSF) feature set in Release 10gR2, but it’s been enhanced significantly in Oracle 11g to permit much finer-grained control over the conditions under which a FSF would be initiated. Simply put, FSF ensures that under the appropriate circumstances – some of which are mandatory, and some of which are optionally-configured – a failover to the chosen standby target database will occur without DBA intervention. The following occurrences will initiate FSF:
Table 7-1. Fast-Start Failover: Triggering Events |
|
Event Class |
Triggering Event |
Connectivity Loss |
Network connectivity is lost simultaneously between the primary database and:
and the connectivity time lost exceeds the FSF threshold |
Database Health Check Failure |
A database health check detects any of the following optionally-configured failures:
|
Instance Crash (Single Instance) |
The primary database’s instance has crashed |
Instances Crash (RAC) |
All instances for a RAC primary database have crashed |
Shutdown Abort on Primary |
The primary database is shut down with the SHUTDOWN ABORT command |
Fast-Start Failover detects one of these failover situations through the Fast-Start Failover Observer (FSFO). The FSFO leverages the Oracle Call Interface (OCI) architecture to decide when a failover is necessary, which physical standby database should be the target of the failover, and how long to wait until it declares a failover is absolutely necessary. The good news is that I have quite a bit of control over these three directives, all of which are completely configurable using the Data Guard Broker Manager utility (DGMGRL).
Fast-Start Failover: Installation. The FSFO application software is automatically loaded as part of the standard installation when an Oracle 11g database home is created. Either DGMGRL or Oracle EM Grid Control can be used to control the FSFO when a complete database home installation is present. Alternatively, FSFO may be installed by downloading the Oracle 11g Client installation software from otn.oracle.com and then installing just the Oracle Client Administrator on the desired server; however, it’s important to note that when it has been installed on a separate server, the FSFO can only be managed via the DGMGRL utility.
Fast-Start Failover: Basic Configuration. Since it’s certainly possible that more than one physical standby database could exist in a Data Guard configuration, the first thing that I’ll need to establish is which physical standby database should be paired with the primary database in case a fast-start failover is initiated. I’ll do that by setting a value for the FastStartFailoverTarget parameter via the DGMGRL utility. Note that I’ve chosen the primary database as the fast-start failover target for the selected physical standby database as well:
DGMGRL> EDIT DATABASE orcl_primary SET PROPERTY FastStartFailoverTarget = 'orcl_stdby1'; DGMGRL> EDIT DATABASE orcl_stdby1 SET PROPERTY FastStartFailoverTarget = 'orcl_primary';
Next, I’ll establish how long the Fast-Start Failover Observer should wait until it decides that the primary database is unreachable by setting a value of 180 seconds for the FastStartFailoverThreshold parameter:
EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = '180';
Now that the basic fast-start failover configuration is completed, I can confirm its status with the SHOW FAST_START FAILOVER command:
DGMGRL> show fast_start failover Fast-Start Failover: DISABLED Threshold: 90 seconds Target: (none) Observer: orcl_stdby1 Lag Limit: 30 seconds Shutdown Primary: TRUE Auto-reinstate: TRUE Configurable Failover Conditions Health Conditions: Corrupted Controlfile YES Corrupted Dictionary YES Inaccessible Logfile NO Stuck Archiver NO Datafile Offline YES Oracle Error Conditions: (none) DGMGRL> show database verbose orcl_primary; Database Name: orcl_primary Role: PRIMARY Enabled: YES Intended State: TRANSPORT-ON Instance(s): orcl_primary Properties: DGConnectIdentifier = 'orcl_primary' ObserverConnectIdentifier = '' LogXptMode = 'ASYNC' DelayMins = '0' Binding = 'OPTIONAL' MaxFailure = '0' MaxConnections = '1' ReopenSecs = '300' NetTimeout = '30' RedoCompression = 'DISABLE' LogShipping = 'ON' PreferredApplyInstance = '' ApplyInstanceTimeout = '0' ApplyParallel = 'AUTO' StandbyFileManagement = 'AUTO' ArchiveLagTarget = '0' LogArchiveMaxProcesses = '4' LogArchiveMinSucceedDest = '1' DbFileNameConvert = '' LogFileNameConvert = '' FastStartFailoverTarget = 'orcl_stdby1' StatusReport = '(monitor)' InconsistentProperties = '(monitor)' InconsistentLogXptProps = '(monitor)' SendQEntries = '(monitor)' LogXptStatus = '(monitor)' RecvQEntries = '(monitor)' HostName = '11gPrimary' SidName = 'orcl_primary' StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/ORCL/' AlternateLocation = '' LogArchiveTrace = '0' LogArchiveFormat = 'log_%s_%t_%r.arc' LatestLog = '(monitor)' TopWaitEvents = '(monitor)' Current status for "orcl_primary": SUCCESS DGMGRL> show database verbose orcl_stdby1 Database Name: orcl_stdby1 Role: PHYSICAL STANDBY Enabled: YES Intended State: APPLY-ON Instance(s): orcl_stdby1 Properties: DGConnectIdentifier = 'orcl_stdby1' ObserverConnectIdentifier = '' LogXptMode = 'ASYNC' DelayMins = '0' Binding = 'OPTIONAL' MaxFailure = '0' MaxConnections = '1' ReopenSecs = '300' NetTimeout = '30' RedoCompression = 'DISABLE' LogShipping = 'ON' PreferredApplyInstance = '' ApplyInstanceTimeout = '0' ApplyParallel = 'AUTO' StandbyFileManagement = 'AUTO' ArchiveLagTarget = '0' LogArchiveMaxProcesses = '4' LogArchiveMinSucceedDest = '1' DbFileNameConvert = '' LogFileNameConvert = '/u01/app/oracle/oradata/orcl/, /u01/app/oracle/oradata/stdby/' FastStartFailoverTarget = 'orcl_primary' StatusReport = '(monitor)' InconsistentProperties = '(monitor)' InconsistentLogXptProps = '(monitor)' SendQEntries = '(monitor)' LogXptStatus = '(monitor)' RecvQEntries = '(monitor)' HostName = '11gStdby' SidName = 'orcl_stdby1' StandbyArchiveLocation = '/u01/app/oracle/flash_recovery_area/STDBY/' AlternateLocation = '' LogArchiveTrace = '0' LogArchiveFormat = 'log_%s_%t_%r.arc' LatestLog = '(monitor)' TopWaitEvents = '(monitor)' Current status for "orcl_stdby1": SUCCESS
Configuring Detection of Advanced Failover Conditions
Which physical standby database is the target for Fast-Start Failover and how long to wait until initiating Fast-Start Failover are obviously the most important reactions I want to control in a failover situation. However, Oracle 11g Data Guard Broker also offers the ability to adjust just how much lost redo data is permitted before declaring a failover and what it should do with the primary database after the failover operation. Here’s a complete list of the configurable thresholds, triggering events, and post-failover behaviors in a Fast-Start Failover situation:
Table 7-2. Fast-Start Failover: Configurable Reactions |
||
FSFO Parameter |
Default Value |
Triggering Event |
FastStartFailoverTarget |
NULL |
Indicates which database the FSFO will choose as its failover target should a Fast-Start Failover occur |
FastStartFailoverThreshold |
90 |
Determines how long the FSFO will wait until it triggers a Fast-Start Failover. Note that both the target physical standby and the FSFO must have lost network connectivity to the primary database for this time limit (in seconds) before FSF is triggered |
FastStartFailoverLagLimit |
30 |
When the primary database is operating in Maximum Performance mode, this sets a limit for the amount of lost data in seconds before the FSFO should trigger a Fast-Start Failover |
FastStartFailoverPmyShutdown |
TRUE |
Determines whether the FSFO will automatically shut down the primary database after a Fast-Start Failover occurs |
FastStartFailoverAutoReinstate |
TRUE |
Determines whether the FSFO will automatically attempt to reinstantiate the designated primary database as the physical standby as soon as the FSFO detects the primary site is once again available |
Advanced Failover Options. In addition to these conditional triggering events and controls, FSFO also offers exquisitely detailed control over which failures of the primary database’s infrastructure would trigger a Fast-Start Failover:
Table 7-3. Fast-Start Failover: Advanced Failover Triggers |
||
Trigger |
Active By Default? |
Explanation |
Datafile Offline |
Yes |
Tells FSFO to perform a failover whenever a datafile is offline on the primary database |
Corrupted Controlfile |
Yes |
Tells FSFO to perform a failover whenever a corrupted control file is detected on the primary database |
Corrupted Dictionary |
Yes |
Requests FSFO to perform a failover whenever corruption is detected within the primary database’s data dictionary |
Inaccessible Logfile |
No |
Forces FSFO to perform a failover whenever any one online redo log member cannot be accessed on the primary database |
Stuck Archiver |
No |
Tells FSFO to perform a failover whenever archived redo logging cannot proceed on the primary database because there is no more space to write out archived redo logs on that server |
These two sets of Fast-Start Failover conditions offer me extremely fine-grained control over when FSFO should trigger a failover automatically. And even if these controls are insufficient for my situation, there’s one other option: I can configure FSFO to trigger a Fast-Start Failover upon receipt of a specific Oracle 11g error message code. For example, if I wanted to force a failover whenever a critical lack of space in a Flashback Data Archive occurred, I could configure FSFO to initiate automatic failover whenever the ORA-55623 error is detected on the primary database with the following command:
DGMGRL> ENABLE FAST_START FAILOVER CONDITION 55623;