Delaying Standby Database Recovery

Standby databases have been both bane and boon to DBAs. The boon is the ability to quickly recover from a Primary database failure by failing over to the Standby essentially allowing the business to “pick up where they left off” when the unthinkable happens. The bane, many times, is errant transactions which compromise the production data on the Primary side.

Given that, in most Standby or Data Guard installations, the lag period between Primary and Standby is no more than the time it takes to switch logs, this can be far too short of an interval for the DBA to respond to prevent the responsible insert/update/delete from being applied at the Standby site.

Jump ahead:

Standby is a data recovery linchpin.

Delaying Standby recovery saves the day.

Can multiple Standby databases have different delay settings?

Should delaying Standby recovery be a standard practice?

Yes, such a scenario can be a rare occurrence but it’s not beyond the realm of possibility. Compound that with customers using a Standby database as a near-real-time reporting database and the situation gets more complicated. What’s a DBA to do? Let’s examine this scenario and see what possible options DBAs have at their disposal.

Standby is a data recovery linchpin.

Standby databases, whether or not they are managed via Data Guard, have been key components in most Disaster Recovery configurations. All data flowing into or out of the Primary is systematically replicated to one or more Standby databases by configuring Standby redo logs (version 11 and later) or redo transport (all versions).

Let’s look at that last statement in more detail.

When Standby redo logs were introduced, the behavior changed for the standard alter database recover managed Standby database disconnect from session; statement that Oracle automatically uses the Standby redo logs for transaction information.

Yes, the old reliable archived redo logs are still transported to the Standby. But they weren’t the main source of information. The archived redo logs CAN be used, if desired, by modifying the ‘alter database’ statement. Simply add the ‘using archived logfile’ directive, and this option opens the door to setting a delay to the redo apply process so that problem transactions can be caught before they can be applied to the Standby—problem averted.

Redo log apply delay is configured in the spfile parameter log_archive_dest_N (N being a number from 1 to up to 31, depending on the version of Oracle) using, yes, you guessed it, the DELAY parameter. This parameter requires an integer value, in minutes. For example a possible configuration for a Standby database located in Kansas City would be:

log_archive_dest_4='SERVICE=stbyKC, DELAY=1440'
log_archive_dest_state_4=ENABLE

Standby number 4, in Kansas City, will now lag the Primary database by 24 hours when archived logfiles are the source of truth. The delay can be set to any value of minutes, from 1 to whatever you desire; if the delay had been set to 2880 there would be 2 days lag between Primary and Standby. Yes, 2880 minutes is a bit extreme but it’s not an invalid value for the parameter. Remember that corrupted transactions can occur at any time of the day or night and because of that I have seen on more than one occasion the DELAY set to 1440.

Delaying Standby recovery saves the day.

Take, as an example a situation where transactions hit the Primary database 24/7. At 3 AM on Tuesday, December 7th, the production data is rendered unusable due to a corrupted feed from Klottsylvania (not an actual place; we’re dealing in hypotheticals here).

The company’s intrepid DBA— let’s name her Dana Alden—wisely set a 24 hour delay between Primary and Standby. Dana now has time to research when the transaction was applied so the current Standby can be recovered to a time just prior the data corrupting transaction would be processed. Failing over to the Standby would now be possible since our DBA prevented the corruption that rendered the Primary unusable. Yes, the roles are reversed but the former Primary now becomes the Standby once it’s recreated from the new Primary and the corrupted data is no longer an issue.

One fly in this ointment becomes apparent if the Standby is running as an Active Data Guard database. Normally such configurations are found when the Standby is used as an active reporting database. Many times such use requires that the Standby remain in lock step with the Primary: the finance and investment industries are good examples.

Delays would render the reports essentially useless. But, don’t despair, even the finance and investment industries can benefit from using an apply delay.

Such a configuration would be found on a second or third Standby database not used for reporting. Having more than one Standby available can provide at least one source of truth where data corruption can be minimized or eliminated, giving the enterprise a recovery path that doesn’t jeopardize reporting and one that can restore the other Standbys should disaster strike. It’s likely that the current reports are run from a scheduler (on Windows or via cron on Unix/Linux) providing scripts that can be called at will to regenerate reports as necessary: failover, rebuild the remaining Standby databases and re-run the problem reports. Since the old Standby is now the new Primary the archived logs which contain the suspect transactions are no longer valid and the danger of accidentally applying them is averted.

Can multiple Standby databases have different delay settings?

Of course; the example above involves two Standbys, one with a delay of 0 minutes and one with a delay of 1440 minutes. Let’s go back to our wisely cautious DBA Dana and see how she configured three Standby databases for various uses.

Dana’s employer is a stock brokerage firm. It requires the use of Standby databases to feed data to various parts of the organization.

  • Brokers need current data to generate investment reports for clients
  • Sales can use data up to 24 hours old since they are building historical performance reports.
  • Portfolio managers have asked for data no more than two hours old to track futures.

Dana configures and builds three Standbys in separate physical locations — Topeka, KS, Reno, NV, and Baltimore, MD — to isolate them from each other and from the Primary (located in San Diego, CA.)

• Standby #1 is in Topeka, and it’s configured absent any delay; this is for the brokers.

• Standby #2 is in Reno, configured with a 2-hour delay for the portfolio managers.

• Standby #3 is in Baltimore, configured with a 24-hour delay for the sales team.

Now, looking at the log_archive_dest_N parameters Dana implemented we see:

log_archive_dest_2='SERVICE=stbyone'
log_archive_dest_state_2=enable
log_archive_dest_3='SERVICE=stbytwo, DELAY=120'
log_archive_dest_state_3=enable
log_archive_dest_4='SERVICE=stbythree, DELAY=1440'
log_archive_dest_state_4=enable

Now that Dana has the configuration parameters set the various Standbys can be started. Standby #1 uses the Standby redo logs so it can be started with the following command:

alter database recover managed Standby database disconnect from session;

Standbys #2 and #3 will need to use the archive logs rather than the Standby redo logs; Dana modifies her startup command:

alter database recover managed Standby database using archived logfile disconnect from session;

This allows Oracle to use the configured delay to buffer the recovery should a data-corrupting transaction or set of transactions be applied to the Primary.

Note that in this configuration, both the Primary and Standby #1 will be affected by data-corrupting input. Having the remaining Standbys using different delay settings allows for faster data recovery and a shorter recovery time (the two-hour delay Standby requires less redo to recover to just before the corruption hits, bringing the restored Primary online in less time thus starting the Standby rebuild process quicker).

Of course the Standby log destination configuration will change since Reno’s instance is now the new Primary database; San Diego will now become Standby #2. Once the Standby reconfiguration is complete, the applications that used Reno as a Standby will need to be reconfigured to use San Diego. This is a simple change taking minutes to complete since the tnsnames.ora file only needs to swap the connection information between San Diego and Reno. The service names can remain the same since no location names are used to define those services.

Should delaying Standby recovery be a standard practice?

Delaying Standby recovery may not be standard practice but for mission-critical databases maybe it should be. The ability to prevent data corruption from spreading across the Standby landscape can save the business time and money by providing a means to recover from such corruption in a short amount of time. The company gets back on track quickly and reliably by reducing the amount of duplicated work to enter current transactions again.

The key part of Disaster Recovery isn’t the disaster, it’s the recovery and the smoother that process can go the better it is for the enterprise. Some thought and preparation beforehand can better prepare the environment, and the DBA, to respond to that which the DBA hopes never happens. The Scout motto is “Be prepared.” Maybe DBAs should adopt it, too.

David Fitzjarrell
David Fitzjarrell has more than 20 years of administration experience with various releases of the Oracle DBMS. He has installed the Oracle software on many platforms, including UNIX, Windows and Linux, and monitored and tuned performance in those environments. He is knowledgeable in the traditional tools for performance tuning – the Oracle Wait Interface, Statspack, event 10046 and 10053 traces, tkprof, explain plan and autotrace – and has used these to great advantage at the U.S. Postal Service, American Airlines/SABRE, ConocoPhilips and SiriusXM Radio, among others, to increase throughput and improve the quality of the production system. He has also set up scripts to regularly monitor available space and set thresholds to notify DBAs of impending space shortages before they affect the production environment. These scripts generate data which can also used to trend database growth over time, aiding in capacity planning. He has used RMAN, Streams, RAC and Data Guard in Oracle installations to ensure full recoverability and failover capabilities as well as high availability, and has configured a 'cascading' set of DR databases using the primary DR databases as the source, managing the archivelog transfers manually and montoring, through scripts, the health of these secondary DR databases. He has also used ASM, ASMM and ASSM to improve performance and manage storage and shared memory.

Latest Articles