Tivoli and RMAN in Oracle 11.2.0.3

Database backups are a necessary part of a robust management strategy, and ensuring that they succeed is key to reliable disaster recovery. Several media choices are available for such backups, some of which involve third-party tools such as NetBackup and Tivoli. Unfortunately, it appears that RMAN can ‘miscommunicate’ with Tivoli, supplying incorrect file sizes that cause backups to fail, but only for standby databases. Oracle Support has listed this as Bug 18009564, which affects release 11.2.0.3. First reported in 2013, it remains unresolved in 11.2.0.3; the MOS document lists the bug as fixed in 12.2, with no backport patch for earlier releases. Let’s look at the situation as it was reported to MOS.

The problem was reported against an Oracle 11.2.0.3 database said to be 28 TB in size. RMAN is compressing the backup, and the TSM storage pool is configured at 49 TB, which appears to be sufficient for a database of that size. According to the bug report, TSM is ‘seeing’ much smaller file sizes, and because of that the backups fail for lack of space. RMAN supplies the estimated file size for each backup channel to the Media Management Library so that TDP (Tivoli Data Protection) can allocate sufficient space in the storage pool. If those estimates are smaller than the actual file sizes for each channel, the storage pool allocation will be insufficient to complete the backup piece. A related bug, 17348361, affects versions 11.2.0.2 through 11.2.0.4 and is resolved with a one-off patch, but the patch was applied and the error persisted. RMAN reports the following error messages:


RMAN-3009: failure of backup command on ORA_SBT_TAPE_XX channel at 12/18/2013 00:43:32
ORA-19502: write error on file "0morpg92_1_1", block number 385 (block size=8192)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
ANS1311E (RC11)   Server out of data storage space

The reported error seems strange, given that the storage pool should be more than sufficient for the database size (more on this later). At first glance the two bugs appear to be related, but a closer look shows that they are not. In Bug 17348361 Tivoli reports that the object is too large to process, which is likely due to an error in the numbers RMAN reports to the Media Management Layer, this time on the high side: TSM determines the file to be much larger than the storage pool and never starts the backup. In the ‘current’ bug TSM begins the backup because the data it has received says the object will fit in the storage pool; since the actual size of the backup piece exceeds the size RMAN reported to the MML, a physical write error terminates the backup.
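
To see how much data recent tape backups actually read and wrote, which is the volume the storage pool ultimately has to hold, a query along these lines against V$RMAN_BACKUP_JOB_DETAILS may help. This is a minimal sketch and not part of the original SR: it shows what RMAN recorded for each job after the fact, not the estimate it passed to the Media Management Layer, and the seven-day window is an arbitrary assumption.

```sql
-- Sketch: actual input/output volume of recent RMAN jobs written to tape.
-- This is what RMAN recorded per job, NOT the size estimate sent to the MML.
select start_time,
       input_type,                                    -- e.g. DB FULL, DB INCR, ARCHIVELOG
       status,                                        -- e.g. COMPLETED, FAILED
       round(input_bytes  / power(1024, 4), 2) as input_tb,
       round(output_bytes / power(1024, 4), 2) as output_tb,
       round(compression_ratio, 2)             as compression_ratio
from   v$rman_backup_job_details
where  output_device_type = 'SBT_TAPE'
and    start_time > sysdate - 7                       -- assumed window; adjust as needed
order  by start_time;
```

Comparing output_tb for a completed primary backup with the volume written before a failed standby backup aborted gives a rough idea of how far apart the two really are.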

As reported to MOS, this doesn’t occur with the primary database, which is the same size. Of course, the question ‘why back up a standby database?’ comes to mind. Even in Active Standby mode no changes can be made to a physical standby outside of redo apply. Possibly this is a logical standby, but why directly update a logical standby outside of the APPLY process? The original SR apparently doesn’t address those questions, so let’s look at how standby processing differs from the production system, and at what information may be available from PROD that isn’t populated in the standby.
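
Since the SR does not say which kind of standby is involved, a quick check of V$DATABASE on the standby would settle it. The query below is a small sketch using only standard columns of that view; the values listed in the comments are typical examples, not an exhaustive list.

```sql
-- Run on the standby to see its role and whether it is open.
select database_role,   -- e.g. PRIMARY, PHYSICAL STANDBY, LOGICAL STANDBY
       open_mode        -- e.g. MOUNTED, READ ONLY, READ ONLY WITH APPLY, READ WRITE
from   v$database;
```

A physical standby that reports MOUNTED is the ‘plain’ case, while READ ONLY WITH APPLY indicates a physical standby that is open while redo is being applied.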

One place RMAN gets information is X$BH, the fixed table exposing the buffer headers for blocks in the buffer cache. For an active database X$BH records the state of blocks (via the buffer headers); think of ‘state’ as how the block is being accessed. Running the query below on the production database of a primary/standby configuration returns the following results (please note that the results will change for subsequent runs of the query):


SQL> select decode(state,
                   0, 'Free',
                   1, 'Exclusive current',
                   2, 'Shared current',
                   3, 'Consistent read only',
                   4, 'Read',
                   5, 'Media recovery',
                   6, 'Instance recovery',
                   8, 'Past image',
                   state) state, count(*)
     from x$bh
     group by state;

STATE                                           COUNT(*)
---------------------------------------- ---------------
Exclusive current                                  15706
Consistent read only                                   6

SQL>

Active transactions produce state changes in the buffer headers. Looking at the standby for this primary, using the same query:


SQL> select decode(state,
                   0, 'Free',
                   1, 'Exclusive current',
                   2, 'Shared current',
                   3, 'Consistent read only',
                   4, 'Read',
                   5, 'Media recovery',
                   6, 'Instance recovery',
                   8, 'Past image',
                   state) state, count(*)
     from x$bh
     group by state;

no rows selected

SQL>

Being a physical standby, it isn’t processing transactions (outside of the recovery ‘transactions’ used to apply the redo stream) and the state recorded in X$BH doesn’t change. This may affect the estimate RMAN makes of the backup size.

Another aspect of RMAN to consider is NULL compression, which doesn’t apply to SBT_TAPE backups that use a third-party application such as Tivoli, but does apply to DISK backups and to backups using Oracle Secure Backup. NULL compression (a term assigned in earlier incarnations of RMAN that remains in use to this day) is also called Unused Block Compression; RMAN performs it automatically for disk-based backups, skipping data blocks that are not in use. For example, suppose a tablespace is 150 MB in size yet contains only 63 MB of data. With Unused Block Compression only the 63 MB will be processed in a backup to disk or a backup to tape using Oracle Secure Backup; backing that same tablespace up to tape using a third-party utility can result in the entire 150 MB tablespace being processed, unused blocks included. Using a third-party tape backup tool, such as NetBackup or Tivoli, rather than disk or Oracle Secure Backup would then more than double the size of the backup for this tablespace.

This may be where the discrepancy occurs in the data supplied to Tivoli from the standby database: RMAN may be supplying piece sizes based on Unused Block Compression to the Media Management Library, which would report a smaller piece size than is actually being processed. This would only occur in Active Standby Database or Logical Standby configurations, since a ‘plain-vanilla’ physical standby isn’t open. Following this train of thought, it’s possible that the reported size is actually the expected size of the backup, minus unused space. Let’s surmise that the database is actually 50 TB in total size, with 28 TB of space actually used. Because Tivoli is writing to tape, Unused Block Compression will not be applied, so that 28 TB of used space becomes 50 TB of total space, which is greater than expected and also greater than the 49 TB allocated to the storage pool. The advice supplied by the vendor, to increase the size of the storage pool, now makes sense since it’s no longer a 28 TB backup.
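
The allocated-versus-used comparison (150 MB against 63 MB in the example, 50 TB against 28 TB in the surmise) can be checked for a real database with a query like the following. It is a sketch under stated assumptions: DBA_SEGMENTS stands in for ‘used’ space even though it counts allocated extents, not individual blocks, and the DBA_ views need an open database, so this would be run on the primary or on a standby opened read-only, not on a mounted standby.

```sql
-- Sketch: allocated datafile space vs. space held by segments, per tablespace.
-- used_mb is an approximation (allocated extents), not an exact block count.
select df.tablespace_name,
       round(df.alloc_bytes / 1024 / 1024)        as allocated_mb,
       round(nvl(sg.used_bytes, 0) / 1024 / 1024) as used_mb
from  (select tablespace_name, sum(bytes) as alloc_bytes
       from   dba_data_files
       group  by tablespace_name) df
left join
      (select tablespace_name, sum(bytes) as used_bytes
       from   dba_segments
       group  by tablespace_name) sg
  on   sg.tablespace_name = df.tablespace_name
order by df.tablespace_name;
```

If the total of allocated_mb is well above the total of used_mb, a backup that cannot skip unused blocks may be correspondingly larger than the used-space figure suggests, and the storage pool would need to be sized against the larger number.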

Is this actually a bug? I believe so, since one algorithm (albeit the ‘wrong’ one) appears to be used to estimate the backup size while another is actually used to perform the backup. The result is a backup that fails for lack of space. The workaround, if you will, is to size the storage pool Tivoli is using to be larger than the overall size of the standby database being backed up.

If it’s necessary to back up a standby database, it’s good to know that third-party tape backup utilities may suffer from this sort of error. Being aware of the error and its possible cause should make it easier to work around the problem on versions of Oracle and RMAN earlier than 12.2.

David Fitzjarrell
David Fitzjarrell has more than 20 years of administration experience with various releases of the Oracle DBMS. He has installed the Oracle software on many platforms, including UNIX, Windows and Linux, and monitored and tuned performance in those environments. He is knowledgeable in the traditional tools for performance tuning – the Oracle Wait Interface, Statspack, event 10046 and 10053 traces, tkprof, explain plan and autotrace – and has used these to great advantage at the U.S. Postal Service, American Airlines/SABRE, ConocoPhillips and SiriusXM Radio, among others, to increase throughput and improve the quality of the production system. He has also set up scripts to regularly monitor available space and set thresholds to notify DBAs of impending space shortages before they affect the production environment. These scripts generate data which can also be used to trend database growth over time, aiding in capacity planning. He has used RMAN, Streams, RAC and Data Guard in Oracle installations to ensure full recoverability and failover capabilities as well as high availability, and has configured a 'cascading' set of DR databases using the primary DR databases as the source, managing the archivelog transfers manually and monitoring, through scripts, the health of these secondary DR databases. He has also used ASM, ASMM and ASSM to improve performance and manage storage and shared memory.
