Database backups are a necessary part of a robust management strategy, and making sure they succeed is key to reliable disaster recovery. Several media choices are available for such backups, some of which involve third-party tools such as NetBackup and Tivoli. Unfortunately, it appears that RMAN backups may ‘miscommunicate’ with Tivoli, providing incorrect file sizes and causing the backups to fail, but only for standby databases. Oracle Support has listed this as Bug 18009564, and it affects release 11.2.0.3. Reported initially in 2013, it’s still unresolved in 11.2.0.3; the MOS document reports the bug as fixed in 12.2, with no backport patch for earlier releases. Let’s look at the situation as it was reported to MOS.
The problem was reported against an Oracle 11.2.0.3 database said to be 28 TB in size. RMAN is compressing the backup, and the Tivoli Storage Manager (TSM) storage pool is configured at 49 TB, which appears to be sufficient for the size of the affected database. According to the bug report, TSM is ‘seeing’ much smaller file sizes, and because of that the backups are failing for lack of space. RMAN supplies the estimated file size for each backup channel to the Media Management Library so that Tivoli Data Protection (TDP) can allocate sufficient space in the storage pool. If those numbers are smaller than the actual file sizes for each channel, the storage pool allocation will be insufficient to complete the backup piece. A related bug, 17348361, affects versions 11.2.0.2 through 11.2.0.4 and is resolved with a one-off patch; that patch was applied, but the error persists. RMAN reports the following error messages:
RMAN-3009: failure of backup command on ORA_SBT_TAPE_XX channel at 12/18/2013 00:43:32
ORA-19502: write error on file "0morpg92_1_1", block number 385 (block size=8192)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
ANS1311E (RC11) Server out of data storage space
The reported error seems strange, given that the storage pool, at 49 TB, is considerably larger than the 28 TB database being backed up.
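How an under-reported size leads to an out-of-space error can be shown with a toy model. The sketch below is not RMAN or TSM code: the function name, the four-channel layout and the per-channel sizes are hypothetical, and it pretends the channels write one after another, which real parallel channels do not. It only shows that a pool which comfortably holds the estimated pieces can still run out when the pieces actually written are larger.

```python
def first_failing_channel(pool_tb, piece_sizes_tb):
    """Return the index of the first channel whose piece no longer fits
    in the remaining pool space, or None if every piece fits.

    pool_tb:        free capacity of the storage pool, in TB
    piece_sizes_tb: size written by each channel, in TB, in write order
    """
    remaining = pool_tb
    for channel, size in enumerate(piece_sizes_tb):
        if size > remaining:
            return channel  # the media manager would report "out of space" here
        remaining -= size
    return None


# Hypothetical figures: four channels and a 49 TB pool.
estimated = [7.0, 7.0, 7.0, 7.0]     # sizes the media manager is told about
actual = [12.5, 12.5, 12.5, 12.5]    # sizes really written (assumed)

print(first_failing_channel(49.0, estimated))  # based on the estimates
print(first_failing_channel(49.0, actual))     # based on what is written
```

With these assumed numbers the estimates fit with room to spare, while the real pieces exhaust the pool on the last channel, which is the pattern the ANS1311E message suggests.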
As reported to MOS, this doesn’t occur with the primary database, which is the same size. Of course the question of ‘why back up a standby database?’ comes to mind. Even in Active Standby mode no changes can be made to a physical standby outside of the redo application. Possibly this is a logical standby, but why directly update a logical standby outside of the APPLY process? The original SR apparently doesn’t deal with those questions, so let’s look for ways the standby processing differs from the production system, and what information may be available from PROD that isn’t populated in the standby.
One place RMAN gets information is X$BH, the fixed table detailing information in the block headers. For an active database X$BH records the state of blocks (via the block headers); think of ‘state’ as how the block is being accessed. Running the query below against the production database in a primary/standby configuration returns the following results (please note that the results will change for subsequent runs of the query):
SQL> select decode(state,
         0, 'Free',
         1, 'Exclusive current',
         2, 'Shared current',
         3, 'Consistent read only',
         4, 'Read',
         5, 'Media recovery',
         6, 'Instance recovery',
         8, 'Past image',
         state) state, count(*)
     from x$bh
     group by state;

STATE                                           COUNT(*)
---------------------------------------- ---------------
Exclusive current                                  15706
Consistent read only                                   6

SQL>
Active transactions produce state changes in the block headers. Looking at the standby for this primary, using the same query:
SQL> select decode(state,
         0, 'Free',
         1, 'Exclusive current',
         2, 'Shared current',
         3, 'Consistent read only',
         4, 'Read',
         5, 'Media recovery',
         6, 'Instance recovery',
         8, 'Past image',
         state) state, count(*)
     from x$bh
     group by state;

no rows selected
SQL>
Being a physical standby, it isn’t processing transactions (outside of the recovery ‘transactions’ used to apply the redo stream), and the block header state doesn’t change. This may affect the size estimate RMAN makes for the backup.

Another aspect of RMAN is NULL compression, which applies to DISK backups and backups using Oracle Secure Backup but not to SBT_TAPE backups that use a third-party application such as Tivoli. NULL compression (a term assigned in earlier incarnations of RMAN that remains in use to this day) is also called Unused Block Compression: for disk-based backups RMAN automatically skips data blocks that have never been used. For example, suppose a tablespace is 150 MB in size yet contains only 63 MB of data. With Unused Block Compression only the 63 MB will be processed in a backup to disk or a backup to tape using Oracle Secure Backup; backing that same tablespace up to tape using a third-party utility results in the entire 150 MB tablespace being processed, unused blocks included. Using a third-party tape backup tool, such as NetBackup or Tivoli, rather than disk or Oracle Secure Backup more than doubles the size of the backup for this tablespace.

This may be where the discrepancy occurs in the data supplied to Tivoli from the standby database: RMAN may be supplying piece sizes based on Unused Block Compression to the Media Management Library, which would then be told a smaller piece size than is actually being processed. This would only occur in Active Standby Database or Logical Standby configurations, since a ‘plain-vanilla’ physical standby isn’t open. Following this train of thought, it’s possible that the reported size is actually the expected size of the backup, minus unused space. Let’s surmise that the database is actually 50 TB in total size, with 28 TB of space actually used. Knowing that Unused Block Compression will not be applied because Tivoli is writing to tape, that 28 TB of used space becomes 50 TB of total space, which is greater than expected and also greater than the allocated storage pool. The advice supplied by the vendor, to increase the size of the storage pool, now makes sense since it’s no longer a 28 TB backup.
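The arithmetic behind this reasoning is easy to check. The sketch below uses only the figures quoted above (150 MB and 63 MB for the tablespace; 28 TB used, a 49 TB pool, and the surmised 50 TB total). The function names are illustrative, the 50 TB total is this article's assumption and not a number from the bug report, and RMAN's own backup compression is ignored for simplicity.

```python
def backup_size(total, used, unused_block_compression):
    """Amount of data the backup processes.

    With unused block compression (disk or Oracle Secure Backup) only the
    used blocks are read; without it (third-party tape) the whole file is.
    """
    return used if unused_block_compression else total


def fits(pool, size):
    """True when the storage pool can hold a backup of the given size."""
    return size <= pool


# Tablespace example: 150 MB allocated, 63 MB used.
disk_mb = backup_size(150, 63, unused_block_compression=True)
tape_mb = backup_size(150, 63, unused_block_compression=False)
print(round(tape_mb / disk_mb, 2))   # growth factor on third-party tape

# Database example: 50 TB total (surmised), 28 TB used, 49 TB pool.
print(fits(49, backup_size(50, 28, unused_block_compression=True)))
print(fits(49, backup_size(50, 28, unused_block_compression=False)))
```

Under these assumptions the size RMAN may be estimating fits the pool, while the size actually written through Tivoli does not, which matches the vendor's advice to enlarge the pool.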
Is this actually a bug? I believe so, since one algorithm (albeit the ‘wrong’ one) appears to be used to estimate the backup size and another is actually used to perform the backup. The result is a backup that fails for insufficient space. The workaround, if you will, is to make the storage pool Tivoli uses larger than the overall size of the standby database being backed up.
If it’s necessary to back up a standby database, it’s good to know that third-party tape backup utilities may suffer from this sort of error. Being aware of the error and its possible cause should make it easier to work around the problem on versions of Oracle and RMAN earlier than 12.2.