Oracle: Preventing Corruption Before it's Too Late - Part 2
October 30, 2003
As stated earlier, corruption typically hits an Oracle database block in memory or on disk. Memory corruptions are handled by the Oracle kernel and the operating system's detection mechanisms. This is not always a perfect solution; sometimes good blocks are marked as corrupt. Disk corruptions are the most frequently seen type of corruption. Once written to disk, a block is not checked again until it is requested. Oracle follows block delivery only until the block reaches the operating system; after that, the operating system takes responsibility for handling the block. However, the operating system cannot make an "intelligent" block check, since it has no knowledge of the Oracle block structure.
Oracle provides several possibilities to defend against block corruptions:
a.) Disk protection
Oracle's main method for detecting Oracle block corruption is the block checksum. After a block modification, the database writer calculates a checksum over the block header bytes. On the next read of the same block, the checksum is recalculated and the block is validated. In addition to regular data and rollback blocks, the database also protects the redo log blocks. The redo log blocks hold the undo information necessary for transaction recovery. The archiver process (ARCH), as part of Oracle's background process structure, calculates a checksum for each redo log block and compares it with the existing checksum information from the redo block header before making an entry in the archived log file.
The checksum information reflects the logical block structure: if a block has the standard Oracle block format, the block is considered good. Checksum validation does not examine the actual block content; therefore, there is still the possibility that the data inside the block is corrupt.
Setting the DB_BLOCK_CHECKSUM parameter prevents a block that does not conform to the standard Oracle block format from being written. If the block corruption is due to a change of format, that kind of corruption will be reported as a media error. The disadvantage of setting this parameter is that some blocks may still contain a retrievable, good data portion, but due to the header corruption, Oracle will mark the whole block as soft corrupted. Once a checksum is generated for a block, the block always uses checksums, even if the parameter is later removed.
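As a minimal sketch, checksumming can be switched on either in the parameter file or dynamically; the syntax below assumes an Oracle 9i instance:

```sql
-- init.ora entry: DBWn stores a checksum in every block header on write
db_block_checksum = true

-- or dynamically, for the running instance
ALTER SYSTEM SET db_block_checksum = TRUE;

-- verify the current setting
SELECT name, value FROM v$parameter WHERE name = 'db_block_checksum';
```

Because checksums stay with a block once written, turning the parameter off later only stops checksumming for blocks that have never had one.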
Setting the DB_BLOCK_CHECKING instance parameter instructs Oracle to make a detailed logical block check (not only of the header) before making any block changes. This block self-consistency checking prevents both memory and disk corruptions. By default, the Oracle kernel always executes a detailed logical check for blocks that belong to the SYSTEM tablespace. If corruption inside the block content is detected, error ORA-00600, or some variation of a media error depending on the action performed on the block, will result. This condition will suspend database activity until the problem has been resolved. In addition, all data in the corrupted block is lost. There has been much discussion regarding how much detailed block checking will slow down the system. From my experience, the overhead was never more than 3%, even under a heavy load. Oracle says that we can expect from 1% to 10% overhead. Again, it all depends on how much insert/delete activity you have on your Oracle database.
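A hedged sketch of enabling the logical check described above, again assuming Oracle 9i syntax:

```sql
-- init.ora entry: logically verify each block before it is changed
-- (the SYSTEM tablespace is always checked, regardless of this setting)
db_block_checking = true

-- or dynamically, for the running instance
ALTER SYSTEM SET db_block_checking = TRUE;
```

Given the 1% to 10% overhead range quoted above, it is worth measuring the impact on a test system with a representative insert/delete load before enabling it in production.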
b.) Memory protection
Database block memory protection is enabled by leaving DB_BLOCK_CACHE_PROTECT at its default setting (TRUE) and having the DB_BLOCK_CHECKSUM parameter enabled.
Oracle uses a special algorithm to protect all modified blocks in the database memory. Any uncontrolled write inside the database memory area will be detected, protecting the disk from receiving blocks that were corrupted in memory. After every successful disk read, the database block is checked in memory: the checksum information from the block header is compared with a checksum calculated on the fly. When they do not match, Oracle raises an error indicating a soft corrupted block, which will be reformatted and reused.
Although in-memory corruption detection is helpful, any corruption detected in memory will crash the database instance. This parameter is hidden and, by default, always on.
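Since the parameter is hidden, it does not appear in V$PARAMETER. One way to inspect it, sketched here under the assumption that the 9i hidden name is `_db_block_cache_protect` (hidden names and the X$ layout vary by release), is to query the X$ parameter tables as SYS:

```sql
-- As SYS: hidden parameters live in x$ksppi / x$ksppcv
-- '_db_block_cache_protect' is an assumed 9i hidden-parameter name
SELECT i.ksppinm AS name, v.ksppstvl AS value
  FROM x$ksppi i, x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm IN ('_db_block_cache_protect', 'db_block_checksum');
```

Hidden parameters should only be inspected, not changed, unless Oracle Support advises otherwise.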
c.) Additional parameters
Oracle Hardware Assisted Resilient Data (HARD)
Once a data block leaves Oracle's protected memory area, control is handed to the underlying operating system. The information stored in the Oracle data block will be checked again, but only on the next read; there is no guarantee that the data was not changed before actually being written to disk. Usually, this should not be a problem, since the operating system uses its own control mechanisms for data block checking. However, incomplete writes and corrupted blocks still occur on the disk device from time to time. For this reason, storage vendors began to provide Oracle validation and checking algorithms at the storage level. Imagine an EMC or XP storage box that, after writing to disk, reads the stored Oracle block back and performs the Oracle checksum check itself. This initiative is called HARD, and its main goal is to physically protect data. Storage software products certified according to the HARD standard are able to perform exactly this kind of Oracle block validation inside the storage array.
With this methodology, it is possible to follow the Oracle block all the way to the physical disk, allowing us to prevent corruptions before they happen.
Corruption cannot be eliminated, but we can make every effort to discover problems early. Advanced system configurations increase the chance of a block corruption on one of the interconnected system layers. In addition, there are many more possible block corruption situations, with even more variations of the same errors, than could be presented in this article.