Database Recommendations
As stated
earlier, corruption typically hits an Oracle database block in the memory or on
the hard disk. Memory corruptions are handled with the Oracle kernel and
operating system detection mechanism. This is not always a perfect solution; sometimes,
good blocks are marked as corrupt. Disk corruptions are the most often seen
type of corruption. Once written to the disk, the block is not checked again
until it is requested. Oracle will follow block delivery until the block reaches
operating system. After that, the operating system will take responsibility for
block handling. However, the operating system might not make an "intelligent"
block check, since it has no knowledge about the Oracle block structure.
Oracle provides
several possibilities to defend against block corruptions:
a.)
Disk
protection
DB_BLOCK_CHECKSUM=TRUE
Oracle's
main method for detecting Oracle block corruption is with the block checksum.
After the block modification, the header bytes checksum is calculated by the database
writer. On the next reread of the same bock, the block is validated. In
addition to regular data and rollback blocks, the database also manages the redo
log blocks. The redo log blocks hold undo information, which is necessary for
transaction recovery. The Archive process (ARCH), as a part of Oracle's
background process structure, will calculate checksum for the redo log block
and compare it with the existing checksum information from the redo block header
before making an entry into archive log file.
The checksum
information represents the logical block structure. If a block has a standard
Oracle block format, then the block is good. The process of checksum validation
does not read the actual block content, therefore, there is still the
possibility that data inside the block is corrupt.
Setting
this parameter prevents a block from being written that does not confirm to the
standard Oracle block format. If the block corruption is due to a change of
format, that kind of the corruption will be discovered as a media error. The disadvantage
of setting this parameter is that some blocks can be retry-able with the good
data portion, but due to the header corruption, Oracle will mark the block as
soft corrupted. Once a checksum is generated for a block, the block always uses
checksums, even if the parameter is later removed.
DB_BLOCK_CHECKING=TRUE
Setting
this instance parameter instructs Oracle to make a detailed logical block check,
(not only header), before making any block changes. This block self-consistent
checking prevents memory and disk corruptions. By default, the Oracle kernel
always executes a detailed logical check for blocks that belong to the system tablespace.
If corruption inside of the block content is detected, media error ORA-00600
[6599] or some variation of the media error, depending of the action completed
on the block, will result. This condition will suspend database activity until
the problem has been resolved. In addition, all data in the corrupted block is
lost. There has been much discussion regarding how much detailed block checking
will slow down the system. From my experience, the overhead was never more than
3%, even under a heavy load. Oracle says that we can expect from 1% to 10%
overhead. Again, it all depends how much insert/delete activity you have on
your Oracle database.
b.)
Memory
protection
_DB_BLOCK_CACHE_PROTECT=TRUE
Database
block memory protection is enabled by leaving DB_BLOCK_CACHE_PROTECT on
default settings (TRUE) and having the DB_BLOCK_CHECKSUM parameter enabled.
Oracle
will use a special database algorithm to protect all modified blocks in the
database memory. Any uncontrolled database writes inside the database memory
area will be detected, protecting the disk from getting in memory corrupted
blocks. After every successful disk read, the database block is checked in the
database memory. The checksum information from the block header is compared
with the memory checksum calculated on the fly. When they do not match, Oracle
raises an error indicating a soft corrupted block, which will be reformatted
and reused again.
Although
in memory corruption detection is helpful, any in the memory corruption will
crash the database instance. This parameter is hidden and by default always on.
c.)
Additional
parameters
_CHECK_BLOCK_AFTER_CHECKSUM - perform block check after checksum, by default set ON, and activated
only when DB_BLOCK_CHECKSUM=TRUE
_DB_ALWAYS_CHECK_SYSTEM_TS - always perform block check and checksum for the system tablespace,
by default set to ON
Oracle Hardware Assisted Resilient Data (HARD)
Once the
data block leaves Oracle's protected memory area, control is handed to the
underlying operating system. Information stored in the Oracle data block will
be checked again, but only on the next read. There is no guarantee that data
was not changed before actually being written to the disk. Usually, this should
not be a problem, since the operating system will use the underlying control
mechanisms for data block checking. However, incomplete writes and corrupted
blocks still, from time to time, occur on the disk device. For this reason, storage
vendors began to provide Oracle validation and checking algorithms at the
storage level. Can you imagine one EMC, or XP storage box, which, after writing
to disk, reads the Oracle stored block and then does the Oracle checksum check?
This whole initiative is called HARD, with the main target to physical protect data. Storage
software products, certified according to the HARD standard are able to make:
-
validation of
checksum and selected block fields for the datafile blocks
-
validation of
checksum and selected block fields for the redo log blocks
-
validation of
checksum and selected block fields for the controlfile blocks
-
validation of
single block write
With this
methodology, it is possible to follow the Oracle block all the way to the
physical disk, allowing us to prevent corruptions before they happen.
Conclusion
Corruption
cannot be eliminated, but we can make every effort to discover the problem
early. Having advanced
system configurations increases the chance of having a block corruption on one
of interconnected system layers. In addition, there are many more possible block corruption
situations, with even more variations of the same errors, than were presented
in this article.
»
See All Articles by Columnist Marin Komadina