Oracle: Preventing Corruption Before it's Too Late: Part 1 - Page 2

September 30, 2003

Oracle Hard Corruption

Because Oracle hard corruption points to problems with the magnetic media, it is also called physical corruption or media error.

The first evidence of possible problems can be found in the machine's system log:

WARNING: /sbus@2,0/SUNW,soc@0,0/
SUNW,pln@b0000000,912cec/ssd@2,0 (ssd117):
Error for Command: read(10)     Error Level: Retryable
Requested Block: 34274544       Error Block: 34274544
Vendor: SEAGATE                 Serial Number: 03421655
Sense Key: Hardware Error

This error message indicates that a disk device is having trouble, and we may see problems at the database block level.
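To watch for messages like this one, the log entry can be broken into named fields. The following is a minimal sketch in Python, assuming the layout shown in the excerpt above (fields separated by runs of spaces, device name in parentheses). The function name `parse_disk_warning` is hypothetical, and other drivers or operating systems may format their messages differently.

```python
import re

# The system log excerpt from above, copied as-is (the device path wraps onto a second line).
SAMPLE = """WARNING: /sbus@2,0/SUNW,soc@0,0/
SUNW,pln@b0000000,912cec/ssd@2,0 (ssd117):
Error for Command: read(10)     Error Level: Retryable
Requested Block: 34274544       Error Block: 34274544
Vendor: SEAGATE                 Serial Number: 03421655
Sense Key: Hardware Error
"""


def parse_disk_warning(text):
    """Return the 'Label: value' pairs of a disk warning as a dict.

    Assumption: two fields on one line are separated by at least two spaces,
    as in the sample. The device name is taken from '(name):'.
    """
    fields = {}
    for line in text.splitlines():
        for chunk in re.split(r"\s{2,}", line.strip()):
            label, sep, value = chunk.partition(": ")
            if sep:
                fields[label] = value
    match = re.search(r"\((\w+)\):", text)
    fields["device"] = match.group(1) if match else None
    return fields


info = parse_disk_warning(SAMPLE)
print(info["device"], info["Error Level"], info["Sense Key"])
```

A monitoring script could raise an alarm whenever the "Sense Key" field reports a hardware error, well before Oracle touches the affected block.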

Let's take a global look at how the different hardware components interact with an Oracle block:

System Memory is the main machine memory, from which the database SGA (System Global Area) is allocated. The SGA is allocated at database startup and is used as a cache for all database operations. The second part of the Oracle memory model is the sessions' PGA (Program Global Area), where sessions store their operational environment. The database writer process controls the movement of Oracle data blocks and interacts with the operating system when a block is handed over. The operating system immediately returns a response to the database writer and continues handling the block in the background, all the way to the physical device. This describes a standard UNIX file system implementation.
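The handover described above can be illustrated outside Oracle. The sketch below is plain Python, not Oracle code: an ordinary write typically returns once the data sits in operating system buffers, while `os.fsync` asks the operating system to push it to the device. The 8192-byte block size, the file name, and the function `write_block` are assumptions made for the illustration.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size, for illustration only


def write_block(path, block_no, data, sync=False):
    """Write one block at its offset in the file.

    Without sync, the call normally returns as soon as the data is in the
    operating system buffers; with sync=True, os.fsync() is called so the
    operating system flushes the file to the underlying device first.
    """
    if len(data) != BLOCK_SIZE:
        raise ValueError("data must be exactly one block long")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.lseek(fd, block_no * BLOCK_SIZE, os.SEEK_SET)
        written = os.write(fd, data)
        if sync:
            os.fsync(fd)
    finally:
        os.close(fd)
    return written


demo_path = os.path.join(tempfile.mkdtemp(), "demo.dbf")  # hypothetical file name
print(write_block(demo_path, 2, b"\x01" * BLOCK_SIZE, sync=True))
```

Between the return of the plain write and the moment the data reaches the disk, the block lives only in memory, which is where the corruption scenarios discussed here can occur.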

Faulty memory modules, if not detected in time by the OS memory detection mechanism, can cause database block corruption. The corruption can happen in the database memory, in the operating system buffer cache, or in the file system I/O buffers.

Memory corruption can slip through to the disk device before the operating system detects the memory parity error. Upon detection, the operating system writes warning messages to the system log and starts defensive action.

Some well-known media corruption errors:

ORA-00600: internal error code, arguments: [3374], [], [], [], [], [], [], []

If you get an ORA-00600 [3374], you have encountered corruption in memory. Shutting down and restarting the database instance could clear the problem. If not, you will need to call Oracle Support for the procedure on how to write the corrupted block out to the disk.

Upon instance startup, when a corrupted object is touched, Oracle will raise an ORA-01578 error, indicating media corruption.

ORA-00600: internal error code, arguments: [3398], [15727], [15742], [], [], [], [], []

ORA-00600 [3398] is a database error indicating that the database writer process has detected a corrupted block in the cache and will crash the instance.

ORA-00600: internal error code, arguments: [4147], [], [], [], [], [], [], []

ORA-00600 [4147] indicates memory corruption in rollback segment blocks (invalid SCN), most probably due to a lost write to the rollback segment.

ORA-00600: internal error code, arguments: [4193], [15727], [15742], [], [], [], [], []

ORA-00600 [4193] indicates corruption in the rollback segment: the transaction table and the rollback block are out of sync.

ORA-00600: internal error code, arguments: [12700], [], [], [], [], [], [], []
Block Checking: DBA = 16794980, Block Type = KTB-managed data block 
kdbchk: bad row tab 0, slot 42   -> bad row data 
kdrchk: row is marked as both Last and Next continue 

This error indicates corruption in the index, the table, or the mapping between them. In essence, an index ROWID is pointing to a non-existent row in the data block. Corruption of this type can be in the data block or in the index block. Upon discovery, the whole block is marked corrupt, and normal access to the data in the block is no longer possible. Marking a block as corrupt may break referential integrity constraints, and the object free list may become inaccessible, depending on the location of the corrupted block.
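The ORA-00600 messages listed so far differ only in their first argument, so a small lookup can help when scanning an alert log. The following is a minimal sketch in Python. The descriptions are paraphrased from this article only and are not an official Oracle reference, and the names `ORA_600_NOTES` and `describe_ora_600` are hypothetical.

```python
import re

# First ORA-00600 argument -> short note, paraphrased from this article.
ORA_600_NOTES = {
    3374: "corruption in memory",
    3398: "database writer detected a corrupted block in the cache",
    4147: "rollback segment block corruption (invalid SCN)",
    4193: "transaction table and rollback block out of sync",
    12700: "corruption in the index, the table, or the mapping between them",
}

ORA_600_RE = re.compile(r"ORA-0*600: internal error code, arguments: \[(\d+)\]")


def describe_ora_600(line):
    """Return (first_argument, note) for an ORA-00600 line, or None if no match."""
    match = ORA_600_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, ORA_600_NOTES.get(code, "not covered in this article")


print(describe_ora_600(
    "ORA-00600: internal error code, arguments: [4193], [15727], [15742], [], [], [], [], []"
))
```

Any argument that is not in the table is reported as not covered, rather than guessed at.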

Disk Controller is a hardware device (SCSI, SSA, RAID...) equipped with an on-board cache, used to control communication between the physical disk drive and the operating system I/O calls. A faulty controller or faulty firmware on the controller board can cause corruption at this level.

For example, here is an error message from a database instance that crashed because of a database writer error condition:

"DBWR failed to complete async write within 183 seconds".

Oracle was attempting to write data to the disk. In normal circumstances, an asynchronous I/O call returns control to Oracle immediately. However, when there is a problem with the hardware, the I/O request will time out after 180 seconds, logging a message in the database alert log and the database writer trace file. The next attempt to write data will be in 360 seconds, and if it fails again, the database writer will terminate the Oracle instance.

Another disk controller situation can occur when the controller is working correctly, but a controller bottleneck causes system write errors to be logged.
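Because the first timeout is only a warning and the second one brings the instance down, it pays to spot the message early. The following is a minimal sketch in Python that scans alert log lines for the message quoted above. The function name `find_dbwr_timeouts` and all sample lines other than the DBWR message itself are hypothetical.

```python
import re

DBWR_TIMEOUT_RE = re.compile(
    r"DBWR failed to complete async write within (\d+) seconds"
)


def find_dbwr_timeouts(lines):
    """Return (line_number, seconds) for every DBWR async write timeout found."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = DBWR_TIMEOUT_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1))))
    return hits


# Hypothetical alert log excerpt; only the DBWR message comes from the article.
sample_log = [
    "some unrelated alert log line",
    "DBWR failed to complete async write within 183 seconds",
    "another unrelated alert log line",
]
print(find_dbwr_timeouts(sample_log))
```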

Disk Device represents a physical, mechanical device used for storing data. Disk devices have a limited MTBF (mean time between failures), and we can expect them to malfunction sooner or later. Even more critical than mechanical problems with disk devices are disk area corruption problems, where it is not possible to reread previously saved data. These physical errors are handled very well by the underlying operating system, which detects and solves most of them routinely in the background.

A typical Oracle error, following corruption due to a problem with the disk device:

ORA-00600: Internal message code, arguments: [01578] [...] [...] [] [] [].

select count(*) from artist_test;
ERROR:
ORA-01578: ORACLE data block corrupted (file # 7, block # 128239) -> 7 is relative file number
ORA-01110: data file 22: '/oracle/artist/artist01.dbf' -> 22 is absolute file number

Oracle message ORA-01578 indicates a media block corruption. Each time a SQL statement tries to access (read or write) the corrupted block, Oracle will signal an error.

ORA-01578 usually comes with the ORA-01110 error, indicating the file name and absolute file number. Several occurrences of ORA-01578 errors, always with the same arguments, definitely point to a media error. When the ORA-01578 error arises with different arguments, we are dealing with some other system error, possibly a problem with memory, I/O or some sort of swap problem.
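The rule of thumb above (same arguments every time versus changing arguments) can be applied mechanically to collected error messages. The following is a minimal sketch in Python. The function name `classify_ora_1578` and the three result labels are hypothetical, and treating a single occurrence as inconclusive is an assumption added for the sketch, not a statement from Oracle.

```python
import re
from collections import Counter

ORA_1578_RE = re.compile(
    r"ORA-0*1578: ORACLE data block corrupted \(file # (\d+), block # (\d+)\)"
)


def classify_ora_1578(lines):
    """Group ORA-01578 errors by (file#, block#) and apply the rule of thumb.

    Returns (verdict, counts) where verdict is one of:
      "none"         - no ORA-01578 found
      "media"        - the same block reported more than once, and no other block
      "other"        - different blocks reported (memory, I/O or swap suspected)
      "inconclusive" - a single occurrence only (assumption of this sketch)
    """
    counts = Counter()
    for line in lines:
        match = ORA_1578_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    if not counts:
        return "none", counts
    if len(counts) > 1:
        return "other", counts
    if sum(counts.values()) > 1:
        return "media", counts
    return "inconclusive", counts


repeated = ["ORA-01578: ORACLE data block corrupted (file # 7, block # 128239)"] * 3
print(classify_ora_1578(repeated)[0])
```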

The operating system will try to repair the corrupted block. When it has succeeded, the block is zeroed, preventing Oracle from identifying the data block content.

Others - Many other causes can lead to Oracle block corruption, for example operating system bugs or the use of disk repair utilities.
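A block that has been zeroed this way can be found by reading a file in block-sized pieces and looking for pieces that contain nothing but zero bytes. The following is a minimal sketch in Python, assuming an 8192-byte block size and counting blocks from the start of the file. The function name `find_zeroed_blocks` is hypothetical. A zero-filled block is only a hint: space that was never formatted may also read as zeros, so Oracle's own verification tools remain the authority.

```python
import os
import tempfile


def find_zeroed_blocks(path, block_size=8192):
    """Return the numbers of all full blocks in the file that contain only zero bytes.

    Block numbers are simple offsets from the start of the file (0, 1, 2, ...),
    which is an assumption of this sketch and may not match Oracle's numbering.
    """
    zeroed = []
    with open(path, "rb") as f:
        block_no = 0
        while True:
            block = f.read(block_size)
            if not block:
                break
            # any() is False only when every byte in the block is zero.
            if len(block) == block_size and not any(block):
                zeroed.append(block_no)
            block_no += 1
    return zeroed


# Build a small three-block test file with a zeroed block in the middle.
demo_file = os.path.join(tempfile.mkdtemp(), "zero_demo.dbf")  # hypothetical name
with open(demo_file, "wb") as out:
    out.write(b"\x01" * 8192 + b"\x00" * 8192 + b"\x02" * 8192)
print(find_zeroed_blocks(demo_file))
```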

Conclusion:

Hard disk and memory media errors are the types of system errors that occur most often. A good backup strategy should be enough for a database administrator to keep the situation under control. These situations offer a good opportunity for a DBA to impress others by demonstrating skill and performing a fast, efficient database recovery.

» See All Articles by Columnist Marin Komadina







