Brief intro
I have a Google alert running for Oracle RAC, and besides
pulling in the regular vendor offerings, I receive a lot of alerts on how often Oracle
RAC is being adopted into the enterprise. I mentioned commoditization in my previous
article, but there are a lot of pressures, and several odd and painful reminders
that RAC needs a good administrator. Even then, things can go wrong, and they
can go very badly wrong. RAC is no longer exclusive to huge
data centers; it is being deployed in SMB environments as well. Since administering
the database requires in-house expertise, it is becoming increasingly
important that we practice the installation and administration of our Oracle
RAC on our VMware Server and/or ESX test bed.
So, let’s pick up where we left off in our previous article
on Clusterware administration.
Administering the OCR Using OCR Backup Files
We will take a quick look at two methods for
copying the Oracle Cluster Registry (OCR) and recovering it. Oracle
Clusterware automatically creates OCR backups every four hours, and it always
retains the last three backup copies of the OCR. The CRSD process
that creates these backups also creates and retains an OCR backup for each full
day, and a backup at the end of each week. So
there is a robust backup taking place in the background. And, you guessed it,
you cannot alter the backup frequencies. What you, the DBA, should do is copy
these generated backup files at least once daily to a device other than the one
where the primary OCR resides. The files are located in
CRS_home/cdata/my_cluster, where CRS_home is your Clusterware home and my_cluster
is the name of your cluster.
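Since Clusterware will not copy its backups off the cluster storage for you, the daily copy is something you script yourself. Below is a minimal Python sketch, assuming the automatic backups are the files ending in .ocr in the backup directory and that the destination is a directory mounted from a different device; the function name and both example paths are hypothetical. Run it on the node that holds the backups (ocrconfig -showbackup tells you where they are), as a user allowed to read them.

```python
import shutil
from datetime import date
from pathlib import Path


def copy_ocr_backups(src_dir, dest_root, today=None):
    """Copy every *.ocr file in src_dir into a dated folder under dest_root.

    src_dir   -- the OCR backup directory (hypothetical example:
                 /u01/crs/cdata/my_cluster)
    dest_root -- a directory on a different device than the OCR
                 (hypothetical example: /mnt/ocr_copies)
    Returns the list of destination paths that were written.
    """
    today = today or date.today()
    dest_dir = Path(dest_root) / today.isoformat()  # e.g. .../2008-01-15
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Assumption: the automatic backups are the files ending in .ocr.
    for backup in sorted(Path(src_dir).glob("*.ocr")):
        target = dest_dir / backup.name
        shutil.copy2(backup, target)  # copy2 also preserves the timestamps
        copied.append(target)
    return copied
```

Scheduled once a day (cron on Linux, the Task Scheduler on Windows), a call such as copy_ocr_backups("/u01/crs/cdata/my_cluster", "/mnt/ocr_copies") leaves one dated folder per day on the second device, so a bad backup copied today never overwrites yesterday's good one.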
Restoring the OCR from generated OCR Backups
Given that most of us run our Oracle RAC on limited hardware,
on a VMware Server or ESX Server, it is no surprise to see applications
fail. When one does, always try restarting the application first. To confirm
that the OCR itself has failed, run ocrcheck. If it reports a failure, the next
step is to fix the underlying problem and, if you cannot, restore the OCR from a backup.
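The verification step can be wrapped in a small script so it is done the same way every time. A minimal Python sketch, assuming ocrcheck is on the PATH (or passed in with its full path) and that a healthy OCR produces the line "Cluster registry integrity check succeeded"; that wording is an assumption based on typical 10g/11g output, so confirm it against your own release before relying on it. The helper names are hypothetical.

```python
import subprocess

# Assumption: ocrcheck prints this line when every configured OCR location
# passes its check. Verify the exact wording on your own release.
SUCCESS_MARKER = "Cluster registry integrity check succeeded"


def ocr_looks_healthy(returncode, output):
    """Treat a non-zero exit code or a missing success line as an OCR failure."""
    return returncode == 0 and SUCCESS_MARKER in output


def run_ocrcheck(ocrcheck="ocrcheck"):
    """Run ocrcheck and return (healthy, combined stdout and stderr)."""
    result = subprocess.run([ocrcheck], capture_output=True, text=True)
    output = result.stdout + result.stderr
    return ocr_looks_healthy(result.returncode, output), output
```

The check is deliberately conservative: anything other than a clean exit plus the success line counts as a failure, and you should read the captured output before going anywhere near a restore.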
On Unix/Linux Systems
Let’s do the following to restore our OCR on Unix/Linux
systems.
- To show the backups, type the command ocrconfig -showbackup
- Check the contents of the backup you intend to use with ocrdump -backupfile my_file
- Go to the Clusterware bin directory and, as root, stop CRS on all nodes with crsctl stop crs
- Perform the restore with ocrconfig -restore my_file
- Restart CRS on all nodes with crsctl start crs
- We have spoken about the CVU (Cluster Verification Utility) and seen it play a crucial
role during installation in our RAC on VMware series. Use it now to check the
OCR’s integrity, with verbose output for all of the nodes: cluvfy comp ocr -n all -verbose
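The steps above can also be written down as a script, which makes them easier to rehearse on the VMware test bed. Below is a minimal Python sketch that builds the command sequence and, by default, only prints it. It assumes the script runs as root, that root can reach every node over ssh, and that the Clusterware bin directory is on root’s PATH on each node; the node names, the backup path, and the helpers restore_plan and run_plan are hypothetical. cluvfy is usually run as the oracle software owner, so you may prefer to run that last step by hand.

```python
import subprocess


def restore_plan(nodes, backup_file):
    """Return the Unix/Linux restore steps as (host, argv) pairs.

    host is None for commands run on the local node; the CRS stop and start
    commands are repeated for every node in the (hypothetical) nodes list.
    """
    plan = [
        (None, ["ocrconfig", "-showbackup"]),
        (None, ["ocrdump", "-backupfile", backup_file]),
    ]
    plan += [(node, ["crsctl", "stop", "crs"]) for node in nodes]
    plan.append((None, ["ocrconfig", "-restore", backup_file]))
    plan += [(node, ["crsctl", "start", "crs"]) for node in nodes]
    plan.append((None, ["cluvfy", "comp", "ocr", "-n", "all", "-verbose"]))
    return plan


def run_plan(plan, dry_run=True):
    """Print every step; unless dry_run, execute it and stop at the first failure.

    Remote steps go through 'ssh root@<node>', an assumption about your setup.
    """
    for host, argv in plan:
        cmd = argv if host is None else ["ssh", "root@" + host] + argv
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For a two-node cluster, run_plan(restore_plan(["rac1", "rac2"], "/u01/crs/cdata/crs/backup00.ocr")) prints eight commands without running anything; pass dry_run=False only once the printed sequence matches what you intend.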
On Windows Systems
- Do the same as above. Check the OCR backups using the
ocrconfig -showbackup command. Verify the contents of the backup using
ocrdump -backupfile my_file, where my_file is your backup file.
- Disable the OCR clients on all nodes by stopping the following
services from the Service Control Panel: OracleClusterVolumeService,
OracleCSService, OracleCRService, and OracleEVMService.
- Restore the OCR backup file with the command ocrconfig -restore my_file.
Always check that the OCR devices in your OCR configuration exist!
- Start all of the services you stopped, then restart all of the nodes to bring the
cluster back to life.
- To check the integrity, run the CVU: cluvfy comp ocr -n all -verbose
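On Windows, the service stops and starts can be issued remotely with the sc command (sc \\node stop ServiceName). Below is a minimal Python sketch that builds those commands for every node. The node names and helper functions are hypothetical, the service order is simply the order listed above, and the node reboots and the final cluvfy check are left as manual steps. Note that sc returns once the stop or start request has been sent, not when the service has finished changing state, so confirm with sc query before moving on.

```python
import subprocess

# The services listed above, in the order the article lists them.
OCR_CLIENT_SERVICES = [
    "OracleClusterVolumeService",
    "OracleCSService",
    "OracleCRService",
    "OracleEVMService",
]


def service_commands(nodes, action):
    """Return one 'sc \\\\node <action> <service>' command per node and service."""
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    return [["sc", "\\\\" + node, action, service]
            for node in nodes
            for service in OCR_CLIENT_SERVICES]


def windows_restore_plan(nodes, backup_file):
    """Commands up to restarting the services; reboots and cluvfy stay manual."""
    plan = [["ocrconfig", "-showbackup"],
            ["ocrdump", "-backupfile", backup_file]]
    plan += service_commands(nodes, "stop")
    plan.append(["ocrconfig", "-restore", backup_file])
    plan += service_commands(nodes, "start")
    return plan


def run_commands(commands, dry_run=True):
    """Print each command; unless dry_run, run it and stop at the first error."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

If a service refuses to stop or start because another service depends on it, adjust the order in OCR_CLIENT_SERVICES for your installation rather than forcing it.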
Overruling the OCR (Oracle Cluster Registry) Data Loss Protection Machinery
Oracle Clusterware is robustly built and tolerates very little
error: an accidental overwrite can throw RAC out of balance, so the OCR guards
against data loss. If your OCR cannot access its mirrored files and for some
reason cannot verify that the available OCR location holds the latest
configuration (it could be anything: a temporary bottleneck in your SAN virtual
disks or in the local shared disks you chose specifically for your OCR; in any
case, some temporary glitch), then Clusterware prevents further modification to
the available OCR. The data protection mechanism also prohibits
Clusterware from starting on the node where only one OCR location is available,
and Oracle reports an error in Enterprise Manager, in the Clusterware alert log
files, or both. If the problem is confined to just one node (all of that
information is displayed neatly in your Enterprise Manager and Clusterware log
file errors, with messages like CLSD-1009 or CLSD-1011), try to restart the node(s).
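Spotting those messages is easy to automate. A minimal Python sketch that scans an alert log for the two codes mentioned above; you supply the log path (on 10g the Clusterware alert log is typically under CRS_home/log/<hostname>/, but check your installation), and the helper names are hypothetical.

```python
import re
from pathlib import Path

# The message codes mentioned above; extend the tuple if your logs show others.
CLSD_CODES = ("CLSD-1009", "CLSD-1011")
CLSD_PATTERN = re.compile("|".join(re.escape(code) for code in CLSD_CODES))


def find_clsd_errors(lines):
    """Return (line_number, text) for each line mentioning one of CLSD_CODES."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if CLSD_PATTERN.search(line)]


def scan_alert_log(path):
    """Scan one Clusterware alert log file; undecodable bytes are replaced."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_clsd_errors(handle)
```

Running scan_alert_log on each node’s alert log tells you quickly whether the problem is confined to one node or shows up everywhere.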
If that does not work and you cannot repair the OCR, then you
are left with no option other than overriding the protection mechanism. Do
not reach for it first! Oracle CRS is robust enough to check and poll
the files appropriately. Be warned that data loss may occur: OCR updates made
since the last known successful update will be lost. In other words, if you
force the issue with ocrconfig -overwrite, any configuration changes made after
the last known good configuration are gone.
How to Override:
- Check and compare the error message output with the Windows
registry (on Windows) or ocr.loc (on Unix/Linux). If they don’t match, try to
repair using ocrconfig -repair.
- Use the ocrdump command (we will look more into OCRDUMP later in our
Administration series) to dump all of the information in the OCR and check
whether it contains the latest updates.
- If you still can’t resolve the CLSD error messages, run
ocrconfig -overwrite to bring the node back to life.
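On Unix/Linux, the first step, comparing what ocr.loc says with the locations the error messages report, can be scripted. A minimal Python sketch, assuming ocr.loc holds key=value lines such as ocrconfig_loc=... and ocrmirrorconfig_loc=... (its usual layout on 10g; the file is typically /etc/oracle/ocr.loc on Linux). The helper names are hypothetical, and the Windows registry side is not covered. The sketch only reports mismatches; moving on to ocrconfig -repair, an ocrdump review, and as a last resort ocrconfig -overwrite stays your decision, in that order.

```python
def parse_ocr_loc(text):
    """Parse key=value lines from ocr.loc, skipping blanks and '#' comments."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip()] = value.strip()
    return entries


def ocr_locations(entries):
    """Configured OCR and OCR mirror locations (keys ending in 'config_loc')."""
    return {value for key, value in entries.items()
            if key.endswith("config_loc") and value}


def mismatched_locations(entries, reported_locations):
    """Locations that appear in only one of ocr.loc and the error output."""
    return sorted(ocr_locations(entries) ^ set(reported_locations))
```

An empty result means ocr.loc and the error output agree; anything else is the mismatch that the first step above asks you to repair.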
Conclusion:
We have taken a quick look at Clusterware administration. We also looked at the
override possibilities for forcing a restore of the OCR files when the OCR’s built-in
protection mechanism prohibits the automatic restore. Future
articles will go into more hands-on training. I have read the Oracle manual
several times and have quoted it often. I advise you to go through the manual
more than once. There is other documentation on RAC, even books, but
nothing comes close to the Oracle Documentation, and Oracle has just made it freely
available! So go ahead: download the PDF books, get the free VMware Server (or a trial
of ESX 3.0, now part of what is called Virtual Infrastructure 3), ask your boss for an
old server, and go do some magic with VMware!