Monitoring Oracle 10g RAC with Quest Spotlight on RAC - Part III
March 16, 2006
In the concluding part of this series, we will discuss the cluster latency and overhead alarms, the global cache alarms, and the ASM alarms.
I am beginning to get emails from people who are attempting to install RAC on their ESX servers for production use. To them, I suggest getting on with tools like SoRAC, and even benchmarking tools like Swingbench, to generate an even load and make that database sweat!
Yes, you guessed it, we will get on to performance testing now!
Hear those alarm bells...
Managing a RAC cluster is a great responsibility. Fortunately, SoRAC comes fitted with alarms to warn us about problems in advance.
Cluster Latency and Overhead Alarms (CLA and COA)
Cluster latency and overhead alarms are primarily related to the high speed interconnect. Compare a single Oracle instance with RAC: in a single instance, when a user requests data in the form of a data block, Oracle first looks for it on the "conveyor belt," as I will call the buffer cache here (with its LRU and MRU ends). If the block is not found in the buffer, Oracle retrieves it directly from disk, causing a disk I/O. Not a good thing, as you would expect to tune your database so that the block is found in the buffer. In a RAC cluster, on the other hand, a session that requests a data block first looks in its own buffer; since no copy exists there, it requests a copy of the block from the other clustered instances via the interconnect (your HSI cable, at 1G or even 10G). The block is then transferred to the requesting instance from the holding instance's buffer; if no instance can supply the block, the requesting instance has to go to disk to get it.
All this requesting, transferring and waiting leads to what is called "cluster overhead," and as the technical manual says:
The following print screen shows how both Latency and Overhead issues are rather prominent.
Normally, both of these alarms fire at the same time, as high latency is something you often see during cluster overhead. These two alarms can fire in the following scenarios:
Clicking on the Latency alarm takes you to the drilldown, which displays a wealth of information regarding the latency issues:
The top left graph shows the overall cluster latency: the large spike on the right of the graph, when averaged over time, has resulted in the latency alarm being fired. Note that this graph is only available for Oracle 10g clusters.
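The phrase "averaged over time" matters here: a single spike does not fire the alarm, but a sustained elevation does. A simple moving average over recent samples illustrates the idea (the window size is my own assumption for the example, not SoRAC's actual averaging interval):

```python
def smoothed_latency(samples, window=5):
    """Average of the last `window` latency samples (window size is assumed)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)

# One isolated spike barely moves the average; a sustained rise dominates it.
```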
The ping graph shows the time required to transfer a packet across the HSI (High Speed Interconnect) cable.
All the other graphs simply break down the latency into components such as "Total Latency," "Prepare Time," "Prepare Latency" and "Transfer Latency," all rather self-explanatory.
We see the same thing when we click on the Overhead alarm: the drilldowns display a breakdown of the overhead issues. "The Cluster Overhead graph compares waits due to the RAC interconnect to non-RAC waits. The recent data shows the RAC waits as a significant percentage of the overall waits, resulting in the Overhead alarm being fired."
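The comparison the Cluster Overhead graph draws can be reduced to simple arithmetic: interconnect waits as a percentage of all waits. A minimal sketch follows; the 30% threshold is my own illustrative number, not SoRAC's default.

```python
def overhead_percentage(rac_waits, non_rac_waits):
    """RAC interconnect waits as a percentage of total wait time."""
    total = rac_waits + non_rac_waits
    return 100.0 * rac_waits / total if total else 0.0

def overhead_alarm(rac_waits, non_rac_waits, threshold=30.0):
    """Fire when RAC waits are a significant share of overall waits.
    The threshold value is assumed for illustration."""
    return overhead_percentage(rac_waits, non_rac_waits) > threshold
```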
As you can see on the right, all the color coding says it all.
Global Cache Alarm (GCA)
GCA is further split into the Cache Miss Rate alarm, the Corrupt Blocks alarm and the Lost Blocks alarm. So what is a cache miss? When a user requests a data block, the block is cached in the SGA for re-use by other sessions until the block ages out and its space is required for other data. A "cache miss" occurs when a data block requested by a user session is not found in the local SGA, and the block has to be obtained from a remote instance, or read from disk.
The Cache Miss Rate metric indicates the percentage of such requests that cannot be satisfied from the local cache, and it directly affects the overall performance of the cluster. When it exceeds a threshold, a Global Cache Local Miss Rate alarm is fired. Possible causes of this alarm are:
Check out the print screen:
The Miss Type graph gives a breakdown of the misses into Current Block requests (for an update operation) and Consistent Block requests (for a read operation). The other three graphs delve into further detail, such as total misses, current block misses and consistent block misses.
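The arithmetic behind the miss rate itself is straightforward: misses (current plus consistent) as a percentage of all block gets. A hedged sketch, with argument names invented for the example:

```python
def cache_miss_rate(current_gets, cr_gets, current_misses, cr_misses):
    """Global cache local miss rate (%): the share of block requests that
    could not be satisfied from the local SGA. Names are illustrative."""
    total_gets = current_gets + cr_gets
    if total_gets == 0:
        return 0.0
    return 100.0 * (current_misses + cr_misses) / total_gets
```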
Now it is a totally different story with corrupt blocks. Yes, we do have them too, don't we? SoRAC raises separate alarms for corrupt and lost blocks. As the manual states,
Check out this print screen:
As you can see, there are blocks lost (top left).
ASM Alarms
Well, we know by now what ASM is. And yes, we do have alarms related to ASM as well. This print screen shows the total I/O activity on an ASM instance.
SoRAC can raise an ASM Service Time alarm when the average time taken to perform a read on the ASM instance has exceeded a specified threshold value. This indicates that the ASM subsystem is performing poorly. "Service time represents the response time of a disk plus any wait time experienced during disk operations."
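Taking the quoted definition literally, service time per operation is the disk response time plus any wait time, averaged over recent operations. A small sketch under that assumption; the 20 ms threshold is invented for the example, not SoRAC's actual default.

```python
def average_service_time(samples):
    """samples: (disk_response_ms, wait_ms) pairs, one per disk operation.
    Service time = response time plus any wait time, per the definition above."""
    if not samples:
        return 0.0
    return sum(resp + wait for resp, wait in samples) / len(samples)

def asm_service_time_alarm(samples, threshold_ms=20.0):
    """Fire when the average service time exceeds the (assumed) threshold."""
    return average_service_time(samples) > threshold_ms
```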
As you can see clearly below, the two graphs show I/O throughput and service time overhead.
In this last SoRAC article, we have seen that SoRAC is a highly intelligent tool that displays all of this information graphically. Needless to say, it is a very handy tool for DBAs. So, what are you waiting for? Go ahead and try it out!