Oracle RAC Monitoring: Keeping your RAC under control

Brief intro

Our last article covered Extended RAC, a way to spread RAC nodes across a city. Whether your RAC sits in a single building or spans several sites, a prudent DBA always monitors it.

Once an application goes into production, it grows and flourishes. That growth can become a heavy burden for a DBA who is not proactively monitoring the database. It shows up in several ways: disk usage goes up, network bandwidth becomes a bottleneck, transactions start taking too long to commit, and more users work the system more aggressively. This may be good for the business, but the Service Level Agreements (SLAs) still need to be met. Proactively monitoring your Oracle RAC, or even a typical single-node Oracle database, will keep you ahead of problems. To do that, however, you need to know which tools to use.

What are we monitoring?

The questions below can help a DBA streamline routine administration tasks and help management make timely decisions.

1. Are we meeting our Service Level Agreements (SLAs)?

2. Are the High Availability (HA) objectives being met?

3. Are all instances sharing the load evenly?

4. What is the interconnect load (latency/overhead)?

5. CPU: Are Oracle processes getting enough resources?

6. Memory: Is there enough memory for the System Global Area (SGA) and the other Oracle memory structures?

Questions like these, broken down by level (application, database, OS, and hardware), help a DBA monitor the RAC environment efficiently and effectively.

At hardware level: Disks, HBAs, NICs, cabling, backup devices, etc. need to be configured properly and function properly.

At OS level: You need to monitor CPU, memory, disk performance and network traffic.

  • CPU (%idle time, etc.)

  • I/O (queue length)

  • Shared storage

  • Network (both public and private network)

  • Memory (paging, swapping, etc.)

  • Logs (/var/log/messages, etc.)
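As one way to watch the memory item in the list above, here is a minimal Python sketch that computes how much swap is in use from text in the format of Linux's /proc/meminfo. The function names and the sample figures are made up for illustration; on a live node you would read the real file instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_percent(info):
    """Return the percentage of swap in use, or 0.0 if no swap is configured."""
    total = info.get("SwapTotal", 0)
    free = info.get("SwapFree", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


# Illustrative values only, not taken from the article's nodes.
SAMPLE = """MemTotal:      2074628 kB
MemFree:         51200 kB
SwapTotal:     2000000 kB
SwapFree:      1500000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("swap used: %.1f%%" % swap_used_percent(info))
```

A steadily rising swap percentage on a RAC node is usually worth investigating before it starts to affect the instance.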

At Database level: You have to monitor all the cluster logs, event logs, ASM logs and RDBMS logs.

  • Cluster (ORA_CRS_HOME) and all related log files:

    • CRS alert log file

    • CRS logs: log/hostname/crsd

    • CSS logs: log/hostname/cssd

    • EVM logs: log/hostname/evmd and log/hostname/evm/log

    • SRVM logs: log/hostname/client

    • OPMN logs: opmn/logs

    • Resource-specific logs: log/hostname/racg

  • ORACLE_HOME

    • Resource-specific logs: log/hostname/racg

    • SRVM logs: log/hostname/client

  • ASM

    • alert_SID.log : location: ORACLE_HOME/rdbms/log

  • Trace files:

    • bdump – background_dump_dest

    • cdump – core_dump_dest

    • udump – user_dump_dest

    • listener_<NODE>.log : ORACLE_HOME/network/log
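Reading all of the logs listed above by hand gets tedious, so a small scanner can help. The following Python sketch pulls ORA-nnnnn error codes out of log lines. The function name and the sample lines are made up for illustration; in practice you would feed it the lines of an alert log or trace file from the locations above.

```python
import re

# Oracle error codes look like ORA- followed by five digits.
ORA_ERROR = re.compile(r"\bORA-\d{5}\b")


def find_ora_errors(lines):
    """Return (line_number, error_code, text) for each ORA-nnnnn code found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for code in ORA_ERROR.findall(line):
            hits.append((number, code, line.rstrip()))
    return hits


# Made-up sample lines, not copied from a real alert log.
SAMPLE_LOG = [
    "Thread 1 advanced to log sequence 42",
    "ORA-00060: deadlock detected while waiting for resource",
    "Completed checkpoint",
    "ORA-01555: snapshot too old",
]

if __name__ == "__main__":
    for number, code, text in find_ora_errors(SAMPLE_LOG):
        print("line %d: %s" % (number, text))
```

To scan a real file, pass an open file object, for example find_ora_errors(open(path)), where path is whichever log you want to check.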

Application Level:

At Application level, we need to place monitoring code carefully. For instance, if the application server is Tomcat or JBoss, you would be interested in the Catalina logs and in java.util.logging or log4j output. More full-featured tools, such as BIRT, can also be employed to report on your application's performance.

The Toolkit

For the OS you will need the following tools:

  • top: Top Processes

  • ps: status of the processes

  • iostat: I/O stats

  • vmstat: Virtual Memory stats

  • netstat: network stats

  • ifconfig (ipconfig on Windows): check the IP configuration locally on each node

  • ping: utility to ping across nodes

  • traceroute (tracert on Windows): useful for troubleshooting large networks where several paths can be taken to arrive at the same point, or where many intermediate systems (routers or bridges) are involved.

  • nslookup: query DNS to resolve node names and IP addresses

Checking I/O stat

Let’s quickly check that our cluster is online. In our test/development scenario, I changed the disk configuration from RAID 5 to RAID 0 for optimal performance of my RAC nodes. If you have an enterprise version of ESX, you can put all of the VMs on a SAN and not bother with the disk configuration, since you will not need DAS (Direct Attached Storage).

[Figure: Allowing my VMs to start up automatically]

After restarting, run the cluster check.

[oracle@vm01 bin]$ crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.esxrac.db  application    ONLINE    ONLINE    vm02
ora....c1.inst application    ONLINE    ONLINE    vm01
ora....c2.inst application    ONLINE    ONLINE    vm02
ora....serv.cs application    ONLINE    ONLINE    vm02
ora....ac1.srv application    ONLINE    ONLINE    vm01
ora....ac2.srv application    ONLINE    ONLINE    vm02
ora....SM1.asm application    ONLINE    ONLINE    vm01
ora....01.lsnr application    ONLINE    ONLINE    vm01
ora.vm01.gsd   application    ONLINE    ONLINE    vm01
ora.vm01.ons   application    ONLINE    ONLINE    vm01
ora.vm01.vip   application    ONLINE    ONLINE    vm01
ora....SM2.asm application    ONLINE    ONLINE    vm02
ora....02.lsnr application    ONLINE    ONLINE    vm02
ora.vm02.gsd   application    ONLINE    ONLINE    vm02
ora.vm02.ons   application    ONLINE    ONLINE    vm02
ora.vm02.vip   application    ONLINE    ONLINE    vm02
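A listing like the one above is easy to check by eye on two nodes, but harder on a larger cluster. The Python sketch below parses crs_stat -t style text and reports any resource whose State differs from its Target. The function names are hypothetical, it assumes resource names start with "ora." as they do in this listing, and the last sample row has been altered to OFFLINE purely to show a failing case.

```python
def parse_crs_stat(text):
    """Parse `crs_stat -t` style output into a list of resource dicts."""
    resources = []
    for line in text.splitlines():
        fields = line.split()
        # Resource rows start with "ora." and carry at least
        # Name, Type, Target and State; Host may be missing when offline.
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue
        resources.append({
            "name": fields[0],
            "type": fields[1],
            "target": fields[2],
            "state": fields[3],
            "host": fields[4] if len(fields) > 4 else None,
        })
    return resources


def not_at_target(resources):
    """Return the resources whose current state differs from their target."""
    return [r for r in resources if r["state"] != r["target"]]


# First rows follow the article's listing; the last row is altered
# to OFFLINE for illustration only.
SAMPLE = """\
Name           Type           Target    State     Host
------------------------------------------------------------
ora.esxrac.db  application    ONLINE    ONLINE    vm02
ora.vm01.vip   application    ONLINE    ONLINE    vm01
ora.vm02.gsd   application    ONLINE    OFFLINE
"""

if __name__ == "__main__":
    for resource in not_at_target(parse_crs_stat(SAMPLE)):
        print("%s is %s, expected %s"
              % (resource["name"], resource["state"], resource["target"]))
```

Comparing State against Target, rather than against a fixed ONLINE, avoids false alarms for resources that were stopped deliberately.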

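Then run iostat at 3-second intervals. The first report shows averages since the system was booted; each later report covers only the preceding interval. To see the full manual for this excellent tool, type man iostat.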

[oracle@vm01 udump]$ iostat --help
sysstat version 5.0.5
(C) Sebastien Godard
Usage: iostat [ options... ] [ <interval> [ <count> ] ]
Options are:
[ -c | -d ] [ -k ] [ -t ] [ -V ] [ -x ]
[ { <device> [ ... ] | ALL } ] [ -p [ { <device> | ALL } ] ]
[oracle@vm01 udump]$ iostat
Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl)    05/03/2007
avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.49    0.07    5.81    2.32   90.30
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.18         2.60        76.77    2220396   65637664
sda1              9.65         2.60        76.76    2219427   65637656
sdb               0.00         0.00         0.00       2635          0
sdb1              0.00         0.00         0.00       1186          0
sdc               2.75        44.07        33.41   37681732   28563403
sdc1              2.76        44.07        33.41   37678487   28563403
sdd               1.71        10.44        33.41    8926315   28563403
sdd1              1.73        10.44        33.41    8923070   28563403
sde               1.12         1.26        22.43    1079253   19182446
sde1              1.13         1.26        22.43    1076008   19182446
sdf               4.31       502.78         4.18  429902055    3570958
sdf1              4.31       502.78         4.18  429900486    3570958
sdg               7.24      1004.57         5.15  858957930    4407357
sdg1              7.24      1004.57         5.15  858956361    4407357
sdh               1.00         1.00         0.50     858776     428077
sdh1              1.00         1.00         0.50     857207     428077
[oracle@vm01 udump]$ iostat -t 3
Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl)    05/03/2007
Time: 02:37:19 PM
avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.49    0.07    5.81    2.32   90.30
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               2.18         2.60        76.77    2220396   65645200
sda1              9.65         2.60        76.77    2219427   65645192
sdb               0.00         0.00         0.00       2635          0
sdb1              0.00         0.00         0.00       1186          0
sdc               2.75        44.07        33.40   37685412   28565854
sdc1              2.76        44.07        33.40   37682167   28565854
sdd               1.71        10.44        33.40    8926763   28565854
sdd1              1.73        10.44        33.40    8923518   28565854
sde               1.12         1.26        22.43    1079381   19183793
sde1              1.13         1.26        22.43    1076136   19183793
sdf               4.31       502.78         4.18  429949905    3571674
sdf1              4.31       502.78         4.18  429948336    3571674
sdg               7.24      1004.57         5.15  859052465    4408164
sdg1              7.24      1004.57         5.15  859050896    4408164
sdh               1.00         1.00         0.50     858870     428124
sdh1              1.00         1.00         0.50     857301     428124
Time: 02:37:22 PM
avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.00    0.08    6.91    1.00   91.01
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sda1              0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0
sdc               2.00        42.67        13.33        128         40
sdc1              2.00        42.67        13.33        128         40
sdd               0.67         0.00        13.33          0         40
sdd1              0.67         0.00        13.33          0         40
sde               0.67         0.00        13.33          0         40
sde1              0.67         0.00        13.33          0         40
sdf               3.33       343.00         1.00       1029          3
sdf1              3.00       343.00         1.00       1029          3
sdg               6.33       856.33         2.67       2569          8
sdg1              6.67       856.33         2.67       2569          8
sdh               1.33         1.33         0.67          4          2
sdh1              1.33         1.33         0.67          4          2
Time: 02:37:25 PM
avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.83    0.08    5.67    1.75   91.67
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.66         0.00        74.42          0        224
sda1              9.30         0.00        74.42          0        224
sdb               0.00         0.00         0.00          0          0
sdb1              0.00         0.00         0.00          0          0
sdc               1.33        21.26        13.29         64         40
sdc1              1.33        21.26        13.29         64         40
sdd               0.66         0.00        13.29          0         40
sdd1              0.66         0.00        13.29          0         40
sde               0.66         0.00        13.29          0         40
sde1              0.66         0.00        13.29          0         40
sdf               4.32       512.62         1.66       1543          5
sdf1              4.32       512.62         1.66       1543          5
sdg               6.64      1023.26         1.99       3080          6
sdg1              6.31      1023.26         1.99       3080          6
sdh               0.66         0.66         0.33          2          1
sdh1              0.66         0.66         0.33          2          1
Time: 02:37:28 PM

Here sda and sdb are the disks for the OS installation and swap. The disks sdc, sdd and sde are used for the OCR, voting disk and ASM spfile respectively. The disks sdf and sdg are the ones we chose for oradata (where all of our Oracle data files reside), and sdh is for the flash recovery area. You can see that iowait is quite low, which is a good thing; had it been higher, you would be looking at an I/O bottleneck. In the devices section you can see that the cluster is doing fine but the oradata disks are working hard (and rightly so!). That is why, as explained earlier, I optimized the DAS in my test scenario to gain more from spindle speed, seek time and throughput, and why the oradata files are on separate disks.

Once your disks and data are evenly spread out, you can use the -x parameter to get additional useful information, such as average request size, average wait time for requests and average service time for requests.
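The reading of the report described above (check iowait first, then see which disks carry the load) can be sketched in Python. The parser below handles a single basic iostat report in the sysstat 5.x layout shown in this article. The function name and the 20% iowait threshold are assumptions chosen for illustration, and the sample is a subset of the 02:37:22 PM interval.

```python
IOWAIT_LIMIT = 20.0  # hypothetical alert threshold, in percent


def parse_iostat_report(text):
    """Parse one basic iostat report into (cpu_stats, device_stats)."""
    cpu, devices = {}, {}
    cpu_columns, dev_columns = None, None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "avg-cpu:":
            cpu_columns, dev_columns = fields[1:], None
        elif fields[0] == "Device:":
            dev_columns, cpu_columns = fields[1:], None
        elif cpu_columns is not None and len(fields) == len(cpu_columns):
            # The line right after the avg-cpu header holds the values.
            cpu = dict(zip(cpu_columns, map(float, fields)))
            cpu_columns = None
        elif dev_columns is not None and len(fields) == len(dev_columns) + 1:
            # Device rows: name followed by one value per column.
            devices[fields[0]] = dict(zip(dev_columns, map(float, fields[1:])))
    return cpu, devices


# Subset of the 02:37:22 PM interval shown in the article.
SAMPLE = """\
avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.00    0.08    6.91    1.00   91.01
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               0.00         0.00         0.00          0          0
sdc               2.00        42.67        13.33        128         40
sdf               3.33       343.00         1.00       1029          3
sdg               6.33       856.33         2.67       2569          8
"""

if __name__ == "__main__":
    cpu, devices = parse_iostat_report(SAMPLE)
    if cpu["%iowait"] > IOWAIT_LIMIT:
        print("possible I/O bottleneck: iowait %.2f%%" % cpu["%iowait"])
    busiest = sorted(devices, key=lambda d: devices[d]["Blk_read/s"],
                     reverse=True)
    print("busiest readers:", ", ".join(busiest[:2]))
```

If the text contains several reports, this sketch keeps only the values of the last one, so split the output on the "Time:" lines first if you need every interval.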

Let’s run it with the -x parameter.

[oracle@vm01 ~]$ iostat -x
Linux 2.6.9-42.0.0.0.1.ELsmp (vm01.wolga.nl)    05/04/2007
avg-cpu:  %user   %nice    %sys %iowait   %idle
           2.86    0.00   14.76    5.79   76.59
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          8.02  13.87 13.55  7.73 1130.44  172.78   565.22    86.39    61.23     0.34   16.09   4.89  10.41
sdb          2.51   0.00  0.11  0.00    6.22    0.00     3.11     0.00    57.28     0.00    2.89   2.89   0.03
sdc          2.56   0.01  8.72  1.40  276.30   26.56   138.15    13.28    29.94     0.26   25.62   5.41   5.47
sdd          2.59   0.01  7.05  1.40  242.61   26.56   121.30    13.28    31.85     0.23   27.22   6.19   5.23
sde          2.56   0.12  0.55  1.00   19.11   75.21     9.55    37.60    60.91     0.02   15.33  13.81   2.14
sdf         10.18   0.00 16.07  3.45  472.50   21.01   236.25    10.51    25.29     0.06    3.20   3.06   5.97
sdg         10.18   0.00 17.19  3.98  748.88   21.55   374.44    10.77    36.40     0.07    3.30   3.02   6.39
sdh         10.18   0.00  1.01  0.42   87.56    0.69    43.78     0.35    61.40     0.01    7.00   6.97   1.00
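One way to read the extended report is to compare await with svctm. Since await covers both the time a request spends queued and the time spent servicing it, the difference between the two is roughly the time spent waiting in the queue. The Python sketch below computes that difference per device, using three rows from the output above; the function names are hypothetical.

```python
def parse_iostat_extended(text):
    """Parse `iostat -x` device rows into {device: {column: value}}."""
    columns, rows = None, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            columns = fields[1:]
        elif columns is not None and len(fields) == len(columns) + 1:
            rows[fields[0]] = dict(zip(columns, map(float, fields[1:])))
    return rows


def queue_wait_ms(stats):
    """Approximate time a request spends queued: await minus svctm (ms)."""
    return round(stats["await"] - stats["svctm"], 2)


# Three rows taken from the article's iostat -x output.
SAMPLE = """\
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda          8.02  13.87 13.55  7.73 1130.44  172.78   565.22    86.39    61.23     0.34   16.09   4.89  10.41
sdc          2.56   0.01  8.72  1.40  276.30   26.56   138.15    13.28    29.94     0.26   25.62   5.41   5.47
sdf         10.18   0.00 16.07  3.45  472.50   21.01   236.25    10.51    25.29     0.06    3.20   3.06   5.97
"""

if __name__ == "__main__":
    rows = parse_iostat_extended(SAMPLE)
    # List devices with the longest queue wait first.
    for device in sorted(rows, key=lambda d: queue_wait_ms(rows[d]),
                         reverse=True):
        print("%s: queued about %.2f ms per request"
              % (device, queue_wait_ms(rows[device])))
```

On these figures the oradata disk sdf spends almost no time queuing, while sdc queues longest even though its %util is low.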

Conclusion

As you can see, all you need is a good toolkit to monitor your system. We took a brief look at the iostat tool; in upcoming articles we will take a more detailed look at the other tools. There is also a utility called OSWatcher (downloadable from Metalink), which bundles these tools so you can run them together.


Tarry Singh
I have been active in several industries since 1991. While working in the maritime industry I have worked for several Fortune 500 firms such as NYK, A.P. Møller-Mærsk Group. I made a career switch, emigrated, learned a new language and moved into the IT industry starting 2000. Since then I have been a Sr. DBA, (Technical) Project Manager, Sr. Consultant, Infrastructure Specialist (Clustering, Load Balancing, Networks, Databases) and (currently) Virtualization/Cloud Computing Expert and Global Sourcing in the IT industry. My deep understanding of multi-cultural issues (having worked across the globe) and international exposure has not only helped me successfully relaunch my career in a new industry but also helped me stay successful in what I do. I believe in "worknets" and "collective or swarm intelligence". As a trainer (technical as well as non-technical) I have trained staff both on national and international level. I am very devoted, perspicacious and hard working.
