Occasionally bugs creep into code trees and end up in production. Oracle is not immune to this eventuality, and in Oracle Enterprise Manager (OEM) the agents can exhibit some interesting behavior with regard to Perl. The OEM 13c agent, unfortunately, has a bug that causes some Perl processes to consume large amounts of CPU. Let's discuss this and investigate the patch that Oracle has provided.
The OEM agent for the 13c version of Enterprise Manager can, on some platforms, spawn Perl processes that appear to 'run away' with the CPU, triggering CPU alerts in monitoring software and leaving DBAs scratching their heads, especially if the DBA is not familiar with a common UNIX/Linux tool named top. Top reports on various aspects of the system, usually with averaged numbers, giving a reasonably good view of server performance. A typical top session looks like the following:
top - 14:47:11 up 15 days, 21:16, 15 users, load average: 5.71, 5.16, 3.65
Tasks: 514 total, 1 running, 513 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 1.8 sy, 0.0 ni, 48.1 id, 48.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32944620 total, 224868 free, 2903176 used, 29816576 buff/cache
KiB Swap: 4194300 total, 4067628 free, 126672 used. 6357708 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13168 oracle 20 0 10.0g 1.4g 1.3g D 4.0 4.3 10:12.54 ora_dw00_splerb
4965 splunk 20 0 220432 56836 5588 S 1.3 0.2 74:42.96 splunkd
11304 oracle 20 0 9.9g 57820 54976 S 1.3 0.2 2:43.72 ora_vkrm_splerb
574 root 20 0 13228 1700 1492 S 1.0 0.0 22:33.62 rngd
16727 oracle 20 0 160484 4840 3732 R 1.0 0.0 0:00.16 top
11312 oracle 20 0 9.9g 1.6g 1.6g D 0.7 5.1 0:43.54 ora_dbw0_splerb
11373 oracle 20 0 9.9g 229680 216964 S 0.7 0.7 4:10.27 ora_cjq0_splerb
493 root 0 -20 0 0 0 S 0.3 0.0 9:21.23 kworker/1:1H
11276 oracle 20 0 9.9g 57240 54408 S 0.3 0.2 1:02.66 ora_psp0_splerb
11807 oracle 20 0 2189184 322576 4176 S 0.3 1.0 56:43.91 java
29163 oracle 20 0 3657680 85792 81204 S 0.3 0.3 2:10.49 ora_scmn_apnbbu
29187 oracle 20 0 3438312 187604 184436 S 0.3 0.6 1:41.53 ora_ckpt_apnbbu
30400 oracle 20 0 9.9g 61796 59024 S 0.3 0.2 0:01.10 ora_pmon_aecm
1 root 20 0 125752 4836 3108 S 0.0 0.0 18:11.47 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.38 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:40.76 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 9:31.27 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 S 0.0 0.0 8:38.35 rcuos/0
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0
11 root rt 0 0 0 0 S 0.0 0.0 0:01.45 migration/0
12 root rt 0 0 0 0 S 0.0 0.0 0:04.62 watchdog/0
13 root rt 0 0 0 0 S 0.0 0.0 0:04.34 watchdog/1
14 root rt 0 0 0 0 S 0.0 0.0 0:01.42 migration/1
15 root 20 0 0 0 0 S 0.0 0.0 0:40.30 ksoftirqd/1
17 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
18 root 20 0 0 0 0 S 0.0 0.0 2:30.77 rcuos/1
19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/1
20 root rt 0 0 0 0 S 0.0 0.0 0:06.24 watchdog/2
21 root rt 0 0 0 0 S 0.0 0.0 0:01.95 migration/2
22 root 20 0 0 0 0 S 0.0 0.0 3:39.80 ksoftirqd/2
24 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
25 root 20 0 0 0 0 S 0.0 0.0 2:39.00 rcuos/2
26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/2
27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 khelper
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
The first five lines of the report give an overview of system statistics and performance: the uptime, the load averages over 1, 5 and 15 minutes, the total number of tasks, CPU usage (refreshed every few seconds by default), and the memory and swap used and available. The remainder of the output lists the top processes as they execute, and the process at the top of the list usually changes with each refresh of the screen. It is not abnormal for one process to stay at the top for two or more refreshes, but a process that remains at the top through many screen refreshes is usually suspect. Such processes often show CPU usage of 80% or higher, with 99% not unheard of for processes 'wildly' consuming CPU.
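When the culprit is hard to spot in the default display, top can be asked to sort by CPU and to restrict the listing to a single user. A minimal sketch, assuming a reasonably recent procps-ng top (older builds may not accept the -o switch):

# Interactively show only the oracle user's processes, sorted by CPU
top -u oracle -o %CPU

# Batch mode: capture a single snapshot, useful for attaching to a ticket
top -b -n 1 -o %CPU | head -20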
In the case of the OEM agent for version 13c, a single Perl process can 'roost' at the top of the process list at a CPU usage of 99%. In such a case the top output could appear as shown below:
top - 14:47:11 up 15 days, 21:16, 15 users, load average: 5.71, 5.16, 3.65
Tasks: 514 total, 1 running, 513 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.1 us, 1.8 sy, 0.0 ni, 48.1 id, 48.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32944620 total, 224868 free, 2903176 used, 29816576 buff/cache
KiB Swap: 4194300 total, 4067628 free, 126672 used. 6357708 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22869 oracle 20 0 10.0g 1.4g 1.3g D 99.7 4.3 0:02.47 perl
4965 splunk 20 0 220432 56836 5588 S 1.3 0.2 74:42.96 splunkd
11304 oracle 20 0 9.9g 57820 54976 S 1.3 0.2 2:43.72 ora_vkrm_splerb
574 root 20 0 13228 1700 1492 S 1.0 0.0 22:33.62 rngd
16727 oracle 20 0 160484 4840 3732 R 1.0 0.0 0:00.16 top
11312 oracle 20 0 9.9g 1.6g 1.6g D 0.7 5.1 0:43.54 ora_dbw0_splerb
11373 oracle 20 0 9.9g 229680 216964 S 0.7 0.7 4:10.27 ora_cjq0_splerb
493 root 0 -20 0 0 0 S 0.3 0.0 9:21.23 kworker/1:1H
11276 oracle 20 0 9.9g 57240 54408 S 0.3 0.2 1:02.66 ora_psp0_splerb
11807 oracle 20 0 2189184 322576 4176 S 0.3 1.0 56:43.91 java
29163 oracle 20 0 3657680 85792 81204 S 0.3 0.3 2:10.49 ora_scmn_apnbbu
29187 oracle 20 0 3438312 187604 184436 S 0.3 0.6 1:41.53 ora_ckpt_apnbbu
30400 oracle 20 0 9.9g 61796 59024 S 0.3 0.2 0:01.10 ora_pmon_aecm
1 root 20 0 125752 4836 3108 S 0.0 0.0 18:11.47 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.38 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:40.76 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 9:31.27 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root 20 0 0 0 0 S 0.0 0.0 8:38.35 rcuos/0
10 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/0
11 root rt 0 0 0 0 S 0.0 0.0 0:01.45 migration/0
12 root rt 0 0 0 0 S 0.0 0.0 0:04.62 watchdog/0
13 root rt 0 0 0 0 S 0.0 0.0 0:04.34 watchdog/1
14 root rt 0 0 0 0 S 0.0 0.0 0:01.42 migration/1
15 root 20 0 0 0 0 S 0.0 0.0 0:40.30 ksoftirqd/1
17 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
18 root 20 0 0 0 0 S 0.0 0.0 2:30.77 rcuos/1
19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/1
20 root rt 0 0 0 0 S 0.0 0.0 0:06.24 watchdog/2
21 root rt 0 0 0 0 S 0.0 0.0 0:01.95 migration/2
22 root 20 0 0 0 0 S 0.0 0.0 3:39.80 ksoftirqd/2
24 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
25 root 20 0 0 0 0 S 0.0 0.0 2:39.00 rcuos/2
26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcuob/2
27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 khelper
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
The Perl process, which collects configuration data from the Oracle Home of a database, is consuming excessive amounts of CPU and may need to be killed to clear the CPU usage alert. This is the result of bug 26113746, and a patch is available through Oracle Support. Unfortunately, the patch can't be applied to the agents without first applying a different patch, 25163555, to the Middleware home of the OEM server; that patch fixes an issue where the agent patch causes the Agent Upload process to fail to update the OEM metadata because the 13.2 metadata type is missing. Both patches are required to fix the Perl issue in the OEM 13c agent.
Applying those patches will not be discussed here as that information can be found in the following documents:
https://support.oracle.com/epmos/faces/DocumentDisplay?id=2243040.1&displayIndex=7#CAUSE
https://support.oracle.com/epmos/faces/DocumentDisplay?id=2249682.1&displayIndex=3#FIX
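As a quick check before and after patching, opatch can list what is installed in each home. A hedged sketch (the $AGENT_HOME and $MW_HOME variables are placeholders for your actual agent and Middleware homes):

# On the monitored host: is the agent-side fix installed?
$AGENT_HOME/OPatch/opatch lsinventory | grep 26113746

# On the OEM server: is the prerequisite patch installed in the Middleware home?
$MW_HOME/OPatch/opatch lsinventory | grep 25163555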
Unexpected CPU spikes are disconcerting, to say the least. When they are the result of a rogue Perl process (usually a very tame executable), things can get exasperating, since top doesn't display much in the way of useful information. Querying V$SESSION is where the DBA needs to begin, as the relevant information is found in the MODULE column. EM agent sessions will be listed with a MODULE of 'emagent_SQL_oracle_database' and should map to the offending O/S process (join V$SESSION.PADDR to V$PROCESS.ADDR to return the relevant SPID):
SQL> select spid from v$session, v$process where paddr = addr and module = 'emagent_SQL_oracle_database';
SPID
------------------------
22869
3788
SQL>
Knowing that the errant process is associated with the EM agent makes it much easier to kill (it won't be a required database background process or a user session). Unfortunately, until the patches are applied, killing the process is the only way to free up the CPU and 'solve' the performance issue.
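A minimal sketch of that manual cleanup, using the first SPID returned above (22869 comes from the sample output; always confirm the command line belongs to the agent's Perl collector before killing anything):

# Confirm the process really is the agent's Perl collector
ps -fp 22869

# Terminate it; escalate to kill -9 only if it ignores the default SIGTERM
kill 22869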
This problem doesn't affect all systems, but it is considered a generic bug and is treated as such. Knowing that something like this COULD affect you puts you in a better position to address it immediately should the need arise.