Describing a Problem
The test
database ARTIST is a multinode DB2 database, ver. 7.2 on Sun Solaris. It is a multinode
DB2 EEE database with 3 nodes.
The customer
is complaining about a performance problem with the database. The system is
very slow and every regular, routine process takes a long time to finish. Let's
look at the situation on the system:
# top
575 processes: 549 sleeping, 16 running, 1 zombie, 9 on cpu
CPU states: 0.0% idle, 94.6% user, 5.4% kernel, 0.0% iowait, 0.0% swap
Memory: 16G real, 6336M free, 11G swap in use, 5538M swap free
PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND
18864 db2udb1 4 10 0 1145M 1091M cpu/1 22:17 6.32% db2sysc
4891 db2udb1 4 10 0 1154M 1096M cpu/6 20:55 6.04% db2sysc
17753 db2udb1 4 10 0 1154M 1096M run 22:22 6.00% db2sysc
4883 db2udb1 4 10 0 1154M 1096M run 20:37 5.97% db2sysc
9571 db2udb1 4 20 0 1145M 1091M run 27:12 5.96% db2sysc
9552 db2udb1 4 10 0 1145M 1091M cpu/4 32:44 5.96% db2sysc
27413 db2udb1 4 11 0 1145M 1091M cpu/2 2:57 5.83% db2sysc
9603 db2udb1 4 10 0 1153M 1095M cpu/5 21:58 5.74% db2sysc
20471 db2udb1 4 20 0 1144M 1091M cpu/3 2:54 5.61% db2sysc
28498 db2udb1 4 0 0 1145M 1091M run 21:55 5.60% db2sysc
20564 db2udb1 4 10 0 1144M 1090M run 2:53 5.55% db2sysc
20536 db2udb1 4 10 0 1144M 1091M run 2:55 5.53% db2sysc
20469 db2udb1 4 10 0 1144M 1090M run 2:53 5.41% db2sysc
27452 db2udb1 4 11 0 1145M 1091M cpu/4 2:58 5.40% db2sysc
26419 db2udb1 4 10 0 1145M 1091M run 3:01 5.35% db2sysc
Output of
<<top>> UNIX program will generate a list of running processes on the machine,
with high CPU consumer on the top of the list. On our system, we have a
serious problem with db2 processes, which are using all of the CPU power (0.0%
Idle). From this output, we cannot identify from which node the request is coming.
To discover node information for the running database process we have to use the
following command:
>> /usr/ucb/ps -gauxwwr
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
db2udb1 16401 6.8 7.011515281116720 ? R 16:00:35 1:09 db2agntp (ARTIST) 2
db2udb1 16410 6.8 7.011515201116384 ? R 16:00:35 1:08 db2agntp (ARTIST) 2
db2udb1 7308 6.5 7.011720001118488 ? O 02:37:56 2:51 db2agntp (ARTIST) 2
db2udb1 9579 6.5 7.111720081132232 ? R 11:41:19 5:21 db2agntp (ARTIST) 1
db2udb1 17785 6.4 7.011720081118320 ? O 11:24:34 1:17 db2agntp (ARTIST) 1
db2udb1 17763 6.4 7.011858241129736 ? R 11:24:34 2:14 db2agntp (ARTIST) 1
db2udb1 29885 2.4 6.811318081097176 ? R 05:28:48 8:28 db2pfchr 2
db2udb1 29883 2.2 6.811318001098376 ? O 05:28:48 7:59 db2pfchr 1
db2udb1 29866 0.8 6.811318081097016 ? O 05:28:47 7:47 db2pfchr 1
db2udb1 9551 0.6 7.011817201123160 ? O 11:41:19 0:50 db2agntp (ARTIST) 2
db2udb1 29867 0.6 6.811318081096944 ? O 05:28:48 5:31 db2pfchr 1
db2udb1 18450 0.4 7.011656481118216 ? R 16:01:16 0:02 db2agent 2
The process
is coming from database partitions 1 and 2. The most critical processes are:
- db2agntp - Active subagent, used for SMP parallelism.
- db2pfchr - Prefetcher, making I/O to get data to db2agent
- db2agent - DB2 agent, handles requests from applications
This
situation leads to the conclusion that some big, <<killer>> query has been
started. Query runs in parallel, using several agents to process a result data
set.