Performance Problem *Slow Database*
As an example, a customer called because
there was a problem with database performance. The machine was a SUN Solaris 8
CPU with 8 database partitions, version 7 with FixPack 3. It was an OLTP type database.
The machine was projected to handle more customer connections than were actually
connected. A snapshot of the system resources showed high CPU utilization.
$ vmstat 1 10
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
0 5 160319 23319 0 0 0 0 0 0 475 5113 920 20 26 0 54
2 3 160319 23240 0 0 0 0 0 0 541 4791 929 23 24 0 53
1 3 160319 23250 0 0 0 0 0 0 466 5219 842 16 25 0 59
2 3 160319 23213 0 0 0 0 0 0 439 4151 868 21 29 0 50
5 2 160319 23204 0 0 0 0 0 0 509 6069 829 30 21 0 50
0 3 160319 23181 0 0 0 0 0 0 455 6385 865 28 18 0 54
0 3 160319 23183 0 0 0 0 0 0 456 6472 849 25 35 0 40
3 2 160319 23235 0 0 0 0 0 0 382 5346 605 26 23 0 51
2 4 160176 23404 0 0 0 0 0 0 407 6537 624 34 26 0 40
The memory situation was good, no
swapping. Disk IO was very low, so the database had no problem working with the
disk subsystem. We had a problem with high CPU utilization. On the 8 CPU system
(idle time) is 0. So, what could cause that high CPU usage?
In most cases, the reason for high CPU utilization is badly written SQL application
code or a database locking problem. It is not unusual for an overloaded machine to
suffer from both. Frequently, non-optimal SQL code will cause a locking problem.
List of the database locking parameters:
>> db2 get db cfg for ARTIST | grep LOCK
Max storage for lock list (4KB) (LOCKLIST) = 50
Percent. of lock lists per application (MAXLOCKS) = 10
Lock timeout (sec) (LOCKTIMEOUT) = -1
Interval for checking deadlock (ms) (DLCHKTIME) = 10000
List of the database commit parameters:
Group commit count (MINCOMMIT) = 1
Percent log file reclaimed before soft chckpt (SOFTMAX) = 100
A locking information snapshot:
Lock list memory in use (Bytes) = 2952 bytes
Lock Escalations = 10023
Exclusive Lock Escalations = 1322
Locks Held Currently = 43
Current Applications Waiting on Locks = 6
Lock waits = 2322333
Time database waited on locks (ms) = 38447777
Average wait time per lock = 233.00 ms
Connections to the monitored database:
db2 => LIST APPLICATIONS
Auth Id Application Appl. Application Id DB # of
Name Handle Name Agents
-------- -------------- ---------- ------------------------------ -------- -----
APPEASY ARTIST101.exe 46 AA34000A.0BBA.030716115358 ARTIST 1
APPEASY ARTIST101.exe 17 AA34000A.0BCD.030716115401 ARTIST 1
APPEASY ARTIST101.exe 13 AA34000A.0BBE.030716115359 ARTIST 1
DB2INST1 db2bp_s 133 *LOCAL.db2inst1.030716082332 ARTIST 1
Checking database log file db2alert.log
for possible errors:
2003-07-16-10.30.10.642353 Instance:db2inst1 Node:003
PID:12255(db2agntp (ARTIST) 3) Appid:*N0.db2inst1.030716072626
data_management sqldEscalateLocks Probe:2 Database:ARTIST
-- Lock Count, Target : 113791, 56895
-- Table (ID) Name : (123;5) APPEASY.VRES111
-- Locks, Request Type : 113769, X
-- Result (0 = success): 0
-- Lock Count, Target : 113784, 56892
-- Table (ID) Name : (37;2) APPEASY.TRES222
-- Locks, Request Type : 113781, S
-- Result (0 = success): 0
2003-07-16-16.13.28.046685 Instance:db2inst1 Node:004
PID:60794(db2agent (ARTIST)) Appid:*LOCAL.db2inst1.030627061228
data_management sqldEscalateLocks Probe:3 Database:ARTIST
-- Lock Count, Target : 1141, 570
-- Table (ID) Name : (2;39) APPEASY.MESSAGES
-- Locks, Request Type : 514, X
-- Result (0 = success): FFFF8502 -> DEADLOCK
2003-07-16-16.15.50.070572 Instance:db2inst1 Node:002
PID:60794(db2agent (ARTIST)) Appid:*LOCAL.db2inst1.030708151448
data_management sqldEscalateLocks Probe:3 Database:ARTIST
-- Lock Count, Target : 631, 574
-- Table (ID) Name : (2;69) APPEASY.PAYMENTS
-- Locks, Request Type : 500, X
-- Result (0 = success): FFFF8502 -> DEADLOCK
2003-07-16-17.17.52.015357 Instance:db2inst1 Node:000
PID:23998(db2agent (ANLDEB2)) Appid:*LOCAL.db2inst1.030709090853
data_management sqldEscalateLocks Probe:3 Database:ARTIST
-- Lock Count, Target : 1140, 570
-- Table (ID) Name : (2;39) APPEASY.ACCOUNTS
-- Locks, Request Type : 509, X
Result (0 = success): FFFF8502 ' DEADLOCK
The database log file had numerous deadlock occurrences. The
customer had already complained that end user's applications timed out from
time to time, with the following errors:
[DB2/6000] SQL0911N The current transaction has been rolled back because of a deadlock
or timeout. Reason code "2". SQLSTATE=40001
General
conclusion:
The
system was overloaded with only 3 database connections. In the database log
file there was a good deal of information about lock escalations and deadlock
situations. A deadlock
is an indication that:
applications
have contention problems caused by lock escalations
application
was built with wrong isolation level
application
lock ordering is not designed correctly
in very rare
cases there might be a problem with catalog tables that are locked for
repeatable read
This was
my first guess after getting this initial information.
Before
taking any further action, we decided to take a deeper look at the database. An
agreement was reached with the customer to collect information from the database
for one day. The collected information was converted to graphical charts using
Quest Central for DB2. The locking information points to the actual problem:

The database had
an average of 60 database locks for all connected applications. This was rising from time to time to 100 locks. We had only 3
applications running against the database.
It is not unusual to expect a thousand locks
held with more than 30 connected applications, which is what the customer
wanted in the first place.
The lock information can be collected on an
individual agent level using "snapshot for locks." For example, getting
lock info for the application handle number 13:
db2 => LIST APPLICATIONS
Auth Id Application Appl. Application Id DB # of
Name Handle Name Agents
-------- -------------- ---------- ------------------------------ -------- -----
APPEASY ARTIST101.exe 46 AA34000A.0BBA.030716115358 ARTIST 1
APPEASY ARTIST101.exe 17 AA34000A.0BCD.030716115401 ARTIST 1
APPEASY ARTIST101.exe 13 AA34000A.0BBE.030716115359 ARTIST 1
DB2INST1 db2bp_s 133 *LOCAL.db2inst1.030716082332 ARTIST 1
db2 > GET SNAPSHOT FOR LOCKS FOR APPLICATION AGENTID 13
This command will produce a
report for all of the holding locks by an agent number 13.
Other useful information, "Average
lock wait time" was collected:

Average lock wait time was from 250 ms to 4.000 ms. It was
very high and not suitable for a multi-user concurrent system. Locking too many
resources made the UNIX machine overload and the applications locked in waiting
to get a lock. Even adding additional CPU units will not solve this problem.
Database
and Application Tuning
We are
going to solve this problem, in a conversation with the application developers.
They need to provide the DBA with information about the target database load
and we need to convince them to change their SQL code and the way that they
lock database objects.
The
statement from the application developers regarding the target database load:
"We suppose that a maximum of 35
concurrent programs will attack the database and each will process around 20000
records from the customer table."
During database observations, we had
noticed high "Exclusive escalations," which was equal to the
total number of escalations in the database. An exclusive escalation is when
the locks have been escalated from one or several row locks to one exclusive
table lock. It is controlled by the database's internal mechanism. Normal lock
escalations are controlled by the application code. The problem was mainly too
small of a lock list size for the current number of running applications or,
one or more applications are using an excessive number of the locks.
Very basic calculations for the lock parameters:
1 connections -> 20.000 records = 20.000 row lock x 72
bytes = 1,5 MB
35 connections -> 1,5 MB x 35 users =52 MB + additional
space for the internal processing and the other users, up to the 60MB total.
To solve the problem with the high lock escalations, we
decided to change the
locklist size from 200KB to 60MB.
Max storage for lock list (4KB) (LOCKLIST) = 50 -> 15000
Percent. of lock lists per application (MAXLOCKS) = 10 -> 20
The parameter "Maxlock" was changed from 10% (20
KB) to 5% (3MB), so one application could hold 3MB of lock space before starting
the lock escalation procedure. After these changes, we had eliminated a 'lock escalation' where DB2 replaces several 'row locks' with the
one 'table lock'.
Additional improvements were implemented
in the application code by the developers:
The situation with deadlocks was
discussed and managed with the following changes:
Lock timeout (sec) (LOCKTIMEOUT) = -1 -> 60
Interval for checking deadlock (ms) (DLCHKTIME) = 10000
The application will wait a maximum of 60
seconds to lock the resource. Deadlock timeout will resolve a deadlock
condition after 10 seconds. This will bring down high CPU utilization. The application
was changed to handle exceptions regarding inability to aquire lock in limited
time of 60 seconds.
The commit behavior was also changed with the following parameter:
Group commit count (MINCOMMIT) = 1 -> 3
This change will force the grouping of three commit
requests into one write. This will manage optimal writing for COMMMIT records
to disk without additional overhead. We decided to do that, since numerous
applications would request a commit within a very short time.
The developers
changed the application code for one critical batch process. Now they are
executing the job with additional DB2 connections, separating jobs between
connections and keeping rows locked for less time than before.
The database
was restarted and statistics again collected. We had no more Deadlocks and the Average lock wait
time was drastically reduced from 4.000 to 250 ms.
Newly generated statistics were much better:

The
Average lock wait time is still high, but no longer critical. CPU utilization
was dumped to a normal level of 25% and all connected users were satisfied with
performance.
Lock
situation overview after changes:


The locks
are shorter and the application behaves faster. Commit handling
implemented in the program code worked much better. Commits were made more frequently, reducing the time
locks were held on updated rows. Packages were rebinded with isolation level (CS) and locksize (PAGE) was changed to locksize (ROW). The batch programs were
implemented with RELASE(COMMIT) instead of the RELASE(DEALLOCATE) option specified during the BIND of the
package. The database log
file db2alert.log has no new entries regarding the deadlock problem, just
regular row lock escalation from time to time. The customer is satisfied and the
problem is solved.
Conclusion
Is it
realistic to expect the database engine to solve all our application problems?The database
is recently enhanced with the self-tuning and auto-management features but
still has no ability to rewrite all user, non-efficient application code. It is
amazing, how many sites running versatile DB2 applications, still are not capable
of running them concurrently.The only
solution that I see is close work between the application team and the
production DBA.
»
See All Articles by Columnist Marin Komadina