Performance Tune IBM DB2 z/OS Applications using Resource Constraint Analysis

For the DB2 for z/OS professional the two most common systems tuning
scenarios are tuning a DB2 data sharing group or tuning a series of application
SQL statements. The data sharing group environment can involve multiple
hardware installations and many other cross-system features and functions such
as coupling facilities and management policies. Resource constraint analysis is
a useful tool in both situations.

Resource constraint analysis involves monitoring, analyzing, and managing
resources. It can be done at many levels, from resource management across the
IT enterprise down to that of a single DB2 subsystem or instance or even to a
specific SQL statement. It can be done on many levels simultaneously.

This method of resource management can be done independent of hardware
platform or DB2 flavor (z/OS or Linux / Unix / Windows).

This article will concentrate on tuning a single DB2 for z/OS subsystem and
its applications. This is the ultimate goal; however, to get there we must both
understand overall system resource usage and address any current resource
constraints. Only then can we take a system-wide view in an efficient manner.

First Steps

There are many ways to get an introduction to the basics of systems tuning
including IBM and other vendor training, conference presentations, IBM manuals,
RedBooks, and so forth. Along with this knowledge, you will need some common
tools and techniques. Here are the initial tasks you must perform.

Step 1: Select Resources and Monitoring Tools

Begin by reviewing your monitoring and performance reporting tools. Most
tools have ways of measuring resources. Some typical resources and measurements
you want to consider are the following.

CPU cycles. This includes CPU usage by the DB2 address spaces and CPU
usage by thread and connection type. On an application level, these times are
displayed in the Accounting Short Report of the IBM DB2 Performance Expert for
z/OS (TM) as "Class 2 CPU Time".

One common metric is CPU Time per DML Statement. This can be
calculated by dividing the Class 2 CPU Time by the number of DML statements
issued by the application. Further refinements may be made by dividing the CPU
time by number of inquiry statements (Select, Fetch), or assigning weights to
DML statement types.

Central storage (memory). This includes memory usage, especially in
the DBM1 address space.

Disk storage (I/O). This includes time for I/Os, device-specific
reporting, random miss ratio, system miss ratio, time per sync I/O, and time
per GetPage. One common metric is I/O Wait Time per DML Statement. Much
of the I/O wait time appears in one or more entries on the Accounting Long
Report of the IBM DB2 Performance Expert for z/OS (TM) under Class 3
Suspensions, with the most relevant being Synchronous I/O Wait. Another common
metric is Asynchronous Prefetches per DML Statement, calculated from the
Total Prefetch count on the Accounting Short Report.

Object access (locking, thread wait, enqueues). This includes lock/latch
waits and in-DB2 and not-in-DB2 wait times. Many of these wait times are
reported on the Accounting Long Report under Class 3 Suspensions.

Application throughput. This includes SQL statements per unit of time
and application DB2 Class 2 elapsed time. Some of these measurements can be
derived from the Accounting Reports, while others are usually developed via
user-written algorithms that analyze job output messages or application-created
performance measures.

The IBM Resource Measurement Facility(TM) is an excellent tool for measuring
overall DB2 subsystem CPU, memory, and DASD resources, especially when used
with the RMF Spreadsheet Reporter. See the IBM web site reference at the end of
this article for more information.

Step 2: Choose Key Areas of Inquiry

Next, consider the various areas where you will concentrate your initial
attention. Base these choices on issues that are either urgent or important.

Urgent issues are those you must react to immediately or in the near
term; for example, your nightly batch cycle is running too long and interfering
with morning on-line transactions.

Important issues are more long-range and have the potential to
negatively affect budgets, service levels, or customer satisfaction. For
example, you may be concerned that quickly rising CPU usage will force your
department to add processor capacity sooner than anticipated.

Step 3: Choose Levels for Initial Measurements

Systems tuning seems to imply an emphasis on only system-level objects like
the DBM1 address space. I recommend that your first measurements and tuning
concentrate on critical objects at multiple levels. This will give you
experience in creating metrics, implementing period monitoring, and analyzing
results. It also has the advantage of allowing you to find "resource
hogs" that can be addressed to get immediate improvements.

Critical Tables and Indexes. Based on your own criteria determine
your most critical table and index objects. These may be your largest, the most
volatile, the most heavily-referenced, or those accessed by one or more
business-critical applications. They may also be the ones that may benefit most
from the table-based partitioning introduced in DB2 for z/OS V8, where you can
partition based on one key and cluster data based upon another.

With the critical objects identified, use the DB2 catalog (or other means)
to determine what application SQL statements access these objects. Concentrate
some specific monitoring on these statements.

Critical Applications. On another level, determine your most critical
applications and set up performance monitoring for them. These may be
business-critical, such as order entry or on-line customer inquiry
applications, or mission-critical such as an enterprise resource planning (ERP)
suite.

Critical Subsystem Components. Your choice of subsystem components to
measure and tune will depend in large part upon your environment. As companies
and IT enterprises vary greatly across industries there is no one set of
components that can be deemed most important. I recommend reviewing the DB2
performance Redbooks to determine where to concentrate your attention.

Typical examples of subsystem components to monitor are the virtual pool
sizes and thresholds (including the global buffer pools in a data sharing
environment), DBM1 address space memory usage (if you have virtual storage
constraint issues), DBM1 address space I/O performance, and CPU usage
attributed to enclaves initiated by the distributed data facility (DDF).

Critical Environments. The next level up involves environments. In
the z/OS environment, this usually means DB2 subsystems or data sharing groups.
Begin with your most critical environment (again based on your own criteria).
It may be an on-line production system, a data warehouse, or even your nightly
batch cycle.

Each type of environment has its own resource performance profile. For
example, a data warehouse will usually involve a heavy I/O load due to queries
during its interrogation period (the on-line day), but heavy I/O load due to
extract-transform-load processes during its update period. For another example,
consider a long-running batch cycle. In this case, the most common resource
constraint is time. Hence, performance management usually involves trading CPU
and I/O resources in order to shorten job elapsed times.

Step 4: Implement Period Monitoring and Reporting

After you have familiarized yourself with your resource measurement toolset
and chosen areas to concentrate your efforts you then construct and implement
regular resource period monitoring. This is monitoring and gathering data about
specific categories of objects over time. The purpose is to allow you to
analyze objects in the context of their time dependence (if any) to resource
constraints.

Perform Resource Constraint Analysis

Now, based on your measurements, follow this procedure.

Use the period monitoring reports to identify resource constraints.
For example, you determine that CPU usage during the on-line day is approaching
100% in a critical production region.

Use the reports to identify resources that are available or underutilized.
At the same time as CPU usage is high, you determine that GetPage rates are
relatively low.

Analyze resource usage looking for balancing opportunities. You will
be looking to reduce or eliminate a resource constraint by re-directing the
application to use a different resource. For example, you determine that you
have a CPU constraint during your nightly batch window. Upon further analysis,
you note that many applications access a large partitioned tablespace using CPU
parallelism. You can reduce the CPU constraint by inhibiting parallelism for
those applications; however, the applications may run longer, thus reducing
application throughput.

As another example, you measure I/O service times for a critical table and
find them to be excessive. You can reduce this I/O bottleneck by allocating a
virtual pool exclusively for this object. The trade-off is faster I/O (as
successive references to the table will most likely find it in memory) but
greater use of central storage.

So, by identifying available resources (application throughput and central
storage in the above examples), you can develop strategies for reducing
resource bottlenecks. Some possibilities might be:

Analyze critical tables and indexes; consider them to be
constraints and develop ways of using excess CPU, DASD, or other resources
to reduce contention
Analyze critical applications; consider them to be
constraints and develop methods of using excess resources to increase
application throughput
Analyze the DB2 subsystem; determine the resource
bottlenecks and develop techniques for using excess resources to relieve
the constraints

Summary

Performance tools, autonomics and outsourcing are already with us, and IT
shops are beginning to look at infrastructure costs with an eye to reducing
expenses. Performance tuning is a likely candidate to be reassigned to less
experienced personnel, leaving the experienced database technician with few
options.

Systems tuning provides us with an opportunity to dramatically reduce costs.
By beginning with basic performance measurements and constraint analysis, we
can find and correct resource hogs, re-allocate underutilized resources, and be
more proactive about future resource usage.