We
are finally getting back to where we should be: designing databases right the first
time. There is, however, a small problem: What do we do with current
implementations? How do we backfill quality into them?
In
the context of database design what does quality mean?
- Defect prevention during design
- Defect detection and remediation during construction and testing
- Data verification as close to the source as possible
- Stability, reliability, data accessibility, systems scalability
The ideals of quality assurance for IT
systems delivery have weakened to something more akin to "mediocrity
prevention". Quality is sacrificed so that critical IT functions can be
rolled out on time, and within budget. This means a shift in focus from
development to support.
Many more errors will now be detected
after system implementation, thus putting more pressure on database
administration support staff. The cost of addressing these errors may not be
charged back to systems development, leading to invalid cost estimates during
database design.
The result: there will be a major
shift in the workload of DBAs. They will no longer have time for software
installation and customization, performance monitoring and tuning, data backup
and recovery, and other tasks that contribute to quality. Instead, they will
spend more and more time fixing errors. Errors that should have been prevented,
detected, or corrected during database design.
How Does This Affect the DBA?
The lack of quality in recently
delivered systems affects those in technical support positions the most. They
are the ones that must deal with system problems in real-time. This drives up
the cost of problem determination and problem resolution. It also manifests
itself in product behaviors that may annoy … or drive away customers.
The most common problems encountered
by the DBA after an implementation are either performance-related or
contention-related. That is, database processes (like processing SQL) run too
slowly, or one process contends with another for access to data.
Typical fixes by the DBA in this
reactive situation include changing or adding indexes, reorganizing tablespaces
and indexes, and changing tablespace partitioning or clustering to mitigate
contention. Many of these could have been engineered into the database design prior
to implementation.
How Do We Introduce Quality?
There are several reactive methods for
introducing quality into currently existing databases. Initial steps should
concentrate on the following.
First, DBAs should coordinate and
collect lists of frequent problems and questions that they encounter.
Management can then categorize and analyze these problems, perhaps noting
trends. For example, frequent complaints about lack of authority to execute
certain functions may indicate an underlying security issue. Frequent errors
that indicate database unavailability may indicate network, space allocation,
or server issues.
Next, expand on this by ensuring you
are using a good problem tracking system. This helps the support staff focus on
tasks and priorities, while providing management with hard data on resource
usage. How much time are the support staff spending fixing application issues
when they could (or should) be focusing on systems performance, server
maintenance, or network connectivity?
Next, use your tools to report the
support costs to the development areas. This is not to be used (at least
initially) as a chargeback mechanism; instead, you are making support costs
more visible to management. IT management must be made aware that fixing
application problems after implementation is more costly than fixing them in
either the testing or design phases.
Reengineering Legacy Systems
Given a legacy system, how does one
approach reengineering it to improve quality?
Systems engineering, including most
software development methodologies, concentrate on error prevention. This is
done by using quality metrics and processes during the early phases. In some
cases, quality is such a high priority that the systems are designed in part to
be self-correcting. Systems exhibiting such behavior are termed autonomic.
In our case, we already have a system
implemented. Here, correcting errors is mostly reactive, and assumes that you
can actually detect errors when they happen (or soon thereafter). Since we are
unable to re-design the system, we must concentrate on increasing the quality
of error detection and reporting.
In one sense, the advent of database
management systems and communications systems such as DB2, CICS, and MQSeries
has made error detection much easier. Before these systems arrived, the most
typical error was one of total application failure, accompanied by a memory
dump.
Most of these error situations were
fatal (physical I/O errors, dataset full, division by zero, and so forth).
However, now even so-called fatal errors are reported back to the application
in the form of error codes. For example, a DB2 application encountering a
physical I/O error receives a specific SQLCode and additional status
information.
So, increasing the quality of error
detection and reporting is relatively straightforward in these cases. Based on
your previous compilations of frequently encountered errors and problems:
- Document the most common error codes
-
Document the additional status information returned to the
application -
Determine how (and to whom) such errors should be reported, and
what information is most useful -
Use this as a basis for either updating or designing one or more
standard error processing modules - Embed these modules in your legacy applications.
Too simple, you say? Too costly to
implement? Compare this cost against the time and resources spent doing problem
determination and resolution. Emphasize to your management that this approach
uses information you already possess, deals directly with the most common
problems, and has the potential to detect future problems as they occur, thus
speeding up problem resolution.
What Next?
With better error detection and
correction processes in place, the next step is the expansion of the standard
error processing modules to include errors that you have not encountered yet.
The complete list of error codes exists in the production documentation. A few
hours spent researching possible errors codes will result in an expanded list
of errors that applications can detect and report.
Too much work, you say? Consider the
delays you currently experience: A user encounters a symptom, which must be
reported to someone who determines the underlying problem, who must then
contact someone to fix the problem. Depending upon how unspecific the symptom
is, problem determination may take several hours. For example, what if the
customer gets a message "application unavailable". What does that
mean? Who should be told? What do you fix?
Contrast this with a standard error
module intercepting a specific error code. For example, the standard error
module receives a DB2 code indicating that a database is full. Based upon
criteria in the module, it may send an e-mail or page to support staff, or even
display a message on a support console. Errors are detected faster, they
contain sufficient information to define the problem, and they are routed to
someone responsible for a fix.
Now that’s quality.
Postscript: The Future
IT shops are now returning to a
process they left long ago: systems analysis and design using software
development methodologies and tools that incorporate quality metrics and
processes. Given the changes in tools, personnel, and priorities, how can you
ensure that you are developing quality systems?
I recommend implementing best
practices based on the Capability Maturity Model™. This is a method for
developing, implementing, and supporting systems based on a process of
continuous improvement.
Additional Resources
Carnegie Mellon Capability Maturity Model
IBM TechDocs library: Information on autonomics — "A
First Look at Solution Installation for Autonomic Computing", IBM document
SG24-7099, available at the
IBM Quality management solutions
American Productivity and Quality Center
American Society for Quality