DB2 for z/OS Database Recovery – Get it Right


Database backup copies may not be enough to ensure data recoverability. The
DBA must "bake" recoverability into the database design process;
additionally, for current systems, the DBA needs to know whether current backup
processes support recovery procedures that will meet application recovery
objectives.

The Laws of Database Administration

In order of importance, the laws are recoverability, availability, security,
and performance. The easiest way to explain the laws is in the negative sense;
that is, by what you shouldn’t do. In this form, the laws can be stated as
"Thou shalt not cause thy …"

  • Data to become unrecoverable
  • Data to become unavailable
  • Data to become insecure
  • System to perform poorly.

The first three laws deal with data management, the fourth with system
management. These encapsulate the four most important responsibilities of the
DBA.

The Order of Importance

Some might quibble with the order of the laws.
For example, one manager told me: "In our shop, performance is of highest
importance. We have service level agreements that we must meet in order to meet
our customer’s needs. Performance is more important than backup and
recovery."

Performance is certainly an urgent concern, but recoverability remains the most
important. To determine what’s really important, consider this scenario:

You are a DBA supporting IT systems for a provider of medical and surgical
services. Your systems are used daily by doctors, nurses, and technicians to provide
services to patients. In some cases, these services (e.g., diagnosis) may
involve life-or-death decisions. Your supervisor asks you to give a
presentation to upper management (including the CIO, vice president of IT, and
major stockholders or owners) on the enhancements your department plans to make.
Your presentation begins as follows:

"Ladies and gentlemen, the DBA team will be implementing some
high-performance features in the near future. We have two implementation plans
and we’d like your assistance in choosing between the two.

"Plan A will involve performance changes that will result in up to 95
percent of our online transactions finishing in their required service
levels." You hear some grumbling from the audience, as they realize this
means 5 percent of the online transactions will perform poorly. You continue.

"Plan B will involve performance changes resulting in 100 percent of
our online transactions finishing in their required service levels." You
hear sighs of relief from your audience, as they are clearly more comfortable
with this plan.

"However," you continue, "if we have a major hardware
outage, there’s a good possibility that up to 5 percent of our data will be
missing or invalid."

Your audience now sits in stunned silence. They realize there could be a power
outage or other disaster that affects the IT systems. Should this occur, when
the systems come back up, every user will know there’s a 5 percent chance that
test results are missing, that diagnoses are incorrect, or that a patient’s
records have disappeared completely.

Which plan will your audience adopt? Which was more important to them:
recoverability or performance?

In the remainder of this article I will concentrate on the first law: data
recoverability.

Law #1: Data Recoverability

Ensure data recoverability. If there’s one thing to get right, this is it.
While other things (such as performance or security) may seem more urgent,
ensuring data recoverability is the database administrator’s most important
responsibility.

Recovery Considerations

Consider a project to implement a new database to support a critical
production application. If a disaster occurs, will the data be available within
the agreed-upon Recovery Time Objective (RTO)? If not, and the data is medical
or financial, the failure may breach contracts with vendors or violate audit
guidelines.

Data recoverability is also a major consideration in some legislation and
regulatory guidance. Here are several items that affect financial institutions:

  • Expedited Funds Availability (EFA) Act, 1989 requires federally
    chartered financial institutions to have a demonstrable business
    continuity plan to ensure prompt availability of funds.
  • Federal Financial Institutions Examination Council (FFIEC)
    Handbook 2003-2004 (Chapter 10) specifies that directors and managers are
    accountable for organizationwide contingency planning and for "timely
    resumption of operations in the event of a disaster."
  • Basel II, Basel Committee on Banking Supervision, Sound
    Practices for Management and Supervision, 2003 requires that banks establish
    business continuity and disaster recovery plans to ensure continuous
    operation and limit losses.

Most IT shops use regularly scheduled standard backup procedures (e.g., DB2
image copies), but few have actually tested the recovery time of these objects
and analyzed whether their backup procedures are sufficient (or necessary) for
their recovery requirements. One common mistake is forgetting that tablespace
recovery using an image copy must also include rebuilds of indexes (which take
additional time).
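
The point about index rebuilds can be made concrete with a little arithmetic.
The sketch below is only an illustration: the class, the function, and all of
the timings are hypothetical, and real figures would have to come from measured
recovery tests.

```python
from dataclasses import dataclass


@dataclass
class RecoveryEstimate:
    """Measured or estimated phases of one table space recovery, in minutes."""
    restore_minutes: float        # restore the most recent image copy
    log_apply_minutes: float      # apply log records written since that copy
    index_rebuild_minutes: float  # rebuild the indexes on the recovered data

    def total_minutes(self) -> float:
        return (self.restore_minutes
                + self.log_apply_minutes
                + self.index_rebuild_minutes)


def meets_rto(estimate: RecoveryEstimate, rto_minutes: float) -> bool:
    """True when the whole recovery, index rebuilds included, fits the RTO."""
    return estimate.total_minutes() <= rto_minutes


# Hypothetical figures: restore plus log apply fit a 60-minute RTO on their
# own, but the index rebuild pushes the total past it.
estimate = RecoveryEstimate(restore_minutes=30.0,
                            log_apply_minutes=15.0,
                            index_rebuild_minutes=35.0)
print(estimate.total_minutes(), meets_rto(estimate, rto_minutes=60.0))
```

Leaving the index rebuild out of the sum would make this recovery look
comfortably within its objective, which is exactly the mistake described above.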

The simplest way for the DBA to proceed is to ensure that recovery
requirements are documented during the requirements phase of systems design.
For existing systems, document the recovery requirements either immediately,
during the next phase of maintenance or enhancements, or during the next audit.
These documents can then be used to drive one or more internal projects to
measure application data recoverability, compare against recovery objectives,
and implement improvements.

Best practices for recoverability start with recovery, not with backups. The
DBA should not simply implement daily or weekly backups (image copies) as a
standard; instead, list possible recovery methods and options, their costs, and
their speeds. This list can then be used during design or enhancement projects
to ensure recovery requirements are met.

Some of these methods include:

  • Full image copies
  • Incremental image copies (with optional change
    accumulation to full copies)
  • Image copies of indexes (especially large indexes)
  • Image copies using the SHRLEVEL CHANGE option ("fuzzy
    copies")
  • Data replication
  • Hot or cold standby copies
  • Disk mirroring
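
One way to put such a list to work is to record each method with an estimated
recovery time and a relative cost, then pick the least expensive method that
still meets the objective. The following is a minimal sketch; the names, the
recovery times, and the cost ranks are all made-up placeholders, not
measurements.

```python
from typing import List, NamedTuple, Optional


class RecoveryMethod(NamedTuple):
    name: str
    est_recovery_minutes: float  # hypothetical measured recovery time
    relative_cost: int           # hypothetical cost rank, 1 = cheapest


def cheapest_method_within_rto(methods: List[RecoveryMethod],
                               rto_minutes: float) -> Optional[RecoveryMethod]:
    """Return the lowest-cost method that recovers within the RTO, if any."""
    candidates = [m for m in methods if m.est_recovery_minutes <= rto_minutes]
    return min(candidates, key=lambda m: m.relative_cost, default=None)


# Placeholder catalog; every number here is an assumption for illustration.
methods = [
    RecoveryMethod("Full image copy plus log apply", 120.0, 1),
    RecoveryMethod("Full copy plus index image copies", 75.0, 2),
    RecoveryMethod("Data replication to standby", 10.0, 5),
    RecoveryMethod("Disk mirroring", 5.0, 8),
]

choice = cheapest_method_within_rto(methods, rto_minutes=90.0)
print(choice.name if choice else "No listed method meets the RTO")
```

Starting from the recovery objective and working back to the backup method, as
this does, keeps the choice tied to requirements rather than to habit.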

DBAs should ensure they have all of the following:

  • A regularly scheduled process for determining (and
    documenting) the recovery status of all production objects;
  • Regular measurements of required recovery times for
    objects belonging to critical applications;
  • Development of alternative methods of backup and recovery
    for special situations (such as image copy of indexes, data replication to
    recovery site, and DASD mirroring);
  • Regular development, improvement, and review of data
    recoverability metrics.
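
The first item, a scheduled recovery status check, might be sketched as
follows. This assumes the time of each object's most recent image copy has
already been extracted (for example, from the DB2 catalog's record of image
copies) into a simple mapping; the object names, the dates, and the seven-day
limit are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def stale_objects(last_copy_times: Dict[str, Optional[datetime]],
                  max_age: timedelta,
                  now: datetime) -> List[str]:
    """Names of objects with no copy at all, or whose newest copy is too old."""
    stale = []
    for name in sorted(last_copy_times):
        copied_at = last_copy_times[name]
        if copied_at is None or now - copied_at > max_age:
            stale.append(name)
    return stale


# Hypothetical objects and copy times; None means no image copy was found.
now = datetime(2024, 1, 15, 6, 0)
last_copy_times = {
    "PAYDB.TSEMP": datetime(2024, 1, 14, 23, 0),
    "PAYDB.TSHIST": datetime(2024, 1, 2, 23, 0),
    "PAYDB.TSNEW": None,
}

for name in stale_objects(last_copy_times, timedelta(days=7), now):
    print("Recovery status needs review:", name)
```

Run on a schedule, a report like this documents recovery status without
requiring the DBA to check each object by hand.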

Automation of the Recovery Process

In general, the DBA should automate reactive or simple reporting processes,
freeing time for higher-level work. Your first reaction might be, "Wait!
I’ll automate myself out of a job!" Far from it. Implementing automation
makes the DBA more valuable. IT management wants its knowledge workers doing
tasks that add value. These might include detailed systems performance tuning,
quality control, cost/benefit reviews of potential new applications and
projects, and more. Management understands that a DBA spending time on trivial
tasks represents a net loss of productivity.

The advantage of automation isn’t merely speed; automating tasks helps move
the DBA away from reactive tasks such as reporting and analysis toward more
proactive functions.

Here’s a typical list of processes many DBAs still manually perform:

  • Executing an EXPLAIN process for SQL access path analysis
  • Generating performance reports such as System Management
    Facility (SMF) accounting and statistics reports
  • Verifying that new tables have columns with names and
    attributes that follow standard conventions and are compatible with the
    enterprise data model and data dictionary
  • Verifying that access to production data is properly
    controlled through the correct authority GRANTs
  • Monitoring application thread activity for deadlocks and
    timeouts
  • Reviewing console logs and DB2 address space logs for
    error messages or potential issues.

Each of these tasks can be replaced by an automated reporting or data-gathering
process of some kind. With such processes in place, DBAs can schedule data
gathering and report generation for later analysis, or guide requestors to the
appropriate screens, reports, or jobs. This removes the DBA from the
"reactive rut" and generates time for proactive tasks such
as projects, architecture, planning, systems tuning, and recovery planning.
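
As an illustration of how small such a process can be, here is a sketch that
tallies deadlock and timeout messages in log text that has already been
captured to lines. The watch list, the message ID pattern, and the sample lines
are assumptions chosen for the example (the sample lines are made up and
simplified); a real job would read actual log output and a site-specific list.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed watch list: message IDs mapped to a short label for the report.
WATCHED = {"DSNT375I": "deadlock", "DSNT376I": "timeout"}

# Rough pattern for DB2-style message IDs: DSN, one character, 3-4 digits,
# and a trailing type letter. This is an approximation, not a full grammar.
MESSAGE_ID = re.compile(r"\bDSN[A-Z0-9]\d{3,4}[A-Z]\b")


def count_watched_messages(lines: Iterable[str]) -> Counter:
    """Count occurrences of each watched message type in the given lines."""
    counts = Counter()
    for line in lines:
        for msg_id in MESSAGE_ID.findall(line):
            if msg_id in WATCHED:
                counts[WATCHED[msg_id]] += 1
    return counts


# Made-up, simplified sample lines for demonstration only.
sample = [
    "10.15.02 DSNT376I sample timeout text",
    "10.15.40 DSNT375I sample deadlock text",
    "10.16.11 DSNT376I sample timeout text",
    "10.17.00 unrelated line with no message ID",
]
for label, count in sorted(count_watched_messages(sample).items()):
    print(label, count)
```

Scheduled after each batch window, a tally like this replaces a manual scan of
the logs and leaves a record that can be trended over time.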

Along with choosing specific tasks to automate, you’ll probably need to
learn one or more automation tools or languages. REXX is an example of a
popular language for online or batch access to DB2 data. There are many
examples and ideas for automated processes in articles, presentations, and
white papers.

Autonomics in the Recovery Process

As our IT organizations have matured, we’ve become smarter about our
problems. We began to collect problem logs and analyze them for trends and
patterns. We began to recognize frequent problems and devise strategies
for automatically dealing with them or preventing them.

We’ve now reached the next logical step in this progression: engineering
processes and process control to make systems and applications self-aware and
self-healing. This is called autonomics. Autonomics, ranging from simple
scripts to complicated processes, can be applied to applications, systems, or
support software. For many DBAs, the idea of a self-healing database inspires
visions of the database redesigning itself.

What exactly would a self-healing database heal? One DB2 for z/OS example that
comes to mind is real-time statistics. DB2 maintains these statistics
dynamically, tracking the size of each object and the changes made to it since
utilities such as REORG and COPY last ran. One use is ahead of a reorg: the
statistics can be queried and the results used to decide whether or not to
execute the REORG utility.

These and other examples of DB2 autonomics make it possible to "program
in" a measure of self-tuning (or at least self-management) into the DBA’s
support infrastructure.

The next logical step is implementing autonomics into the recovery process.
The DBA can use real-time statistics to decide when to take image copies, or
what kind of copies (full or incremental).
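
A decision rule of this kind could look like the sketch below. It assumes two
inputs taken from the real-time statistics for a table space: the number of
active pages and the number of pages updated since the last image copy (DB2
exposes values of this kind in columns such as NACTIVE and COPYUPDATEDPAGES).
The 20 percent and 1 percent thresholds are illustrative assumptions only and
should be tuned to each application's recovery objectives.

```python
from typing import Optional


def choose_copy_type(active_pages: int,
                     updated_pages: Optional[int],
                     full_threshold_pct: float = 20.0,
                     incremental_threshold_pct: float = 1.0) -> str:
    """Return "full", "incremental", or "none" for one table space."""
    # Missing or unusable statistics: be conservative and take a full copy.
    if updated_pages is None or active_pages <= 0:
        return "full"
    # Nothing has changed since the last copy.
    if updated_pages == 0:
        return "none"
    changed_pct = 100.0 * updated_pages / active_pages
    if changed_pct >= full_threshold_pct:
        return "full"
    if changed_pct >= incremental_threshold_pct:
        return "incremental"
    return "none"


# Hypothetical statistics for three table spaces of 1,000 active pages each.
for updated in (300, 50, 5):
    print(updated, choose_copy_type(active_pages=1000, updated_pages=updated))
```

The design choice here is that a heavily changed object gets a full copy,
because applying many incremental copies and log records would lengthen
recovery, while a lightly changed object gets the cheaper incremental copy.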

Summary

In implementing best practices for database recovery, the DBA needs to move
away from reactive tasks, initiate quality measures, and offload basic, repetitive
tasks. I hope that you can use this material as you work to become more
productive. Remember to quantify and document your results, and then advertise
your value.

Lockwood Lyon

Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).
