Referential Integrity: Best Practices for IBM DB2


Of the various constraints possible on relational tables, referential constraints are perhaps the most common … and most misused. Learn about the advantages and disadvantages of different methods to implement and enforce RI, and issues that must be addressed when implementing DBMS-enforced Referential Integrity.

Of
the various constraints possible on relational tables, referential constraints
are perhaps the most common … and most misused.

Basic definitions

Referential
constraints occur in most business data, and are often documented in data
models as one-to-many relationships between entities. For example, let’s
consider a retail application: one customer has one (or many) orders; one order
contains one (or many) products; and so forth.

A
database where all such data and their relationships are consistent is said to
have referential integrity (RI). How is the data kept consistent? For example,
how do we ensure that an order inserted into the ORDER table is for a valid,
existing customer? Also, if we must delete a customer, what happens to the
customer’s outstanding orders?

To
summarize the elements of an RI relationship using our retail example:

  • The
    CUSTOMER table is the parent table, and contains rows whose unique
    identifier (the primary key) is the value in column customer_number.
  • The
    ORDER table is the child table, and contains rows having values for
    customer_number (the foreign key) that are required to correspond to
    values in the CUSTOMER table.
  • This
    referential constraint (parent table with primary key, child table with foreign
    key) also includes a delete rule. This specifies what logical action
    must be taken when a row from the parent table is deleted. The most common
    actions: Restrict (do not allow a deletion of a CUSTOMER if any
    corresponding ORDERs exist); and Cascade (when a CUSTOMER row is
    deleted, delete all corresponding ORDER rows).

When
we implement the data model and its relationships as tables in a relational
database (such as DB2 for z/OS) there are several ways to implement and enforce
RI. The most common are:

  • Application-enforced. Have applications
    check data values as they are entered into the database. When deleting parent
    table rows, the application must take the appropriate action with the
    "orphaned" child rows.
  • Database
    management system (DBMS)-enforced
    . Have the DBMS automatically enforce valid data
    values and, based on the delete rule, automatically take the action specified.

Let’s
continue with our customer/order example.

If
we implement application-enforced RI, then any process that inserts a row to
the ORDER table must first interrogate the CUSTOMER table and ensure that the
customer already exists. Any process that deletes a CUSTOMER must implement the
RI delete rule and take action on the corresponding ORDER rows (either prevent
the CUSTOMER delete from happening if matching ORDER rows exist, or delete all
matching ORDER rows).

With
DBMS-enforced RI, processes are free to insert rows to the ORDER table with any
customer value they wish; however, the DBMS will prevent any insert having an
invalid customer value by checking the parent (CUSTOMER) table itself. Further,
the DBMS will automatically implement the delete rule.

Where do we enforce RI?

While
these two methods seem to be equivalent, there are several advantages and
disadvantages to each.

  • Change
    control
    • Application-enforced
      RI embeds the RI-checking and delete rule logic in applications. How and where do
      we document this? Changing a relationship now involves changing application
      code, which also means re-testing, re-certifying, and so forth.

    • DBMS-enforced
      RI is stored in the database definition. Changing it is a relatively simple
      database change, but if RI is changed, will application logic respond properly
      to (potentially) new error messages?
  • Performance
    • Application-enforced
      RI means potential multiple invocations of the DBMS software for insert and
      delete operations. An ORDER row insert must be preceded by a query against the
      CUSTOMER table. Each of these requires a call to the DBMS, with corresponding
      parameter setup, error handling, documentation, and testing.

    • DBMS-enforced
      RI automates data validation and delete rule execution.
  • Documentation
    clarity

    • Application-enforced
      RI is embedded in application logic. This makes centralizing the documentation
      about business rules difficult.

    • DBMS-enforced
      RI requires that all RI relationships and delete rules are included in the
      database definition. This nicely centralizes RI documentation that can be
      displayed easily with many modeling tools.
  • Code
    generation
    • Many
      code-generation software tools and query tools depend upon DBMS-enforced RI to
      determine how related tables should be JOINed. With application-enforced RI,
      all of these rules must be manually entered into the tool.
  • Backup
    and recovery
    • In
      a recovery situation (for example, a media failure) it is typical that the
      parent and child tables will be recovered together to the same point-in-time.
      How does the DBA determine this if we implemented application-enforced RI? The
      only way is to have complete, consistent, up-to-date documentation on all RI
      relationships. And, if these change, then application code changes must be
      accompanied by changes in the DBA’s backup and recovery processes.

    • With
      DBMS-enforced RI, the DBA can query the database to determine what RI
      relationships exist. With DB2 for z/OS, one can run the REPORT RECOVERY
      utility, which will list related tables for recovery.

Is DBMS-enforced RI a best practice?

While
the above might indicate that DBMS-enforced RI is a best practice, there are
additional issues that must be addressed when implementing it. These include:

  • Data
    environment
    • While
      it is common to implement RI in operational data, many data warehouse
      implementations do not use it at all. This is because data inserted to the
      data warehouse is presumed to be already validated.
  • Replication
    • Some
      databases are created in order to maintain copies of all or most of another
      database. There are several versions of this: examples include data
      replication
      and publish/subscribe applications. These target
      databases may receive copies of data in haphazard fashion; as a consequence, at
      any particular moment they may not be consistent. For example, suppose we are
      replicating customer and order information from database A to database B. When
      all data is finally received then database B is consistent; however, there may
      be moments when orders have arrived a few instants prior to their corresponding

      customers.
  • Code
    tables
    • Code
      tables contain data typically used as abbreviations in other tables. One
      example is U.S. state code abbreviations. It would be possible to implement RI
      such that the code table is the parent and tables containing the code column
      are defined as child tables; however, this is rarely done. By their nature,
      code tables rarely experience row deletions. In addition, it is common for
      applications to use code tables for data validation. For example, an
      application inserting a customer address might use a U.S. state code table to
      populate a pull-down list for an online user. Since the codes used in this
      fashion are valid, there is no need to use RI to enforce data consistency.
  • Existing
    data and application logic
    • Some
      existing databases and applications may have already implemented
      application-enforced RI. In this case, implementing DBMS-enforced RI would not
      only be redundant but potentially wasteful of resources.

How do we mitigate the disadvantages of DBMS-enforced RI?

In the current version of DB2 for z/OS
(Version 8) there is a parameter in RI definition that the DBA can use to ease
the restrictions noted above. Here is a typical RI definition example:

     ... CREATE TABLE   ORDER
     ...
     FOREIGN KEY  Customer_Number
     REFERENCES CUSTOMER
     ON DELETE CASCADE

This syntax reads as follows (I’ve removed
other things like the column definitions for clarity): "Table ORDER
contains a foreign key column called Customer_Number. The parent table is
CUSTOMER. When a customer is deleted from the CUSTOMER table, delete all
corresponding rows from the ORDER table."

Now, consider the following slight change:

     ... CREATE TABLE   ORDER
     ...
     FOREIGN KEY  Customer_Number
     REFERENCES CUSTOMER
     ON DELETE CASCADE
     NOT ENFORCED

This last parameter, "NOT
ENFORCED", does several things:

  • The
    RI relationship is still defined in the database, providing a centralized place
    for RI rule documentation;
  • The
    DB2 REPORT RECOVERY utility will generate recovery information for the tables
    as parent and child;
  • It
    allows data query and code tools to detect the RI relationship, permitting
    complete data model and table JOIN documentation, and so forth;
  • It
    will not cause issues in data warehouse or replication environments.

Summary

Referential integrity is a property of
business data. It manifests itself as business rules defining relationships
between entities in a logical model. When a data architect transforms this
logical model into a physical database model, they must decide how the RI will
be enforced. The most common choices are application-enforced and
DBMS-enforced.

DBMS-enforced referential integrity is a best
practice. In cases where implementing RI this way may cause problems, DBAs can
use the NOT ENFORCED option to ease restrictions.

»


See All Articles by Columnist

Lockwood Lyon

Lockwood Lyon
Lockwood Lyon is a systems and database performance specialist. He has more than 20 years of experience in IT as a database administrator, systems analyst, manager, and consultant. Most recently, he has spent time on DB2 subsystem installation and performance tuning. He is also the author of The MIS Manager's Guide to Performance Appraisal (McGraw-Hill, 1993).

Latest Articles