Referential Integrity: Best Practices for IBM DB2

Of the various constraints possible on relational tables, referential constraints are perhaps the most common … and most misused. Learn about the advantages and disadvantages of different methods to implement and enforce RI, and issues that must be addressed when implementing DBMS-enforced Referential Integrity.

Of
the various constraints possible on relational tables, referential constraints
are perhaps the most common … and most misused.

Basic definitions
Where do we enforce RI?
Is DBMS-enforced RI a best practice?
How do we mitigate the disadvantages of DBMS-enforced RI?
Summary

Basic definitions

Referential
constraints occur in most business data, and are often documented in data
models as one-to-many relationships between entities. For example, let’s
consider a retail application: one customer has one (or many) orders; one order
contains one (or many) products; and so forth.

A
database where all such data and their relationships are consistent is said to
have referential integrity (RI). How is the data kept consistent? For example,
how do we ensure that an order inserted into the ORDER table is for a valid,
existing customer? Also, if we must delete a customer, what happens to the
customer’s outstanding orders?

To
summarize the elements of an RI relationship using our retail example:

The
CUSTOMER table is the parent table, and contains rows whose unique
identifier (the primary key) is the value in column customer_number.
The
ORDER table is the child table, and contains rows having values for
customer_number (the foreign key) that are required to correspond to
values in the CUSTOMER table.
This
referential constraint (parent table with primary key, child table with foreign
key) also includes a delete rule. This specifies what logical action
must be taken when a row from the parent table is deleted. The most common
actions: Restrict (do not allow a deletion of a CUSTOMER if any
corresponding ORDERs exist); and Cascade (when a CUSTOMER row is
deleted, delete all corresponding ORDER rows).

When
we implement the data model and its relationships as tables in a relational
database (such as DB2 for z/OS) there are several ways to implement and enforce
RI. The most common are:

Application-enforced. Have applications
check data values as they are entered into the database. When deleting parent
table rows, the application must take the appropriate action with the
"orphaned" child rows.
Database
management system (DBMS)-enforced. Have the DBMS automatically enforce valid data
values and, based on the delete rule, automatically take the action specified.

Let’s
continue with our customer/order example.

If
we implement application-enforced RI, then any process that inserts a row to
the ORDER table must first interrogate the CUSTOMER table and ensure that the
customer already exists. Any process that deletes a CUSTOMER must implement the
RI delete rule and take action on the corresponding ORDER rows (either prevent
the CUSTOMER delete from happening if matching ORDER rows exist, or delete all
matching ORDER rows).

With
DBMS-enforced RI, processes are free to insert rows to the ORDER table with any
customer value they wish; however, the DBMS will prevent any insert having an
invalid customer value by checking the parent (CUSTOMER) table itself. Further,
the DBMS will automatically implement the delete rule.

Where do we enforce RI?

While
these two methods seem to be equivalent, there are several advantages and
disadvantages to each.

Change
control
- Application-enforced
  RI embeds the RI-checking and delete rule logic in applications. How and where do
  we document this? Changing a relationship now involves changing application
  code, which also means re-testing, re-certifying, and so forth.
- DBMS-enforced
  RI is stored in the database definition. Changing it is a relatively simple
  database change, but if RI is changed, will application logic respond properly
  to (potentially) new error messages?
Performance
- Application-enforced
  RI means potential multiple invocations of the DBMS software for insert and
  delete operations. An ORDER row insert must be preceded by a query against the
  CUSTOMER table. Each of these requires a call to the DBMS, with corresponding
  parameter setup, error handling, documentation, and testing.
- DBMS-enforced
  RI automates data validation and delete rule execution.
Documentation
clarity
- Application-enforced
  RI is embedded in application logic. This makes centralizing the documentation
  about business rules difficult.
- DBMS-enforced
  RI requires that all RI relationships and delete rules are included in the
  database definition. This nicely centralizes RI documentation that can be
  displayed easily with many modeling tools.
Code
generation
- Many
  code-generation software tools and query tools depend upon DBMS-enforced RI to
  determine how related tables should be JOINed. With application-enforced RI,
  all of these rules must be manually entered into the tool.
Backup
and recovery
- In
  a recovery situation (for example, a media failure) it is typical that the
  parent and child tables will be recovered together to the same point-in-time.
  How does the DBA determine this if we implemented application-enforced RI? The
  only way is to have complete, consistent, up-to-date documentation on all RI
  relationships. And, if these change, then application code changes must be
  accompanied by changes in the DBA’s backup and recovery processes.
- With
  DBMS-enforced RI, the DBA can query the database to determine what RI
  relationships exist. With DB2 for z/OS, one can run the REPORT RECOVERY
  utility, which will list related tables for recovery.

Is DBMS-enforced RI a best practice?

While
the above might indicate that DBMS-enforced RI is a best practice, there are
additional issues that must be addressed when implementing it. These include:

Data
environment
- While
  it is common to implement RI in operational data, many data warehouse
  implementations do not use it at all. This is because data inserted to the
  data warehouse is presumed to be already validated.
Replication
- Some
  databases are created in order to maintain copies of all or most of another
  database. There are several versions of this: examples include data
  replication and publish/subscribe applications. These target
  databases may receive copies of data in haphazard fashion; as a consequence, at
  any particular moment they may not be consistent. For example, suppose we are
  replicating customer and order information from database A to database B. When
  all data is finally received then database B is consistent; however, there may
  be moments when orders have arrived a few instants prior to their corresponding
  
  customers.
Code
tables
- Code
  tables contain data typically used as abbreviations in other tables. One
  example is U.S. state code abbreviations. It would be possible to implement RI
  such that the code table is the parent and tables containing the code column
  are defined as child tables; however, this is rarely done. By their nature,
  code tables rarely experience row deletions. In addition, it is common for
  applications to use code tables for data validation. For example, an
  application inserting a customer address might use a U.S. state code table to
  populate a pull-down list for an online user. Since the codes used in this
  fashion are valid, there is no need to use RI to enforce data consistency.
Existing
data and application logic
- Some
  existing databases and applications may have already implemented
  application-enforced RI. In this case, implementing DBMS-enforced RI would not
  only be redundant but potentially wasteful of resources.

How do we mitigate the disadvantages of DBMS-enforced RI?

In the current version of DB2 for z/OS
(Version 8) there is a parameter in RI definition that the DBA can use to ease
the restrictions noted above. Here is a typical RI definition example:

     ... CREATE TABLE   ORDER
     ...
     FOREIGN KEY  Customer_Number
     REFERENCES CUSTOMER
     ON DELETE CASCADE

This syntax reads as follows (I’ve removed
other things like the column definitions for clarity): "Table ORDER
contains a foreign key column called Customer_Number. The parent table is
CUSTOMER. When a customer is deleted from the CUSTOMER table, delete all
corresponding rows from the ORDER table."

Now, consider the following slight change:

     ... CREATE TABLE   ORDER
     ...
     FOREIGN KEY  Customer_Number
     REFERENCES CUSTOMER
     ON DELETE CASCADE
     NOT ENFORCED

This last parameter, "NOT
ENFORCED", does several things:

The
RI relationship is still defined in the database, providing a centralized place
for RI rule documentation;
The
DB2 REPORT RECOVERY utility will generate recovery information for the tables
as parent and child;
It
allows data query and code tools to detect the RI relationship, permitting
complete data model and table JOIN documentation, and so forth;
It
will not cause issues in data warehouse or replication environments.

Summary

Referential integrity is a property of
business data. It manifests itself as business rules defining relationships
between entities in a logical model. When a data architect transforms this
logical model into a physical database model, they must decide how the RI will
be enforced. The most common choices are application-enforced and
DBMS-enforced.

DBMS-enforced referential integrity is a best
practice. In cases where implementing RI this way may cause problems, DBAs can
use the NOT ENFORCED option to ease restrictions.

»

See All Articles by Columnist
Lockwood Lyon

Referential Integrity: Best Practices for IBM DB2

Basic definitions

Where do we enforce RI?

Is DBMS-enforced RI a best practice?

How do we mitigate the disadvantages of DBMS-enforced RI?

Summary

Lockwood Lyon

Company

Categories