Of the various constraints possible on relational tables, referential constraints are perhaps the most common … and most misused. Learn about the advantages and disadvantages of different methods to implement and enforce RI, and issues that must be addressed when implementing DBMS-enforced Referential Integrity.
Of
the various constraints possible on relational tables, referential constraints
are perhaps the most common … and most misused.
Basic definitions
Referential
constraints occur in most business data, and are often documented in data
models as one-to-many relationships between entities. For example, let’s
consider a retail application: one customer has one (or many) orders; one order
contains one (or many) products; and so forth.
A
database where all such data and their relationships are consistent is said to
have referential integrity (RI). How is the data kept consistent? For example,
how do we ensure that an order inserted into the ORDER table is for a valid,
existing customer? Also, if we must delete a customer, what happens to the
customer’s outstanding orders?
To
summarize the elements of an RI relationship using our retail example:
-
The
CUSTOMER table is the parent table, and contains rows whose unique
identifier (the primary key) is the value in column customer_number. -
The
ORDER table is the child table, and contains rows having values for
customer_number (the foreign key) that are required to correspond to
values in the CUSTOMER table. -
This
referential constraint (parent table with primary key, child table with foreign
key) also includes a delete rule. This specifies what logical action
must be taken when a row from the parent table is deleted. The most common
actions: Restrict (do not allow a deletion of a CUSTOMER if any
corresponding ORDERs exist); and Cascade (when a CUSTOMER row is
deleted, delete all corresponding ORDER rows).
When
we implement the data model and its relationships as tables in a relational
database (such as DB2 for z/OS) there are several ways to implement and enforce
RI. The most common are:
-
Application-enforced. Have applications
check data values as they are entered into the database. When deleting parent
table rows, the application must take the appropriate action with the
"orphaned" child rows. -
Database
management system (DBMS)-enforced. Have the DBMS automatically enforce valid data
values and, based on the delete rule, automatically take the action specified.
Let’s
continue with our customer/order example.
If
we implement application-enforced RI, then any process that inserts a row to
the ORDER table must first interrogate the CUSTOMER table and ensure that the
customer already exists. Any process that deletes a CUSTOMER must implement the
RI delete rule and take action on the corresponding ORDER rows (either prevent
the CUSTOMER delete from happening if matching ORDER rows exist, or delete all
matching ORDER rows).
With
DBMS-enforced RI, processes are free to insert rows to the ORDER table with any
customer value they wish; however, the DBMS will prevent any insert having an
invalid customer value by checking the parent (CUSTOMER) table itself. Further,
the DBMS will automatically implement the delete rule.
Where do we enforce RI?
While
these two methods seem to be equivalent, there are several advantages and
disadvantages to each.
-
Change
control -
Application-enforced
RI embeds the RI-checking and delete rule logic in applications. How and where do
we document this? Changing a relationship now involves changing application
code, which also means re-testing, re-certifying, and so forth. -
DBMS-enforced
RI is stored in the database definition. Changing it is a relatively simple
database change, but if RI is changed, will application logic respond properly
to (potentially) new error messages? - Performance
-
Application-enforced
RI means potential multiple invocations of the DBMS software for insert and
delete operations. An ORDER row insert must be preceded by a query against the
CUSTOMER table. Each of these requires a call to the DBMS, with corresponding
parameter setup, error handling, documentation, and testing. -
DBMS-enforced
RI automates data validation and delete rule execution. -
Documentation
clarity -
Application-enforced
RI is embedded in application logic. This makes centralizing the documentation
about business rules difficult. -
DBMS-enforced
RI requires that all RI relationships and delete rules are included in the
database definition. This nicely centralizes RI documentation that can be
displayed easily with many modeling tools. -
Code
generation -
Many
code-generation software tools and query tools depend upon DBMS-enforced RI to
determine how related tables should be JOINed. With application-enforced RI,
all of these rules must be manually entered into the tool. -
Backup
and recovery -
In
a recovery situation (for example, a media failure) it is typical that the
parent and child tables will be recovered together to the same point-in-time.
How does the DBA determine this if we implemented application-enforced RI? The
only way is to have complete, consistent, up-to-date documentation on all RI
relationships. And, if these change, then application code changes must be
accompanied by changes in the DBA’s backup and recovery processes. -
With
DBMS-enforced RI, the DBA can query the database to determine what RI
relationships exist. With DB2 for z/OS, one can run the REPORT RECOVERY
utility, which will list related tables for recovery.
Is DBMS-enforced RI a best practice?
While
the above might indicate that DBMS-enforced RI is a best practice, there are
additional issues that must be addressed when implementing it. These include:
-
Data
environment -
While
it is common to implement RI in operational data, many data warehouse
implementations do not use it at all. This is because data inserted to the
data warehouse is presumed to be already validated. - Replication
-
Some
databases are created in order to maintain copies of all or most of another
database. There are several versions of this: examples include data
replication and publish/subscribe applications. These target
databases may receive copies of data in haphazard fashion; as a consequence, at
any particular moment they may not be consistent. For example, suppose we are
replicating customer and order information from database A to database B. When
all data is finally received then database B is consistent; however, there may
be moments when orders have arrived a few instants prior to their corresponding
customers. -
Code
tables -
Code
tables contain data typically used as abbreviations in other tables. One
example is U.S. state code abbreviations. It would be possible to implement RI
such that the code table is the parent and tables containing the code column
are defined as child tables; however, this is rarely done. By their nature,
code tables rarely experience row deletions. In addition, it is common for
applications to use code tables for data validation. For example, an
application inserting a customer address might use a U.S. state code table to
populate a pull-down list for an online user. Since the codes used in this
fashion are valid, there is no need to use RI to enforce data consistency. -
Existing
data and application logic -
Some
existing databases and applications may have already implemented
application-enforced RI. In this case, implementing DBMS-enforced RI would not
only be redundant but potentially wasteful of resources.
How do we mitigate the disadvantages of DBMS-enforced RI?
In the current version of DB2 for z/OS
(Version 8) there is a parameter in RI definition that the DBA can use to ease
the restrictions noted above. Here is a typical RI definition example:
... CREATE TABLE ORDER ... FOREIGN KEY Customer_Number REFERENCES CUSTOMER ON DELETE CASCADE
This syntax reads as follows (I’ve removed
other things like the column definitions for clarity): "Table ORDER
contains a foreign key column called Customer_Number. The parent table is
CUSTOMER. When a customer is deleted from the CUSTOMER table, delete all
corresponding rows from the ORDER table."
Now, consider the following slight change:
... CREATE TABLE ORDER ... FOREIGN KEY Customer_Number REFERENCES CUSTOMER ON DELETE CASCADE NOT ENFORCED
This last parameter, "NOT
ENFORCED", does several things:
-
The
RI relationship is still defined in the database, providing a centralized place
for RI rule documentation; -
The
DB2 REPORT RECOVERY utility will generate recovery information for the tables
as parent and child; -
It
allows data query and code tools to detect the RI relationship, permitting
complete data model and table JOIN documentation, and so forth; -
It
will not cause issues in data warehouse or replication environments.
Summary
Referential integrity is a property of
business data. It manifests itself as business rules defining relationships
between entities in a logical model. When a data architect transforms this
logical model into a physical database model, they must decide how the RI will
be enforced. The most common choices are application-enforced and
DBMS-enforced.
DBMS-enforced referential integrity is a best
practice. In cases where implementing RI this way may cause problems, DBAs can
use the NOT ENFORCED option to ease restrictions.