On a regular OLTP system, the
integrity of the data is generally supported in two ways - firstly by logic
embedded in the application, and secondly by constraints placed on the logical
structure of the database itself. The two may be complementary, as a designer
may rely on the latter to support the former, through interpretation of error
messages raised by the database.
In a data warehousing environment
it is common practice to rely more heavily on the application level to
guarantee the data integrity - the application in this instance is the
Extraction, Transformation & Loading" (ETL) process (or ELT process,
depending on your choice of technologies).
Why the Difference?
In the minds of dedicated OLTP
professionals, alarm bells are now ringing loudly - except maybe for those with
a lifetime of experience on major enterprise-level software, who may never have
heard of constraints of any type. So why the difference? Why can a data
warehouse be operated successfully and safely without integrity being enforced
at the database level?
The simple answer is that the ETL
application is itself a reasonably simple application, and the data warehouse
is a reasonably simple logical structure. The ETL process may apply complex
business cleansing rules, but there are generally only few paths through which
it will modify the database. A fact table might only be touched by a single
sub-process inserting new rows of data. Contrast this to the number of different
operations that might be performed on a table of an OLTP system - the
database-side constraints are a necessary precaution against application error.
The enforcement of constraints is
often easier in the ETL layer because it has special information about the
data. For example if the logical primary key on a data warehouse fact table
involves a date on which stored transactions took place, then the ETL process
might be able to guarantee that new data will not violate such a key by only
retrieving from the source data set information on transactions that occurred
after the latest in the fact table. Alternatively, the ETL process might have
to detect values of a column in a fact table and maintain the appropriate
dimension table, so foreign keys would never find a violation of the
relationship that they enforce.
Protection against invalid data
and data relationships is not the only function of database-side constraints of
course. The Oracle optimizer can leverage constraint information to produce
more effective execution plans. As a simple example, knowledge that an indexed
column is constrained to be NOT NULL enables Oracle to use the index to find
all the unique values of that column.
So What's Wrong With Constraints?
Given that constraints provide
valuable help to the optimizer, what is the problem with making use of them? We
must first acknowledge that there are two or three elements to these
information that they provide concerning the intentions of the database designer.
use of this information by the database. It is used to reject modifications to
data that would result in a condition that would violate the intent of the
constraint, and it is used by the optimizer to infer what assumptions may be
made about the data.
supporting structures and the overhead required to maintain them that allow
efficient enforcement of the constraints. Primarily these are any indexes
associated with Primary, Foreign and Unique keys constraints.
So which of these elements are
desirable and undesirable in the context of a data warehouse?
The rejection of undesirable
modifications to the data is undesirable, for the reasons previously stated -
the ETL process will often render such a process redundant and the overhead of
duplicating the validation is an unnecessary burden on the database.
The supporting structures can be
especially unwelcome for two reasons. Firstly, there is the storage requirement
for them, which will probably be significant in comparison to that required for
the fact table itself, especially if the table's data segment(s) is compressed.
However, given the comments in the earlier article in which I advised that a
data warehouse runs out of i/o capacity long before it runs out of storage
capacity this is less significant than the second reason for preferring to not
maintain these indexes, which is the burden that the maintenance places on CPU
load and i/o capacity. This burden is imposed at a time when the system is already
busy with the loading of new data, and if the data load process runs very
frequently, or continuously, instead of during a comfortable off-hours load
window, then the overhead is doubly unwelcome.
Finding a Middle Ground
So hopefully that is sufficient
to make the case for a set of constraints that perform only the function of
providing information to the optimizer. Oracle provides this functionality in
the form of the RELY keyword, which allows a constraint to be created in a
disabled and non-validating state - meaning that it does not check new or
existing data for integrity - while telling the optimizer that it can infer
that the data meets the declared constraint definitions. We can distinguish
between this form of integrity and the regular enforced kind by referring to it
as declarative integrity.
Limitations on Declarative Integrity
is seems redundant to say so, it is important to note that declarative
integrity provides no additional information that could not be provided by regular
enforced integrity. There is a method for providing additional information on
data relationships to the database through the logical objects known as Dimensions
that will be covered in a future article - these are not the tables that we
refer to as dimension tables, but they do give the optimizer a better
understanding of the data within them.
that if you try to modify a NOT NULL constraint to the RELY state, you will be
welcomed with the ORA-25127:
RELY not allowed in NOT NULL constraint error. This is less of a problem
than it might seem for two reasons. Firstly, there is no index supporting the
NOT NULL constraint, so there is less of an advantage to disabling it. Secondly,
a data warehouse will more commonly use bitmap indexes for fact table columns,
and in contrast to b-tree indexes, bitmap indexes store null values and the
optimizer knows it. Thirdly, the overhead on validating that a column is not
null is very much less than just about any other constraint type.
extent to which the optimizer uses declarative integrity information on primary
and foreign key constraints is confined to query rewrite against materialized
views. If you are not using this feature then you will find limited benefits in
creating such constraints.
Do not be disheartened by the
limitations listed above - given the choice of creating enforced integrity,
declarative integrity, or no integrity, I would always go for the declarative.
Even if the optimizer is not going to make much use of the information, it is
there to provide metadata to ETL applications, front-end applications, and
It will also stop those
uninitiated OLTP-persons from mocking you for not building integrity into your
database - with declarative integrity, they will never know the difference.
See All Articles by Columnist Dave Aldridge