A Practical Guide to Data Warehousing in Oracle, Part 5
April 28, 2004
On a regular OLTP system, the integrity of the data is generally supported in two ways - firstly by logic embedded in the application, and secondly by constraints placed on the logical structure of the database itself. The two may be complementary, as a designer may rely on the latter to support the former, through interpretation of error messages raised by the database.
In a data warehousing environment it is common practice to rely more heavily on the application level to guarantee the data integrity - the application in this instance is the Extraction, Transformation & Loading (ETL) process (or ELT process, depending on your choice of technologies).
Why the Difference?
In the minds of dedicated OLTP professionals, alarm bells are now ringing loudly - except maybe for those with a lifetime of experience on major enterprise-level software, who may never have heard of constraints of any type. So why the difference? Why can a data warehouse be operated successfully and safely without integrity being enforced at the database level?
The simple answer is that the ETL application is itself a reasonably simple application, and the data warehouse is a reasonably simple logical structure. The ETL process may apply complex business cleansing rules, but there are generally only a few paths through which it will modify the database. A fact table might only be touched by a single sub-process inserting new rows of data. Contrast this to the number of different operations that might be performed on a table of an OLTP system, where the database-side constraints are a necessary precaution against application error.
The enforcement of constraints is often easier in the ETL layer because it has special information about the data. For example, if the logical primary key on a data warehouse fact table involves a date on which stored transactions took place, then the ETL process might guarantee that new data will not violate such a key by retrieving from the source data set only those transactions that occurred after the latest one in the fact table. Similarly, the ETL process might detect new values of a column in a fact table and maintain the appropriate dimension table, so that foreign keys would never find a violation of the relationship that they enforce.
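The first technique above can be sketched as an incremental load. This is only an illustration - the table and column names (sales_fact, staging_sales, txn_date) are hypothetical, and the NVL default handles the first load, when the fact table is still empty:

```sql
-- Only pull source rows newer than anything already loaded, so the
-- insert cannot violate the logical primary key on (txn_date, ...).
INSERT /*+ APPEND */ INTO sales_fact (txn_date, product_id, amount)
SELECT s.txn_date, s.product_id, s.amount
FROM   staging_sales s
WHERE  s.txn_date > (SELECT NVL(MAX(f.txn_date), DATE '1900-01-01')
                     FROM   sales_fact f);
```

Because the filter is driven by the fact table's own high-water mark, the ETL process enforces the key without the database ever having to check it.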
Protection against invalid data and data relationships is not the only function of database-side constraints of course. The Oracle optimizer can leverage constraint information to produce more effective execution plans. As a simple example, knowledge that an indexed column is constrained to be NOT NULL enables Oracle to use the index to find all the unique values of that column.
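To make the NOT NULL example concrete, here is a minimal sketch (the table and index names are illustrative, not from any real system):

```sql
-- Hypothetical fact table; product_id is declared NOT NULL.
CREATE TABLE sales (
  txn_date    DATE   NOT NULL,
  product_id  NUMBER NOT NULL,
  amount      NUMBER
);

CREATE INDEX sales_prod_ix ON sales (product_id);

-- Because NULLs are excluded from a single-column b-tree index, the
-- optimizer can only answer this from the index when it knows that
-- every row must appear there - which the NOT NULL constraint tells it.
SELECT DISTINCT product_id FROM sales;
```

Without the NOT NULL declaration, Oracle would have to fall back to a full scan of the table to be sure of finding every row.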
So What's Wrong With Constraints?
Given that constraints provide valuable help to the optimizer, what is the problem with making use of them? We must first acknowledge that there are three elements to a constraint: the rejection of data modifications that would violate it, the supporting structures (such as the indexes that enforce primary and unique keys), and the information that it makes available to the optimizer.
So which of these elements are desirable and undesirable in the context of a data warehouse?
The rejection of invalid modifications to the data is undesirable here, for the reasons previously stated - the ETL process will often render such checks redundant, and duplicating the validation is an unnecessary burden on the database.
The supporting structures can be especially unwelcome, for two reasons. Firstly, there is the storage they require, which will probably be significant in comparison to that of the fact table itself, especially if the table's data segments are compressed. However, given the comments in the earlier article, in which I advised that a data warehouse runs out of i/o capacity long before it runs out of storage capacity, this is less significant than the second reason for preferring not to maintain these indexes: the burden that their maintenance places on CPU and i/o capacity. This burden is imposed at a time when the system is already busy loading new data, and if the data load runs very frequently, or continuously, instead of during a comfortable off-hours load window, then the overhead is doubly unwelcome.
Finding a Middle Ground
So hopefully that is sufficient to make the case for a set of constraints that perform only the function of providing information to the optimizer. Oracle provides this functionality in the form of the RELY keyword, which allows a constraint to be created in a disabled and non-validating state - meaning that it does not check new or existing data for integrity - while telling the optimizer that it can infer that the data meets the declared constraint definitions. We can distinguish between this form of integrity and the regular enforced kind by referring to it as declarative integrity.
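In DDL terms, this is the RELY DISABLE NOVALIDATE constraint state. A sketch, again using hypothetical table names:

```sql
-- Declare the keys without enforcement: no index is built, no data
-- (new or existing) is checked, but the optimizer may RELY on them.
ALTER TABLE sales_fact ADD CONSTRAINT sales_fact_pk
  PRIMARY KEY (txn_date, product_id, store_id)
  RELY DISABLE NOVALIDATE;

ALTER TABLE sales_fact ADD CONSTRAINT sales_fact_prod_fk
  FOREIGN KEY (product_id) REFERENCES products (product_id)
  RELY DISABLE NOVALIDATE;

-- For query rewrite against materialized views to trust unenforced
-- constraints, the integrity level must be relaxed from the default:
ALTER SESSION SET QUERY_REWRITE_INTEGRITY = TRUSTED;
```

The trade is explicit: you are promising Oracle that the ETL process keeps these declarations true, and wrong data can therefore produce wrong results rather than an error.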
Limitations on Declarative Integrity
Do not be disheartened by these limitations - given the choice of creating enforced integrity, declarative integrity, or no integrity, I would always go for the declarative. Even if the optimizer is not going to make much use of the information, it is there to provide metadata to ETL applications, front-end applications, and support staff.
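That metadata is readily visible in the data dictionary. For example (the table name is hypothetical):

```sql
-- Unenforced keys still appear in the dictionary, so tools and
-- support staff can discover the declared relationships.
SELECT constraint_name, constraint_type, status, validated, rely
FROM   user_constraints
WHERE  table_name = 'SALES_FACT';
```

For the constraints declared as above, STATUS shows DISABLED, VALIDATED shows NOT VALIDATED, and the RELY column shows RELY.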
It will also stop those uninitiated OLTP-persons from mocking you for not building integrity into your database - with declarative integrity, they will never know the difference.