A little personal history...
Some six or seven years ago, I was working for a utility
company, trying to extract high-level reports from their complex OLTP system.
It turned in to a project of fiendish complexity, that essentially tried to
answer the question, "Are we making a profit?" from a system holding all the
individual income and expenses. Even for a person with a pretty extensive
background in extracting information from transactional data it was a daunting
task, with many failed experiments, dead-ends encountered, and
systems-brought-to-their-knees by full table scans.
What was really needed, although we did not realize it until
rather late in the game, was a separate data warehouse or decision support
system. However, even with some books by experts on data warehousing in
general, and with Oracle-specific technical documentation, making the mental
leap from OLTP to this new paradigm was a daunting prospect.
This series of articles is intended to be the reference that
we were looking for, to answer both the questions that we had and those that we
should have been asking had we suspected that we needed the answers.
OK, at the time, the data warehousing tools at our disposal
pretty much amounted to bitmap indexes -- which the DBAs wouldn't let me use --
and the star query -- now pretty much a thing of the past. However, if such
features as table compression, materialized views with query rewrite,
dimensions, bitmap join-indexes etc. had existed at the time, then I hope that
this series would have been our savior.
Focusing on the problems...
I don't intend that these articles be a dumping ground for
information that you can look up in the excellent Oracle 9iR2 Data Warehousing
Guide, or in the Concepts Manual, or in the SQL Reference. You should be intimately
familiar with the latter two if you are to make a success of your databasing
efforts, warehousing or otherwise. Rather, they will be a guide that firstly
lets you know that this information exists, and secondly that will extract and
organize some key information from these and other sources, and refer to them
often. It will also offer advice on what features I have found to be the most
important, which ones I have found to be rather optional and how to recognize
the time to just say "no" to shiny new stuff.
Each article will tend to focus on a particular set of
decisions that we have to confront in our data warehousing efforts, and provide
information to help you isolate the key issues, identify potential solutions,
and defend your decisions to your skeptical co-workers, who peer out over the
battlements of their transaction processing systems in no-way prepared for the
different priorities of the data warehouse.
A high-level view...
As an introduction to the series I want to talk about a very
few high-level differences between OLTP and Data Warehousing systems. For those
who have walked the path before there will be few surprises, if any, but these
notes will lay the groundwork for many of the comments and much of the advice
One of the characteristics most often associated with data
warehouses is their larger size relative to OLTP systems. In some ways this is
rather obvious, but let us just look at two reasons why this may be so.
A data warehouse built using data from an OLTP system will
tend to support an historical view of the source system's transactions, hence
there may be many "versions" of an inventory record, for example, telling us
inventory levels for the past year at a daily, weekly, or monthly level. The
data warehouse may therefore contain 365-times, or thirty-six times more low-level
records than the OLTP system. This is not always the case though - if you are
warehousing retail sales records then the reason that the data warehouse is
larger may simply be that the OLTP system is purged of old data on a regular
basis. Possibly this is only happening because there is a data warehouse to
copy records to, so that they may be purged.
Whatever the reason for the size difference we must remember
that having a lot of bytes on disks is not necessarily a problem in itself, and
the raw size difference seems overstated to me. The methodologies for storing
and accessing data are so different between the two database types that the raw
size issue is misleading enough to be almost irrelevant.
This brings us neatly to the second issue, of how data is accessed.
An interesting metaphor describes OLTP databases as
"flickering" systems. If you were able to take a virtual look at all the data,
laid out in front of you, then the data reads and changes might appear as
small, scattered, flickering bursts of light. Individual rows are inserted,
individual columns are updated and there is almost a continuous activity level
as tens, hundreds or thousands of simultaneous operations. The occasional report,
bulk data extraction, or bulk data load might make your eyes water for a second
but they are generally kept aside for periods of lower transactional activity.
In contrast, as you watch your data warehouse you see many
distinct reporting activities lighting up large contiguous sections of some
tables. Every now and then on a regular basis a completely new set of data
appears -- maybe every fifteen minutes, maybe once a week, and if you sped up
the process, you might notice the appearance of more complex harmonics. For example,
every hour the new transactions get loaded, every 24 hours the customer records
are updated, and every month the price list is modified.
This data load process can usefully be thought of as the
heartbeat of the system-- everyone ignores it until it stops, and if it does
stop then you can be sure the alarms will go off, the phone system will light
up, and the IT director might even make a personal visit just to say "Hi" and
look over your shoulder for a while.
One of the other observations that you might make from your
virtual views of the two systems is that there are vast areas of the data
warehouse that are never modified. They may be read frequently, though not as
often as the most recent data, but once they have been created they sit
unchanged. This is sometimes the case with the OLTP system, where permanent
invoicing records are being generated, for example, but in general, a much
higher proportion of the data are fair game for future changes.
What it all means...
In the data warehousing environment we are handed some
distinct advantages over the OLTP system.
We know when data will arrive, because we control all of the data
We probably know how much data there will be.
We might know just what data is going to be in demand at
We know that once we have loaded our data, most of it will not
We also have some disadvantages.
We may have to deal with vast amounts of new data as fast as
Reporting requirements may change constantly.
In next months installment we'll take a look at disks,
datafiles, tablespaces and maybe discuss tables if we're lucky.
See All Articles by Columnist Dave Aldridge