On a regular OLTP system, one of
the perennial sources for doubt and debate is the number, type and
configuration of indexes, and for good reason. An OLTP system generally runs
with some limited set of select queries and a range of updates, inserts, and
deletes. A balance has to be struck between providing rapid access to data for
all of these operations – the selects in particular, I suppose – and avoiding
an excess of overhead when the delete, insert, or update actually modifies the
table, and hence the indexes.
A data warehouse, once again, is
a different matter. Firstly, there is probably no limited set of select
statements that will be used – ad hoc queries are by their very nature rather
unpredictable. Secondly, the system is probably not attempting to maintain
indexes at the same time, as data is inserted row-by-row into the tables.
Therefore, here are some simple
principles you can apply to the business of indexing your data warehouse.
What Columns Should Be
This is pretty straightforward –
if a column might be used as a predicate or join, then index it. This probably
means that nearly every column is going to be indexed. You might be able to
eliminate metric columns from consideration if you are sure that no one is
going to want to get a list of all retail transactions of greater than $1,000
What Sort of Index Should I
We can start by eliminating from
consideration any reverse key indexes – we do not have much call for them as
they are primarily intended for supporting high-concurrency insert operations –
not the case here, and they also do not support range scans.
The two major types to consider
are Bitmap indexes and B-Tree indexes. B-Tree indexes are the regular type that
OLTP systems make much use of, and bitmap indexes are a highly compressed index
type that tends to be used primarily for data warehouses.
When using b-tree indexes in a
data warehouse you would do well to consider use of the COMPRESS parameter.
This causes compression of repeated entries within an index block (and hence a
larger block size can improve compression ratios). Compressed indexes, like
compressed data segments, represent a trade-off between CPU usage and disk
space usage. A compressed structure is faster to read from disk but takes
additional CPU cycles to decompress for access – an uncompressed structure
imposes a lower CPU load but requires more bandwidth to read in a short time.
A bitmap index can be thought of
as representing this trade-off carried even further – the internal structure is
very different, and allows a high degree of compression. The reason for
confining bitmap indexes to data warehouses is that the overhead on maintaining
them is enormous. A modification to a bitmap index requires a great deal more
work on behalf of the system than a modification to a b-tree index. In
addition, the concurrency for modifications on bitmap indexes is dreadful.
The advantages of them are that
they have a highly compressed structure, making them fast to read and their
structure makes it possible for the system to combine multiple indexes together
for fast access to the underlying table.
One belief concerning bitmap
indexes is that they are only suitable for indexing low-cardinality data. This
is not necessarily true, and bitmap indexes can be used very successfully for
indexing columns with many thousands of different values.
There does come a point however,
at which the cardinality rises high enough for the bitmap index to start losing
some of its edge over b-tree indexes. Exactly where this point is depends on
the data set itself. Not just on the number of distinct values and the number
of totals rows, but on the physical order of the rows as well. Bitmap indexes
respond very well to physically sorted data sets, and can shrink dramatically
in response to physical ordering.
If you suspect that you are
getting close to the line between using b-tree and bitmap indexes then
benchmark the two – start by comparing their relative sizes, then try comparing
performance on some sample queries.
Single Column or Composite
In general, I tend to go for
single column indexes simply because they are more flexible. I like the idea of
index skip-scans as a mechanism for accessing non-leading columns of a
composite index, but the cost of finding that the optimizer is not doing so is
too high in comparison to any overhead in maintaining multiple separate indexes
for me to be comfortable with them.
Having said that, there are
instances where one column gets a mention in nearly every report query. Generally,
it is a date, and in such cases, you might like to bite the bullet and start
building some composite indexes on the columns commonly used with such a
What you are generally saving
here is the cost of joining two indexes in a bitmap merge operation, and it is
often another case for benchmarking.
How About Those Bitmap Join
These are a fairly recent
introduction to Oracle, and allow a table to be indexed on a column value of a
linked table – useful stuff if it actually works. There are a few hoops that
have to be jumped through to do so, but don’t let your enthusiasm for them lull
you into the belief that they can rescue you from an over-snowflaked schema. They
will help, but the best help you can give yourself is to keep all your
dimensional values as "close" to the fact table as possible.