A Practical Guide to Data Warehousing in Oracle, Part 4
March 30, 2004
On a regular OLTP system, one of the perennial sources for doubt and debate is the number, type and configuration of indexes, and for good reason. An OLTP system generally runs with some limited set of select queries and a range of updates, inserts, and deletes. A balance has to be struck between providing rapid access to data for all of these operations - the selects in particular, I suppose - and avoiding an excess of overhead when the delete, insert, or update actually modifies the table, and hence the indexes.
A data warehouse, once again, is a different matter. Firstly, there is probably no limited set of select statements that will be used - ad hoc queries are by their very nature rather unpredictable. Secondly, the system is probably not attempting to maintain indexes at the same time, as data is inserted row-by-row into the tables.
Therefore, here are some simple principles you can apply to the business of indexing your data warehouse.
What Columns Should Be Indexed?
This is pretty straightforward - if a column might be used as a predicate or join, then index it. This probably means that nearly every column is going to be indexed. You might be able to eliminate metric columns from consideration if you are sure that no one is going to want to get a list of all retail transactions of greater than $1,000 for example.
What Sort of Index Should I Use?
We can start by eliminating from consideration any reverse key indexes - we do not have much call for them as they are primarily intended for supporting high-concurrency insert operations - not the case here, and they also do not support range scans.
The two major types to consider are Bitmap indexes and B-Tree indexes. B-Tree indexes are the regular type that OLTP systems make much use of, and bitmap indexes are a highly compressed index type that tends to be used primarily for data warehouses.
When using b-tree indexes in a data warehouse you would do well to consider use of the COMPRESS parameter. This causes compression of repeated entries within an index block (and hence a larger block size can improve compression ratios). Compressed indexes, like compressed data segments, represent a trade-off between CPU usage and disk space usage. A compressed structure is faster to read from disk but takes additional CPU cycles to decompress for access - an uncompressed structure imposes a lower CPU load but requires more bandwidth to read in a short time.
A bitmap index can be thought of as representing this trade-off carried even further - the internal structure is very different, and allows a high degree of compression. The reason for confining bitmap indexes to data warehouses is that the overhead on maintaining them is enormous. A modification to a bitmap index requires a great deal more work on behalf of the system than a modification to a b-tree index. In addition, the concurrency for modifications on bitmap indexes is dreadful.
The advantages of them are that they have a highly compressed structure, making them fast to read and their structure makes it possible for the system to combine multiple indexes together for fast access to the underlying table.
One belief concerning bitmap indexes is that they are only suitable for indexing low-cardinality data. This is not necessarily true, and bitmap indexes can be used very successfully for indexing columns with many thousands of different values.
There does come a point however, at which the cardinality rises high enough for the bitmap index to start losing some of its edge over b-tree indexes. Exactly where this point is depends on the data set itself. Not just on the number of distinct values and the number of totals rows, but on the physical order of the rows as well. Bitmap indexes respond very well to physically sorted data sets, and can shrink dramatically in response to physical ordering.
If you suspect that you are getting close to the line between using b-tree and bitmap indexes then benchmark the two - start by comparing their relative sizes, then try comparing performance on some sample queries.
Single Column or Composite Indexes?
In general, I tend to go for single column indexes simply because they are more flexible. I like the idea of index skip-scans as a mechanism for accessing non-leading columns of a composite index, but the cost of finding that the optimizer is not doing so is too high in comparison to any overhead in maintaining multiple separate indexes for me to be comfortable with them.
Having said that, there are instances where one column gets a mention in nearly every report query. Generally, it is a date, and in such cases, you might like to bite the bullet and start building some composite indexes on the columns commonly used with such a column.
What you are generally saving here is the cost of joining two indexes in a bitmap merge operation, and it is often another case for benchmarking.
How About Those Bitmap Join Indexes?
These are a fairly recent introduction to Oracle, and allow a table to be indexed on a column value of a linked table - useful stuff if it actually works. There are a few hoops that have to be jumped through to do so, but don't let your enthusiasm for them lull you into the belief that they can rescue you from an over-snowflaked schema. They will help, but the best help you can give yourself is to keep all your dimensional values as "close" to the fact table as possible.