Clustering for Indexes
November 4, 2004
There is nothing spectacular about using indexes per se. However, on many occasions I have come across a variety of SQL coders that never consider validating that the index they think they are using is efficient or even being used at all. We can all put indexes on the columns that we think will be required to satisfy individual queries, but how do we know if they will ever be used. You see, if the underlying table data is constructed, contains, or is ordered in a particular way, our indexes may never be used. One of the factors around the use of an index is its clustering factor and this is what this article is about.
The dictionary.com definition of clustering is "A group of the same or similar elements gathered or occurring closely together." This is not too far from what we mean by clustering for an index when relating it to databases. If you were to extract a definition from the Oracle documentation, you would get to a concept of Clustering Factor, which is defined as "the amount of order of the rows in the table based on the values of the index." So what is a good clustering factor as opposed to a bad clustering factor? As the clustering factor approaches and reaches the number of blocks in the table, the index is considered to have good clustering. You can quickly see that if an index block points to a table data block, then all of the pointers in the index are pointing to table data that is very closely related and thus we will get very good response from using the index. On the other hand, poor or bad clustering is very much the opposite. As the clustering factor approaches the number of rows in the table, the worse our usability of the index becomes. This means that the entries in each of the index blocks are in different blocks in the table data blocks, our access to the data could not be sequential, and we loose all ability to scan table data in order of the index. Figure 1 gives a good picture of how good clustering is nicely ordered when accessing data blocks from the index blocks. Figure 2 shows how the index to data blocks is random for bad clustering.
Effects on performance
db file sequential &
Buffer Busy Waits
3. SQL execution path
Structure changes for cluster performance
Order of columns in an index
2. Load your data in the order you have set up the columns for the index(es). If you have more than one index on an object, you should revert to the business rules and SLAs around the queries that are most important. You can accomplish this by any method you like but it does require a rebuild (reorganization) of the table. If you have ETL processes, try to pre-sort the data as it will improve performance and limit the complaints latter.
3. Add data needed in the table to the index. While this is more of a hack than a fix, if you cannot get to reorganize your data in the table this may be your only choice. You may also have a situation where the order in the table is great for a set of indexes but is totally out of order for another. Adding a column or two in these situations is a life saver.
Where to look at clustering
Having a good clustered index on a table reduces the time to execute and resources required to produce result sets. This is important as other queries may starve for those resources. Asking yourself a few questions about the access paths and data required to ask well-defined business questions puts you closer to discovering the true indexes required within your databases. We all know that this is not a perfect world and often we inherit bad designs. By looking at the clustering factor, it only takes a moment, we can quickly gain insight into how the data looks in the objects being queried and thus possible performance issues we need to deal with.