The Dos and Don'ts of Database Indexing
January 29, 2010
It always seems like the creation of indexes on a database are always the last thing developers and database designers think about when designing a database. Indexes are almost an afterthought. Developers seem to only place indexes on tables after they notice one of their queries are running slow. In addition, many times that does not happened until they place their applications into production. In my prior article, Database Indexing Development Lifecycle...Say What?, I talked about how it is important to have an index development lifecycle. This lifecycle allows you to design, build and maintain indexes throughout the entire application development lifecycle. In this article, I am going to expand on my indexing guideline discussion to address some of the dos and donts of indexing that will help you pick reasonable indexes at design time.
General Dos and Donts
You do not want to place too many indexes on tables that are frequently updated. A large number of INSERT, UPDATE, and DELETE statements being performed on a table will require the indexes on a table to be updated each time a column value is changed for one of the index columns. The more indexes you have on a table, the longer and more work the INSERT, UPDATE and DELETE statements will take. By minimizing the number of indexes you have on highly updated tables you can optimizes those update statements. Of course, if you have tables that are infrequently update then there is less of an impact in having a lot of indexes.
The size of the index key you pick is also important. A large index key will cause your index to take up more disk space. The bigger the index the more I/O that is required to read the index b-tree structure and the less index values return with each I/O operation. More I/O equates to slowing queries. Therefore, you want to pick index columns that are narrow in size.
Query Considerations that Affect Indexing
Indexes are useful for helping queries that have JOINs and WHERE clauses. You should review the kinds of queries your application is running. You need to look for those queries that are frequently run and that have JOIN and WHERE logic. Look at the columns used in the JOIN and WHERE clause. You should consider these columns for building your clustered and non-clustered indexes. By creating indexes on the columns that are used on the JOIN and WHERE clause will greatly enhance the performance of these frequently run queries.
Be careful not to create indexes for queries that are only run occasionally. Indexes that a created to support queries that are run infrequently, may cost more to maintain in the long run than a table scan ever once in a while. Knowing which of your application commands are being more frequently executed will help guide you as to which indexes to build. Knowing how your users use your application code can help you identify frequent and infrequently executed TSQL code, but you can also rely on DVMs to obtain query statistics and missing index statistics to will help with this analysis. Understanding how to use DMVs for this analysis is outside the scope of this article.
Creating covering indexes can greatly improve performance. A covering index is an index that contains all of the columns within your query, even if the columns are only being referenced in the selection list and are not contained in a JOIN or WHERE predicate. With SQL Server 2005 you can create covering indexes to include those covering columns you dont need in your search criteria by using the INCLUDE option within the CREATE INDEX statement. The INCLUDE option is a new indexing feature that was introduced in SQL Server 2005. If you are migrating, or have migrated your databases from SQL Server 2000 to SQL Server 2005 or 2008 you should recreate your covering indexes to use the INCLUDE clause. By doing, this you will minimize the space that SQL Server requires for the index. Not only that, but you will also improve the performance of the process of searching through those covering indexes. This is because the values of the columns in the INCLUDE clause are only stored in the leaf level pages. This makes your upper level index pages more compact, because they only contain the real index column values. This means less I/Os are required to search through those upper index pages to resolve the JOIN and WHERE predicates.
If you are using the ORDER BY clause so the data is returned to your application in a particular sorted order, then you should consider building indexes on the columns used in the ORDER BY clause. Having an index to support the ORDER BY column might help optimize your query. When the data in the index is sorted in the same order as you want your data returned, as specified by the ORDER BY clause, SQL Server can just go right through the leaf level pages returning information without having to perform a SORT operation prior to returning the data. In order for you to see what I am talking about, you can run the following code from within a query window within SQL Server Management Studio while having the Include Actual Execution Plan item enabled:
USE AdventureWorks; GO SELECT CreditRating, Name FROM Purchasing.Vendor ORDER BY CreditRating; CREATE NONCLUSTERED INDEX [IDX_Vendor_CreditRating] ON [Purchasing].[Vendor] ( [CreditRating] ASC ) INCLUDE ( [Name]); SELECT CreditRating, Name FROM Purchasing.Vendor ORDER BY CreditRating; DROP INDEX Purchasing.Vendor.IDX_Vendor_CreditRating;
When you run this code, you will notice the first SELECT statement requires a SORT operation to return the data from the Purchase.Vendor table in CreditRating order, whereas once the non-clustered index IDX_Vendor_CreditRating is created, the second SELECT statement does not require a sort operation. In this example, very few records need to be sorted, but if SQL Server had to perform a SORT operation on a large record set, then the resources might be substantial.
Clustered Index Consideration
Selecting your clustered index key columns is very important. Remember a clustered index contains both the index key values and the all the data for every column in the table. In addition, the data is stored in sorted order based on the clustered index key. Here are some guidelines to consider when selecting a clustered index key.
First, your clustered index key should be as short as possible (number of bytes). This is because the clustered index key value is used in all non-clustered indexes to tie together the non-clustered indexes values to the clustered index rows based on the clustered index key. This is so SQL Server can quickly identify the clustered index leaf pages needed to retrieve the additional columns not contained in the non-clustered index. The larger the length of your clustered index key the larger the non-clustered index will be, and the more I/O that will be required when reading the non-clustered leaf and clustered index pages.
Your clustered index key columns should be used frequently in search criteria, and should be highly selective. A highly selective column is a column that has a large number of different values, where each unique value is only associated with a small percentage of total rows. The more selective the column the fewer index pages that will need to be read when searching for rows that meet a particular search criteria. Imagine how much of an index would have to be read if you had a clustered index on say SEX that had a value of M and F. 50% of the index would need to be read to find all the records where M was the SEX if there was an even distribution of males and females. Sex would not be a very selective column to pick.
Columns that are used in the GROUP BY clause can be good candidates for clustered index key columns. Your data will be sorted by the key columns. The GROUP BY clause requires the results set be sorted first, prior to performing the GROUP BY aggregate functions. Therefore, when grouping on the clustered index key columns, SQL Server will not have to perform a sort operation prior to grouping the data.
Columns that are accessed sequentially make good candidates for clustered index keys. One example of sequential processing might be using the BETWEEN clause in a WHERE statement. If the column used in the BETWEEN clause is the clustered index key, then SQL Server can just find the first leaf page associated with the starting value of the BETWEEEN clause, and then just start reading the leaf pages until it gets to the last value of the BETWEEN clause. The clustered index key makes an extremely efficient access method for operations that need to just chunk through sorted data until it gets to some stopping point. Since the clustered index data is sorted based on the key columns, it is useful for these sequential processing types of queries.
In order to avoid page splits you should select clustered index key columns that are infrequently updated. Remember the data for a row is stored in the leaf pages of the clustered index in clustered key order. If the value of the clustered index key column values changes then all of the data for the row will need to be moved to another location based on the cluster index key. Moving of this data will cause your clustered index data to become fragmented (larger), and will start slowing down queries that use the clustered index because they will need to perform more I/O to read the split index pages.
Picking the Right Index Columns
Picking the right clustered index columns does not happen by accident. You need to evaluate how your applications and/or users are accessing your data. By understanding, what columns are being used in the JOIN and WHERE clauses and understanding the distribution of column values will help you make good choices when developing your clustered index key.