It always seems like the creation of indexes on a database
are always the last thing developers and database designers think about when
designing a database. Indexes are almost an afterthought. Developers seem to
only place indexes on tables after they notice one of their queries are running
slow. In addition, many times that does not happened until they place their
applications into production. In my prior article, Database Indexing Development Lifecycle...Say What?,
I talked about how it is important to have an index development lifecycle. This
lifecycle allows you to design, build and maintain indexes throughout the
entire application development lifecycle. In this article, I am going to
expand on my indexing guideline discussion to address some of the dos and
donts of indexing that will help you pick reasonable indexes at design time.
General Dos and Donts
You do not want to place too many indexes on tables that are
frequently updated. A large number of INSERT, UPDATE, and DELETE statements
being performed on a table will require the indexes on a table to be updated
each time a column value is changed for one of the index columns. The more
indexes you have on a table, the longer and more work the INSERT, UPDATE and
DELETE statements will take. By minimizing the number of indexes you have on
highly updated tables you can optimizes those update statements. Of course, if
you have tables that are infrequently update then there is less of an impact in
having a lot of indexes.
The size of the index key you pick is also important. A
large index key will cause your index to take up more disk space. The bigger
the index the more I/O that is required to read the index b-tree structure and
the less index values return with each I/O operation. More I/O equates to
slowing queries. Therefore, you want to pick index columns that are narrow in
size.
Query Considerations that Affect Indexing
Indexes are useful for helping queries that have JOINs and
WHERE clauses. You should review the kinds of queries your application is
running. You need to look for those queries that are frequently run and that
have JOIN and WHERE logic. Look at the columns used in the JOIN and WHERE
clause. You should consider these columns for building your clustered and
non-clustered indexes. By creating indexes on the columns that are used on the
JOIN and WHERE clause will greatly enhance the performance of these frequently
run queries.
Be careful not to create indexes for queries that are only
run occasionally. Indexes that a created to support queries that are run
infrequently, may cost more to maintain in the long run than a table scan ever
once in a while. Knowing which of your application commands are being more
frequently executed will help guide you as to which indexes to build. Knowing how
your users use your application code can help you identify frequent and
infrequently executed TSQL code, but you can also rely on DVMs to obtain query
statistics and missing index statistics to will help with this analysis.
Understanding how to use DMVs for this analysis is outside the scope of this
article.
Creating covering indexes can greatly improve performance.
A covering index is an index that contains all of the columns within your
query, even if the columns are only being referenced in the selection list and
are not contained in a JOIN or WHERE predicate. With SQL Server 2005 you can
create covering indexes to include those covering columns you dont need in
your search criteria by using the INCLUDE option within the CREATE INDEX
statement. The INCLUDE option is a new indexing feature that was introduced in
SQL Server 2005. If you are migrating, or have migrated your databases from
SQL Server 2000 to SQL Server 2005 or 2008 you should recreate your covering
indexes to use the INCLUDE clause. By doing, this you will minimize the space
that SQL Server requires for the index. Not only that, but you will also
improve the performance of the process of searching through those covering
indexes. This is because the values of the columns in the INCLUDE clause are
only stored in the leaf level pages. This makes your upper level index pages
more compact, because they only contain the real index column values. This
means less I/Os are required to search through those upper index pages to
resolve the JOIN and WHERE predicates.
If you are using the ORDER BY clause so the data is returned
to your application in a particular sorted order, then you should consider
building indexes on the columns used in the ORDER BY clause. Having an index to support the
ORDER BY column might help optimize your query. When the data in the index is
sorted in the same order as you want your data returned, as specified by the
ORDER BY clause, SQL Server can just go right through the leaf level pages
returning information without having to perform a SORT operation prior to
returning the data. In order for you to see what I am talking about, you can
run the following code from within a query window within SQL Server Management
Studio while having the Include Actual Execution Plan item enabled:
USE AdventureWorks;
GO
SELECT CreditRating, Name FROM Purchasing.Vendor
ORDER BY CreditRating;
CREATE NONCLUSTERED INDEX [IDX_Vendor_CreditRating]
ON [Purchasing].[Vendor]
(
[CreditRating] ASC
)
INCLUDE ( [Name]);
SELECT CreditRating, Name FROM Purchasing.Vendor
ORDER BY CreditRating;
DROP INDEX Purchasing.Vendor.IDX_Vendor_CreditRating;
When you run this code, you will notice the first SELECT
statement requires a SORT operation to return the data from the Purchase.Vendor
table in CreditRating order, whereas once the non-clustered index
IDX_Vendor_CreditRating is created, the second SELECT statement does not
require a sort operation. In this example, very few records need to be
sorted, but if SQL Server had to perform a SORT operation on a large record
set, then the resources might be substantial.
Clustered Index Consideration
Selecting your clustered index key columns is very important. Remember a
clustered index contains both the index key values and the all the data for every
column in the table. In addition, the data is stored in sorted order based on
the clustered index key. Here are some guidelines to consider when selecting a
clustered index key.
First, your clustered index key should be as short as
possible (number of bytes). This is because the clustered index key value is
used in all non-clustered indexes to tie together the non-clustered indexes
values to the clustered index rows based on the clustered index key. This is
so SQL Server can quickly identify the clustered index leaf pages needed to
retrieve the additional columns not contained in the non-clustered index. The
larger the length of your clustered index key the larger the non-clustered
index will be, and the more I/O that will be required when reading the
non-clustered leaf and clustered index pages.
Your clustered index key columns should be used frequently
in search criteria, and should be highly selective. A highly selective column
is a column that has a large number of different values, where each unique
value is only associated with a small percentage of total rows. The more
selective the column the fewer index pages that will need to be read when
searching for rows that meet a particular search criteria. Imagine how much of
an index would have to be read if you had a clustered index on say SEX that had
a value of M and F. 50% of the index would need to be read to find all
the records where M was the SEX if there was an even distribution of males
and females. Sex would not be a very selective column to pick.
Columns that are used in the GROUP BY clause can be good
candidates for clustered index key columns. Your data will be sorted by the
key columns. The GROUP BY clause requires the results set be sorted first,
prior to performing the GROUP BY aggregate functions. Therefore, when
grouping on the clustered index key columns, SQL Server will not have to
perform a sort operation prior to grouping the data.
Columns that are accessed sequentially make good candidates
for clustered index keys. One example of sequential processing might be using
the BETWEEN clause in a WHERE statement. If the column used in the BETWEEN
clause is the clustered index key, then SQL Server can just find the first leaf
page associated with the starting value of the BETWEEEN clause, and then just
start reading the leaf pages until it gets to the last value of the BETWEEN
clause. The clustered index key makes an extremely efficient access method
for operations that need to just chunk through sorted data until it gets to
some stopping point. Since the clustered index data is sorted based on the key
columns, it is useful for these sequential processing types of queries.
In order to avoid page splits you should select clustered
index key columns that are infrequently updated. Remember the data for a row
is stored in the leaf pages of the clustered index in clustered key order. If
the value of the clustered index key column values changes then all of the data
for the row will need to be moved to another location based on the cluster
index key. Moving of this data will cause your clustered index data to become
fragmented (larger), and will start slowing down queries that use the clustered
index because they will need to perform more I/O to read the split index pages.
Picking the Right Index Columns
Picking the right clustered index columns does not happen by
accident. You need to evaluate how your applications and/or users are
accessing your data. By understanding, what columns are being used in the JOIN
and WHERE clauses and understanding the distribution of column values will help
you make good choices when developing your clustered index key.
»
See All Articles by Columnist Gregory A. Larsen