Exploring SQL Server's Index INCLUDEs
November 28, 2008
Download the .sql files for this article.
This is part 21 of the "SqlCredit Developing a Complete SQL Server OLTP Database Project" series. If you have not read part 20, you can find it here, but this article does not require any knowledge of previous work in this series.
Please ... I Can Read Books-Online
Many articles have been written about index INCLUDEs, so I dont want to cover the same things. My specific interests here are to understand how index INCLUDEs affect performance and provide a set of recommendations for when to use them.
First, the Basics Standard Index Restrictions
Standard indexes have two restrictions that affect their usefulness: the 900-byte limit and the sixteen-column limit. Here is a relatively simple script that illustrates these limitations.
Interesting Results from the Script
Creating an index on a 901-character (ASCII) or 451-character (UNICODE) column gives a warning but is allowed. If you actually attempt to write a 901/451-character string into that column, the INSERT will fail.
Creating an index on a 900-character (ASCII) or 450-character (UNICODE) column gives no warning as expected. If you write a 901/451-character string into that column, the INSERT succeeds with a warning (String or binary data would be truncated.).
Attempting to add an index to a varchar(MAX) or nvarchar(MAX) column will bring this error: Column 'ColAsciiMax' in table 'dbo.LimitTest1' is of a type that is invalid for use as a key column in an index.
The Basics of Index INCLUDEs
Creating an index with an included column is quite simple:
CREATE NONCLUSTERED INDEX <indexName> on <tableName> (<KeyColumns>) INCLUDE (<IncludedColumns>)
Throughout this article and the supporting scripts, I will use KeyColumn(s) to indicate the the column or columns on which the index is based and IncludedColumn(s) to indicate non-key columns to be added to the leaf level of the nonclustered index.
In the LimitTest script, I built an index with 23 IncludedColumns. Two of those columns are of a length that would have caused warnings, and two of the columns are of a type ([n]varchar(MAX)) that would have caused a normal index to fail. I inserted 1,000,000 characters into each of the MAX columns to prove they can hold more than 900 bytes as part of an INCLUDE column. The execution plan for this simple query on a one-row table shows that the engine chooses the INCLUDE index over the PK to return the two MAX columns:
SELECT LEN(ColAsciiMax) AS LenColAsciiMax, LEN(ColUnicodeMax) AS LenColUnicodeMax FROM dbo.LimitTest2
Duplicating the Clustered Index or a Part of It
You can duplicate the clustered index key or part or all of a multi-column clustered key in a non-clustered index with IncludedColumns to cover frequently-used queries. At first, this may seem useless, but it is a very interesting way to use INCLUDEs.
These tests are based on this Person table in which I purposely created wide rows:
CREATE TABLE dbo.Person ( PersonID int NOT NULL IDENTITY, PersonGUID uniqueidentifier NOT NULL DEFAULT newid(), FirstName nvarchar(40) NOT NULL, MiddleName nvarchar(40) NULL, LastName nvarchar(40) NOT NULL, Address1 nvarchar(40) NOT NULL, Address2 nvarchar(40) NULL, Address3 nvarchar(40) NULL, Address4 nvarchar(40) NULL, City nvarchar(40) NOT NULL, StateProvince nvarchar(40) NOT NULL, CountryCode char(2) NOT NULL, PostalCode nvarchar(20) NOT NULL, LoginName nvarchar(40) NOT NULL, PasswordHash char(43) NOT NULL, BirthDate datetime NOT NULL, Age int NOT NULL, Filler1 nvarchar(4000) NOT NULL, Filler2 nvarchar(4000) NOT NULL, Filler3 nvarchar(4000) NOT NULL, Filler4 nvarchar(4000) NOT NULL, CONSTRAINT PK_Person PRIMARY KEY CLUSTERED ( PersonID ) ) CREATE NONCLUSTERED INDEX IX_Person_PersonID_FirstName_LastName_Age_WithIncl ON dbo.Person ( PersonID ) INCLUDE ( FirstName, LastName, Age )
For the tests, the table is populated with one million rows.
Say you have an often-used query that selects FirstName, LastName, and Age based on the PersonID (the clustered index key). Adding an index on PersonID with IncludedColumns of FirstName, LastName, and Age will allow the optimizer to use a highly-selective, narrow, covering index to fulfill the query.
When Will This INCLUDE Index Be Used?
Range Queries (WHERE ... BETWEEN ...)
For single-row queries, the optimizer consistently uses the clustered primary key. If I force it to use the INCLUDE index, the plan shows the same cost for both queries, and STATISTICS IO shows exactly the same results most of the time. Occasionally it shows one fewer logical read for the INCLUDE index (depending on the number of rows in the table).
I changed the WHERE clause from
WHERE PersonID = @personID
WHERE PersonID BETWEEN @personID AND (@personID + 0)
Using index hints, I forced one version of the query to use the clustered PK and one to use the INCLUDE index, regardless of what the optimizer might choose. With the switch detailed above, the performance changed dramatically. In this plus zero query, the Estimated Number of Rows jumps from 1 to 90,000 for the query using the clustered PK. The batch showed the clustered PK query taking 81% and the INCLUDE query taking just 19%. The STATISTICS IO results, however, matched exactly.
If the range is changed from zero to 999 (returning 1,000 records), the logical read count from STATISTICS IO is 95 (PK) versus 24 (INCLUDE).
The lesson? When querying for a range of rows, the optimizer will pick the INCLUDE index regardless of the size of the range.
List Queries (WHERE ... IN ...)
If the query was changed from
WHERE PersonID = @personID
WHERE PersonID IN (@personID)
the optimizer continued to use the clustered PK when there was just one record in the list. If I forced the INCLUDE index, the statistics and plans matched exactly (except the index used).
As soon as I added a second entry to the list, the optimizer chose the INCLUDE index instead of the clustered index. This remained true as the list grew. When I grew the list to 270 entries (1, 101, 201, ...), the statistics looked like this:
Clustered Index: Scan count 0, logical reads 1532, physical reads 1, read-ahead reads 279, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. INCLUDE Index: Scan count 0, logical reads 1372, physical reads 1, read-ahead reads 270, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
This is not a huge win, but it is a definite improvement. Called enough times, this could be significant.
Create Multiple Indexes With the Same Keys and Different INCLUDEs?
What if you had a static table of reference data with a cluster on ColID and three different frequently-called queries that return four columns each, none of which overlaps?
Since this is reference data, you should consider covering each of these queries. The question is, do you cover all three queries with one index on ColID that INCLUDEs all nine columns, or do you create three separate columns on ColID that INCLUDE only three of the columns (one to cover each query)?
Remember that one of the benefits of using a non-clustered index with INCLUDEs is that more index entries will fit on a single page. You will have to consider the cost of multiple indexes against the benefits of having lighter-weight indexes. Either choice is reasonable, but both should be considered.