Exploring SQL 2005's Ranking Functions - NTILE() and ROW_NUMBER()October 24, 2008 This is part 20 of a series. If you have not read part 19, I suggest starting here. That article is the first of this two-part coverage of ranking functions. Table and Data for IllustrationThis article will use the same table and data as Part 19, so I wont cover that again. The code to build and populate the table is here. NTILE()NTILE() breaks the results into groups. In the simplest case, if we pass the integer 2 to NTILE() using Weight, it will rank each entry as in either the first or second half based on Weight. NTILE() with ORDER BY OnlyQuery:
SELECT
Category,
Weight,
Entrant,
NTILE(2) OVER (
ORDER BY Weight DESC
) AS [Ntile]
FROM dbo.ContestResults
Partial Results:
Category Weight Entrant Ntile
---------- ------ --------------- -----
Pumpkin 716 Chad Johnson 1
Pumpkin 679 George Kopsell 1
...
Watermelon 132 Meg Giry 2
Watermelon 129 Joseph Buquet 2
This is valid, but its not terribly interesting or valuable. Lets add a PARTITION BY clause so we get top- and bottom-half entries for each category. NTILE() with PARTITION BY and ORDER BYQuery:
SELECT
Category,
Weight,
Entrant,
NTILE(2) OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS [Ntile]
FROM dbo.ContestResults
Partial Results:
Category Weight Entrant Ntile
---------- ------ --------------- --------------------
Pumpkin 716 Chad Johnson 1
...
Pumpkin 229 Harvey Zale 2
Squash 462 Dan Gardner 1
...
Squash 405 Gary Spiel 2
Watermelon 146 Mark Bardin 1
...
Watermelon 129 Joseph Buquet 2
Now we have something useful. Passing Different int Values to NTILE()Passing 1 to NTILE() is worthless; it gives us 1 for each entry. But what if we pass a larger number to NTILE() than the number of result set rows? In our test table, we have four entries in the squash category. Lets try passing different values to NTILE(), starting with 2. Query:
SELECT
...
NTILE(2) OVER (
ORDER BY Weight DESC
) AS [Ntile]
FROM dbo.ContestResults
WHERE Category = 'Squash'
Results:
Category Weight Entrant Ntile
-------- ------ ------------ -----
Squash 462 Dan Gardner 1
Squash 462 Harvey Zale 1
Squash 435 Terry Helmer 2
Squash 405 Gary Spiel 2
Sure, this splits the entries into two halves. Now what about 3?
Category Weight Entrant Ntile
-------- ------ ------------ -----
Squash 462 Dan Gardner 1
Squash 462 Harvey Zale 1
Squash 435 Terry Helmer 2
Squash 405 Gary Spiel 3
Notice that the first two entries get tile 1 and have the same weight. This is interesting, but its not consistent. I will explain the uneven distribution shortly. If we switch to 4, we get one tile per entry as expected. Any integer above 4 will return the same results for this set as 4.
Category Weight Entrant Ntile
-------- ------ ------------ -----
Squash 462 Dan Gardner 1
Squash 462 Harvey Zale 2
Squash 435 Terry Helmer 3
Squash 405 Gary Spiel 4
The tiling will split records that have the same weight into different tiles. Here, we are using 7 as the tile integer and pulling the thirteen pumpkin entries.
Category Weight Entrant Ntile
-------- ------ -------------- -----
Pumpkin 716 Chad Johnson 1
Pumpkin 679 George Kopsell 1
Pumpkin 679 Dan Gardner 2
Pumpkin 481 John Suydam 2
....
Pumpkin 229 Harvey Zale 7
Note that two entries have the exact same weight (679) but are split between tile 1 and 2. I tried rebuilding the table and putting Dan Gardners entry in first and, as you might expect, the tiles were swapped. So, for entries with equal weight, the order of the base records determines the tiling. Another pattern is that if you have n records in your result set and n-1 as your ntile integer, the first two entries always have a tile of 1. Dealing Cards The distribution is actually very simple; its just like dealing cards. Think of the tiles like card players. If you dealt five cards to four players, the first player would have an extra card. If you dealt one more card, it would go to the second player. If you think of the distribution of tiles like that, it will match the results. Using a Variable for the NTILE() int Value You can use a variable for the NTILE int value. This has clear value. Query:
DECLARE @i int
SET @i = 5
SELECT
...
NTILE(@i) OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS [Ntile]
FROM dbo.ContestResults
ROW_NUMBER()ROW_NUMBER() numbers the rows of the result set. Here is the simplest case. ROW_NUMBER() with ORDER BY OnlyQuery:
SELECT
...
ROW_NUMBER() OVER (
ORDER BY Weight DESC
) AS [RowNumber]
FROM dbo.ContestResults
Partial Results:
Category Weight Entrant RowNumber
---------- ------ ------------- ---------
Pumpkin 716 Chad Johnson 1
Pumpkin 679 Dan Gardner 2
...
Watermelon 132 Meg Giry 22
Watermelon 129 Joseph Buquet 23
Valid, yes, but fairly worthless. Adding a PARTITION BY clause starts to show the value. ROW_NUMBER() with PARTITION BY and ORDER BYQuery:
SELECT
...
ROW_NUMBER() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS [RowNumber]
FROM dbo.ContestResults
Partial Results:
Category Weight Entrant RowNumber
---------- ------ -------------- --------------------
Pumpkin 716 Chad Johnson 1
Pumpkin 679 Dan Gardner 2
...
Pumpkin 247 Harvey Zale 12
Pumpkin 229 Harvey Zale 13
Squash 462 Dan Gardner 1
Squash 462 Harvey Zale 2
Squash 435 Terry Helmer 3
Squash 405 Gary Spiel 4
Watermelon 146 Mark Bardin 1
Watermelon 139 Christine Daaé 2
...
Watermelon 132 Meg Giry 5
Watermelon 129 Joseph Buquet 6
Now we have row numbers starting at 1 for each category. This will allow us, for instance, to pull the top three entries from each category. This is really the best choice to fulfill our original task of picking who receives prizes. Using ROW_NUMBER() with PARTITION BY and ORDER BY with a Common Table ExpressionJust like RANK() and DENSE_RANK(), you cant do this:
SELECT
...
ROW_NUMBER() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS RowNumber
FROM dbo.ContestResults
WHERE RowNumber <= 3
Common Table Expression:
WITH t
AS (
SELECT
...
ROW_NUMBER() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS RowNumber
FROM dbo.ContestResults
)
SELECT
...
RowNumber
FROM t
WHERE RowNumber <= 3
ORDER BY Category, RowNumber
Results:
Category Weight Entrant RowNumber
---------- ------ ------------------- ---------
Pumpkin 716 Chad Johnson 1
Pumpkin 679 Dan Gardner 2
Pumpkin 679 George Kopsell 3
Squash 462 Dan Gardner 1
Squash 462 Harvey Zale 2
Squash 435 Terry Helmer 3
Watermelon 146 Mark Bardin 1
Watermelon 139 Christine Daaé 2
Watermelon 139 Carlotta Giudicelli 3
A Real-World ROW_NUMBER() Example: Most Recent Entry by PersonI have had to use ROW_NUMBER() multiple times recently for a specific requirement. We have a user action that happens multiple times, and we have to pull only the most recent entry for each person. ROW_NUMBER() is a huge winner here. Say were storing data for a pediatrician and need to pull information for the most recent visits by patients. This is a simple example, but you can see the power of using ROW_NUMBER() here. Query:
DECLARE @records TABLE (
Person int,
VisitDate datetime,
Purpose varchar(10)
)
INSERT INTO @records (Person, VisitDate, Purpose) VALUES ...
(100, '1992-12-31', 'Birth')
(100, '1993-01-12', 'CheckUp')
(100, '1993-03-22', 'Shots')
(100, '1993-04-19', 'EarInf')
(101, '2003-01-29', 'Birth')
(101, '2003-02-03', 'CheckUp')
(102, '2008-10-13', 'Birth')
SELECT * FROM @records
;WITH t
AS (
SELECT
Person,
VisitDate,
Purpose,
ROW_NUMBER() OVER (
PARTITION BY Person
ORDER BY VisitDate DESC
) AS RowNumber
FROM @records
)
SELECT
Person,
VisitDate,
Purpose
FROM t
WHERE RowNumber = 1
ORDER BY Person
Result:
Person VisitDate Purpose
------ ---------- -------
100 1992-12-31 Birth
100 1993-01-12 CheckUp
100 1993-03-22 Shots
100 1993-04-19 EarInf
101 2003-01-29 Birth
101 2003-02-03 CheckUp
102 2008-10-13 Birth
Person VisitDate Purpose
------ ---------- -------
100 1993-04-19 EarInf
101 2003-02-03 CheckUp
102 2008-10-13 Birth
This gives us exactly what we want with very readable code. ConclusionNTILE() may have a sweet-spot in some applications, but I dont see it being as useful as ROW_NUMBER(). I think ROW_NUMBER() is very useful. If you are a big fan of NTILE() and can explain where it is useful, please drop a comment in the forum. You can also add comments there about ROW_NUMBER() or any of the ranking functions. So far, no one has answered the challenge I posed in Part 19. Check it out and see if you can be the one that answers it. » See All Articles by Columnist Rob Garrison SqlCredit - Developing a Complete SQL Server OLTP Database Project
Performance Testing SQL 2008's Transparent Data Encryption
SQL Server 2008's Change Data Capture - Tracking the Moving Parts Performance Testing - SQL Server 2008 versus SQL Server 2005 Exploring SQL Server's Index INCLUDEs Exploring SQL 2005's Ranking Functions - NTILE() and ROW_NUMBER() SqlCredit - Part 19: Exploring SQL 2005's Ranking Functions - RANK() and DENSE_RANK() SqlCredit, Part 18: Exploring the Performance of SQL 2005's OUTPUT Clause SqlCredit - Part 17: Exploring SQL 2005's OUTPUT Clause SqlCredit - Part 16: The Cost of Bloat SqlCredit - Part 15: The Cost of Distribution SqlCredit - Part 14: The Cost of Translation SqlCredit - Part 13: More on Indexed Persisted Computed Columns SqlCredit - Part 12: Exploring Indexed Persisted Computed Columns SqlCredit - Part 11: Change Tracking Using History Records SqlCredit - Part 10: MAC Performance and Updating SqlCredit SqlCredit - Part 9: Message Authentication Codes SqlCredit Part 8: Comparing Encrypt/DecryptByCert and Encrypt/DecryptByKey SqlCredit Part 7: Performance Impact of EncryptByCert and DecryptByCert SqlCredit Part 6: Exploring EncryptByCert and DecryptByCert SqlCredit - Part 5: Adding Card, Vendor, and Purchase, Plus Much Refactoring SqlCredit - Part 4: Schema and Procedure Security SqlCredit - Part 3: Schema/Proc Updates and Automated Unit Testing SqlCredit - Part 2: Creating the Database, Tables, CRUD Procedures SqlCredit - Developing a Complete SQL Server OLTP Database Project |