This is part 19 of a series. If you have not read part 18,
you can find it here,
but this article does not require any knowledge of the previous work in the
series.
Please … I Can Read Books Online
There are many articles and blogs that discuss SQL Server
2005’s ranking
functions (RANK(), DENSE_RANK(), NTILE(), and ROW_NUMBER()). I will try to
cover a lot more here than you will find in a quick read of Books Online.
Table and Data for Illustration
To illustrate these functions, I will be using a simple
table that stores the results of a vegetable-growing contest. I don’t know if
this kind of contest happens outside the US, but in the rural United States,
the grower that has the biggest pumpkin at the county fair is a real celebrity.
The biggest pumpkin here is 716 pounds, or 325 kg!
Here is the table:
CREATE TABLE dbo.ContestResults (
ColID int NOT NULL IDENTITY,
Category varchar(10) NOT NULL,
Weight int NOT NULL,
Entrant nvarchar(20) NOT NULL,
CONSTRAINT PK_ContestResults PRIMARY KEY CLUSTERED (ColID)
)
There are three categories: pumpkins, squash, and
watermelon. Most of this is based on real data I found on the Internet, but I
made up some of it to help explore these functions. (Do you recognize the last
five watermelon growers?)
Here is the data:
ColID |
Category |
Weight |
Entrant |
1 |
Pumpkin |
716 |
Chad Johnson |
2 |
Pumpkin |
679 |
George Kopsell |
3 |
Pumpkin |
679 |
Dan Gardner |
4 |
Pumpkin |
481 |
John Suydam |
5 |
Pumpkin |
452 |
Mark Bardin |
6 |
Pumpkin |
442 |
Bill Kallas |
7 |
Pumpkin |
428 |
Theresa Helmer |
8 |
Pumpkin |
426 |
Terry Helmer |
9 |
Pumpkin |
346 |
Gary Spiel |
10 |
Pumpkin |
331 |
Kevin Rabell |
11 |
Pumpkin |
289 |
Jan Spiel |
12 |
Pumpkin |
247 |
Harvey Zale |
13 |
Pumpkin |
229 |
Harvey Zale |
14 |
Squash |
462 |
Dan Gardner |
15 |
Squash |
462 |
Harvey Zale |
16 |
Squash |
435 |
Terry Helmer |
17 |
Squash |
405 |
Gary Spiel |
18 |
Watermelon |
146 |
Mark Bardin |
19 |
Watermelon |
139 |
Christine Daaé |
20 |
Watermelon |
139 |
Carlotta Giudicelli |
21 |
Watermelon |
132 |
Ubaldo Piangi |
22 |
Watermelon |
132 |
Meg Giry |
23 |
Watermelon |
129 |
Joseph Buquet |
The code to build the table is here.
We have been asked to decide who gets first-, second-, and
third-place prizes in this contest. We will use ranking functions to make these
decisions.
RANK() and DENSE_RANK()
If we were asked to give out prizes, the RANK() and
DENSE_RANK() functions would seem to be a reasonable choice. First, let’s find
the simple ranking for the whole set.
RANK() and DENSE_RANK() with ORDER BY
Query:
SELECT
Category,
Weight,
Entrant,
RANK() OVER (
ORDER BY Weight DESC
) AS [Rank],
DENSE_RANK() OVER (
ORDER BY Weight DESC
) AS DenseRank
FROM dbo.ContestResults
Partial Results:
Category Weight Entrant Rank DenseRank
———- —— ——————– —- ———
Pumpkin 716 Chad Johnson 1 1
Pumpkin 679 George Kopsell 2 2
Pumpkin 679 Dan Gardner 2 2
Pumpkin 481 John Suydam 4 3
Squash 462 Dan Gardner 5 4
Squash 462 Harvey Zale 5 4
…
Watermelon 139 Christine Daaé 19 17
Watermelon 139 Carlotta Giudicelli 19 17
Watermelon 132 Ubaldo Piangi 21 18
Watermelon 132 Meg Giry 21 18
Watermelon 129 Joseph Buquet 23 19
Notice that RANK() gives us a low of 1 and high of 23
(matching the number of records). If RANK() is over a column where the last
value is duplicated, the last Rank number will not match number of
records.
These results are interesting, but since we have multiple
categories, this does not give us the answers we need.
RANK() and DENSE_RANK() with PARTITION BY and ORDER BY
Query:
SELECT
Category,
Weight,
Entrant,
RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS [Rank],
DENSE_RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS DenseRank
FROM dbo.ContestResults
Partial Results:
Category Weight Entrant Rank DenseRank
———- ———– ——————- —- ———
Pumpkin 716 Chad Johnson 1 1
Pumpkin 679 George Kopsell 2 2
Pumpkin 679 Dan Gardner 2 2
Pumpkin 481 John Suydam 4 3
…
Squash 462 Dan Gardner 1 1
Squash 462 Harvey Zale 1 1
Squash 435 Terry Helmer 3 2
Squash 405 Gary Spiel 4 3
Watermelon 146 Mark Bardin 1 1
Watermelon 139 Christine Daaé 2 2
Watermelon 139 Carlotta Giudicelli 2 2
Watermelon 132 Ubaldo Piangi 4 3
Watermelon 132 Meg Giry 4 3
Watermelon 129 Joseph Buquet 6 4
Now we have useful results. If the requirement is
that we give out exactly three prizes, then we can pull from the first three
entries in each category using Rank. If the requirement is that we give out
prizes for the top three weights, and we can give more than one prize for any
ties, then we can pull from the entries in each category that have a DenseRank
of 3 or less.
Using PARTITION BY and ORDER BY with a Common Table Expression
Since we need just the top entries in each category, it
would be nice to do something like this (using DENSE_RANK):
SELECT
Category,
Entrant,
DENSE_RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS DenseRank
FROM dbo.ContestResults
WHERE DenseRank <= 3
Well, that doesn’t work (“Invalid column name ‘DenseRank’.”).
I have seen people use table variables to store the results
and then SELECT from the results. The other option is to use a common table
expression. Here are the two versions of the query that do work:
Table Variable:
DECLARE @t TABLE (
Category varchar(10),
Entrant nvarchar(20),
DenseRank bigint
)
INSERT INTO @t (
Category,
Entrant,
DenseRank
)
SELECT
Category,
Entrant,
DENSE_RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS DenseRank
FROM dbo.ContestResults
SELECT *
FROM @t
WHERE DenseRank <= 3
ORDER BY Category, DenseRank, Entrant
Common Table Expression:
WITH t
AS (
SELECT
Category,
Entrant,
DENSE_RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS DenseRank
FROM dbo.ContestResults
)
SELECT
Category,
Entrant,
DenseRank
FROM t
WHERE DenseRank <= 3
ORDER BY Category, DenseRank, Entrant
Results of Either Query:
Category Entrant DenseRank
———- ——————- ———
Pumpkin Chad Johnson 1
Pumpkin Dan Gardner 2
Pumpkin George Kopsell 2
Pumpkin John Suydam 3
Squash Dan Gardner 1
Squash Harvey Zale 1
Squash Terry Helmer 2
Squash Gary Spiel 3
Watermelon Mark Bardin 1
Watermelon Carlotta Giudicelli 2
Watermelon Christine Daaé 2
Watermelon Meg Giry 3
Watermelon Ubaldo Piangi 3
I think the CTE code is much simpler to read. It is
certainly more compact, though that does not always mean it’s easier to read or
maintain.
Execution Plan Differences (Table Variable versus CTE)
What about performance? Yes, I know this is a trivial
example, but the differences are interesting anyway.
The execution plans from these two queries show that the
table variable version is 60% of the batch, where the CTE version is 40% of the
batch. See the graphic plan below:
Using RANK() in the ORDER BY Clause
What if you just want to pull the top three entries in a
single category? Most examples use RANK()in the SELECT clause, but you can also
use RANK() [and DENSE_RANK()] in the ORDER BY clause.
Query:
SELECT TOP 3
Entrant,
Weight
FROM dbo.ContestResults
WHERE Category = ‘Watermelon’
ORDER BY RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
)
Result:
Entrant Weight
——————- ———–
Mark Bardin 146
Carlotta Giudicelli 139
Christine Daaé 139
An important point here is that you would only want to use
ranking functions in the ORDER BY clause of a query if you are using PARTITION
BY. If you left out the PARTITION BY, you would be left with a needlessly complicated
ORDER BY clause. These two queries illustrate this. They produce the exact same
result.
SELECT …
FROM dbo.ContestResults
ORDER BY DENSE_RANK() OVER (ORDER BY Weight DESC)
SELECT …
FROM dbo.ContestResults
ORDER BY Weight DESC
See the queries here
and the graphic plan below.
Using RANK() in the SELECT and ORDER BY Clauses
It is interesting to note that, based on the query plan,
including the Rank in the SELECT list does not incur a penalty. The work is
done once and used twice.
Query:
SELECT TOP 3
Entrant,
Weight,
RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
) AS [Rank]
FROM dbo.ContestResults
WHERE Category = ‘Watermelon’
ORDER BY RANK() OVER (
PARTITION BY Category
ORDER BY Weight DESC
)
Result:
Entrant Weight Rank
——————- —— —-
Mark Bardin 146 1
Carlotta Giudicelli 139 2
Christine Daaé 139 2
See the queries here
and the graphic plan below.
Conclusion
My plan was to cover all of the ranking functions in this
article, but once I got started, there was just too much interesting stuff to
cover. Next month, I will cover NTILE() and ROW_NUMBER().
A Challenge
It is possible to use RANK() with different PARTITION
or ORDER BY in the SELECT clause and ORDER BY clauses, but I was unable to come
up with something that actually produced worthwhile results. Using the sample
table and data presented here, can you come up with a worthwhile SELECT
statement that uses RANK() or DENSE_RANK() in both the SELECT and ORDER BY
clauses where they don’t match? If you can, submit your query and an
explanation through
the forum.