The go to solution
for removing duplicate rows from your result sets is to include the distinct
keyword in your select statement. It tells the query engine to remove
duplicates to produce a result set in which every row is unique. Did you know
that the group by clause can also be used to remove duplicates? If not,
read on to find out what the main differences are between them and which to use
to produce a desired result.
The Distinct and Distinctrow Keywords
The distinct
keyword comes directly after the SELECT in the query
statement and replaces the optional all keyword, which is the default. Distinctrow
is an alias for distinct and produces the exact same results:
SELECT [ALL | DISTINCT | DISTINCTROW ]
select_expr
[FROM table_references
[WHERE where_condition]
To
illustrate how it works, lets select some data from the following table, which
contains a list of fruits and their colors:
|
name
|
color
|
|
apple
|
red
|
|
apple
|
green
|
|
apple
|
yellow
|
|
banana
|
yellow
|
|
banana
|
green
|
|
grape
|
red
|
|
grape
|
white
|
The
following query will retrieve all the fruit names from the table and list them
in alphabetical order:
SELECT name
FROM fruits;
Without
the color information, we have multiples of each fruit type:
|
name
|
|
apple
|
|
apple
|
|
apple
|
|
banana
|
|
banana
|
|
grape
|
|
grape
|
Now lets
try the query again with the distinct keyword:
SELECT DISTINCT name
FROM fruits;
As
expected, we now have only one instance of each fruit type:
If only it
were always that easy! A quick Internet search on the phrase sql eliminating
duplicates shows that theres more to removing duplicate values than inserting
the distinct keyword into your SELECT statements.
When are Duplicate Rows Not Duplicate Rows
One problem
that the distinct keyword does nothing to solve is that sometimes
removing duplicates creates misleading results. Observe the following scenario:
The client
wants to generate a list of their employees to generate some statistics. Heres
some SQL to do that:
SELECT name,
gender,
salary
FROM employees
ORDER BY name;
Strangely,
this produces duplicate rows for Kristen Ruegg:
|
Name
|
gender
|
salary
|
|
Allan
Smithie
|
m
|
4900
|
|
Barbara
Breitenmoser
|
f
|
(NULL)
|
|
Jon
Simpson
|
m
|
4500
|
|
Kirsten Ruegg
|
f
|
5600
|
|
Kristen Ruegg
|
f
|
5600
|
|
Peter
Jonson
|
m
|
5200
|
|
Ralph
Teller
|
m
|
5100
|
The client
responds that they dont want duplicates, so the developer adds the trusty distinct
keyword to the SELECT statement. This produces the desired results, except for
one small detail: There are two employees with the same name! Adding
the distinct keyword created incorrect results by removing a valid row.
Including the unique emp_id_number to the field list confirms that there
are indeed two Kristen Rueggs:
SELECT name,
gender,
salary,
emp_id_number
FROM employees
ORDER BY name;
Heres the
data in question showing the unique emp_id_numbers:
|
name
|
gender
|
salary
|
emp_id_number
|
|
Kirsten Ruegg
|
f
|
5600
|
3462
|
|
Kristen Ruegg
|
f
|
5600
|
2223
|
The moral
of the story is this: When using the distinct keyword, be sure that you
arent inadvertently removing valid data!
Comparing Distinct to Group By
Using distinct is logically equivalent to using group by on all
selected columns with no aggregate function. For such a query, group by
just produces a list of distinct grouping values. When displaying and grouping
by a single column, the query produces the distinct values in that column. However,
if you display and group by multiple columns, the query produces the distinct
combinations of values in each column. For example, the following query
produces the same set of rows as our first SELECT distinct did:
SELECT name
FROM fruits
GROUP BY name;
Similarly, the following
statement produces the same results as our SELECT distinct did on the employees
table:
SELECT name,
gender,
salary
FROM employees
GROUP BY name;
A difference between distinct
and group by is that group by causes row sorting. Hence:
SELECT name,
gender,
salary
FROM employees
GROUP BY name;
...is the
same as:
SELECT DISTINCT name,
gender,
salary
FROM employees
ORDER BY name;
Counting Duplicates
Distinct can be
used with the COUNT() function to count how many distinct values a column
contains. COUNT(distinct expression) counts the number of distinct
(unique) non-NULL values of the given expression. The expression can be a
column name to count the number of distinct non-NULL values in the column.
Heres the full employee
table data:
|
id
|
dept_id
|
gender
|
name
|
salary
|
emp_id_number
|
|
1
|
2
|
m
|
Jon
Simpson
|
4500
|
1234
|
|
2
|
4
|
f
|
Barbara
Breitenmoser
|
(NULL)
|
9999
|
|
3
|
3
|
f
|
Kirsten
Ruegg
|
5600
|
3462
|
|
4
|
1
|
m
|
Ralph
Teller
|
5100
|
6543
|
|
5
|
2
|
m
|
Peter
Jonson
|
5200
|
9747
|
|
6
|
2
|
m
|
Allan
Smithie
|
4900
|
6853
|
|
7
|
4
|
f
|
Kirsten
Ruegg
|
5600
|
2223
|
|
8
|
3
|
f
|
Kirsten
Ruegg
|
4400
|
2765
|
Applying the Count
distinct function on the name field produces six unique names:
SELECT Count(DISTINCT name)
FROM employees;
It's also possible to
give a list of expressions separated by commas. In this case, COUNT() returns
the number of distinct combinations of values that contain no NULL values. The
following query counts the number of distinct rows for which neither the name
nor salary is NULL:
SELECT Count (DISTINCT name, salary)
FROM employees;
|
Count(DISTINCT
name, salary)
|
|
6
|
You can also group
counts of duplicates per group using a bit of math in conjunction with the group
by clause. Heres a query to count duplicated names for each department:
SELECT dept_id,
COUNT(*) - COUNT(DISTINCT name) AS 'duplicate names'
FROM employees
GROUP BY dept_id;
|
dept_id
|
duplicate names
|
|
1
|
0
|
|
2
|
0
|
|
3
|
1
|
|
4
|
0
|
These
queries help you characterize the extent of duplicates, but don't show you
which values are duplicated. To see which names are duplicated in the employees
table, use a summary query that displays the non-unique values along with the
counts:
SELECT dept_id,
name,
count(name) as name_count
FROM employees
GROUP BY name,
dept_id;
|
dept_id
|
name
|
name_count
|
|
2
|
Allan
Smithie
|
1
|
|
4
|
Barbara
Breitenmoser
|
1
|
|
2
|
Jon
Simpson
|
1
|
|
3
|
Kirsten
Ruegg
|
2
|
|
4
|
Kirsten
Ruegg
|
1
|
|
2
|
Peter
Jonson
|
1
|
|
1
|
Ralph
Teller
|
1
|
Since
were only interested in duplicates, we can filter out everything else using
the HAVING clause. Its like a WHERE clause, except that its used with group
by to narrow down the results:
SELECT dept_id,
name,
count(name) as name_count
FROM employees
GROUP BY name,
dept_id
HAVING name_count > 1;
Now we can see which
names are duplicated, as well as how many there are:
|
dept_id
|
name
|
name_count
|
|
3
|
Kirsten
Ruegg
|
2
|
Displaying Per-Group Minimum or Maximum Values in Duplicated Rows
As we saw in the last
example, the group by clause causes aggregate functions to be applied
for each unique value in the field list. You should be aware that columns that
are not in the group by field list do not necessarily belong to the same
row as the aggregated values! An example is definitely in order here. The
following query displays the highest salary for each department:
SELECT dept_id,
name,
gender,
max(salary) as max_salary
FROM employees
GROUP BY dept_id;
The
intention is to also display information about the individual who earns the highest
salary. However, that is not what is returned here:
|
dept_id
|
name
|
gender
|
max_salary
|
|
1
|
Ralph
Teller
|
m
|
5100
|
|
2
|
Jon
Simpson
|
m
|
5200
|
|
3
|
Kirsten
Ruegg
|
f
|
5600
|
|
4
|
Barbara
Breitenmoser
|
f
|
5600
|
The
problem is that the salary is the only aggregated field because the Max()
aggregate function is applied to it. Consequently, the first name and gender
values encountered for each group by field are what are displayed.
Looking at the table, youll see that, while Ralph Teller is the only member of
department 1, Jon Simpson only earned $4500. Peter Jonson is really the
owner of that distinction, but the query engine selected the first name
and gender that it came across having a dept_id of 2.
The
solution is to join the GROUP_BY results with the original table using the
grouped fields. In this case, we only have one field, and that is the salary:
SELECT emp2.dept_id,
emp1.name,
emp1.gender,
emp2.max_salary
FROM (
SELECT dept_id,
Max(salary) as max_salary
FROM employees
GROUP BY dept_id
) as emp2 JOIN employees as emp1 ON emp1.salary = emp2.max_salary
GROUP BY dept_id;
Now the name
and gender fields belong to the earner of the greatest salary:
|
dept_id
|
name
|
gender
|
max_salary
|
|
1
|
Ralph
Teller
|
m
|
5100
|
|
2
|
Peter
Jonson
|
m
|
5200
|
|
3
|
Kirsten
Ruegg
|
f
|
5600
|
|
4
|
Kirsten
Ruegg
|
f
|
5600
|
There are
other techniques that were not covered, such as the use of temporary tables and
dynamic SQL. Here is more
in-depth information on removing duplicate records. This
article discusses the group by and HAVING clauses in more detail.
»
See All Articles by Columnist
Rob Gravelle