Many people use the DISTINCT option in a select statement to filter out
duplicate results from a query’s output. Take this simple PUBS database query as
an example:
select DISTINCT
au_fname,
au_lname
from authors
In a simple select from one table (like the one above) this is the easiest
and quickest way of doing things.
However, with a more complex query you can think about re-coding it to gain a
performance advantage. Take this example query, which only returns authors that
have a book already published.
select DISTINCT
au_fname,
au_lname
from authors a
join titleAuthor t
on t.au_id = a.au_id
Here we only want to see unique names of authors who have written books. The
query will work as required, but we can get a small performance improvement if
we write it like this:
select au_fname,
au_lname
from authors a
where exists (
select *
from titleAuthor t
where t.au_id = a.au_id
)
The reason the second example runs slightly quicker is that the EXISTS clause
will cause a name to be returned when the first book is found, and no further
books for that author will be considered (we already have the author’s name,
and we only want to see it once)
On the other hand, the DISTINCT query returns one copy of the author’s name
for each book the author has worked on, and the list of authors generated
subsequently needs to be examined for duplicates to satisfy the DISTINCT
clause.
You can examine the execution plan for each query to see where the
performance improvements come from. For example, in SQL 6.5 you will normally
see a step involving a Worktable mentioned for the "DISTINCT" version,
which does not happen in the EXISTS version. In SQL 7 you can generate a
graphical execution plan for the two queries and more easily compare them.
The performance improvement you get depends on the ratio of matching rows in
the left and right (or inner and outer) tables. The query below will work in any
SQL Server database. Try pasting the two queries into ISQL or Query Analyser and
comparing the execution plan and I/O costs the two produce in different
databases. The second query usually comes out as more efficient, though the
actual performance gain varies.
select DISTINCT o.name
from sysobjects o
join sysindexes i
on o.id = i.id
where o.type = 'U'
select o.name
from sysobjects o
where o.type = 'U'
and exists (
select 1
from sysindexes i
where o.id = i.id
)
You need to understand the relationship between the two (or more) tables you
are joining in order to execute this trick properly. The two Northwind database
queries below are designed to return customer IDs where a discount of more than
2 per cent has been given on any item. At first sight like they should produce
the same results because they follow the format in the examples above, but the
results you get are actually different in this case.
select DISTINCT customerID
from orders o
join [order details] od
on o.OrderID = od.OrderID
where discount > 0.02
select customerID
from orders o
where exists (
select *
from [order details] od
where o.OrderID = od.OrderID
and discount > 0.02
)
These examples do not match up because it is OrderID that defines the
relationship between the two tables, not the customer name. The second query
will return multiple customer names–one for each order placed by the
customer. Try adding the OrderID column into the SELECT list to see this.