Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum Rss Feed

» Database Journal Home
» Database Articles
» Database Tutorials
MS Access
SQL Scripts & Samples
» Database Forum
» Slideshows
Free Newsletters:

News Via RSS Feed

Rss Feed

Database Journal |DBA Support |SQLCourse |SQLCourse2

Featured Database Articles


Posted Sep 4, 2000

Speeding Up SELECT DISTINCT Queries

By Neil Boyle

Many people use the DISTINCT option in a select statement to filter out duplicate results from a query's output. Take this simple PUBS database query as an example:

  from authors

In a simple select from one table (like the one above) this is the easiest and quickest way of doing things.

However, with a more complex query you can think about re-coding it to gain a performance advantage. Take this example query, which only returns authors that have a book already published.

  from authors a 
    join titleAuthor t
    on   t.au_id = a.au_id

Here we only want to see unique names of authors who have written books. The query will work as required, but we can get a small performance improvement if we write it like this:

select  au_fname,
  from  authors a
  where exists (
    select *
    from   titleAuthor t
    where  t.au_id = a.au_id

The reason the second example runs slightly quicker is that the EXISTS clause will cause a name to be returned when the first book is found, and no further books for that author will be considered (we already have the author's name, and we only want to see it once)

On the other hand, the DISTINCT query returns one copy of the author's name for each book the author has worked on, and the list of authors generated subsequently needs to be examined for duplicates to satisfy the DISTINCT clause.

You can examine the execution plan for each query to see where the performance improvements come from. For example, in SQL 6.5 you will normally see a step involving a Worktable mentioned for the "DISTINCT" version, which does not happen in the EXISTS version. In SQL 7 you can generate a graphical execution plan for the two queries and more easily compare them.

The performance improvement you get depends on the ratio of matching rows in the left and right (or inner and outer) tables. The query below will work in any SQL Server database. Try pasting the two queries into ISQL or Query Analyser and comparing the execution plan and I/O costs the two produce in different databases. The second query usually comes out as more efficient, though the actual performance gain varies.

select DISTINCT o.name
  from   sysobjects o
  join   sysindexes i
  on     o.id = i.id
  where  o.type = 'U'

select   o.name
  from   sysobjects o
  where  o.type = 'U'
  and    exists (
      select 1
      from   sysindexes i
      where  o.id = i.id

You need to understand the relationship between the two (or more) tables you are joining in order to execute this trick properly. The two Northwind database queries below are designed to return customer IDs where a discount of more than 2 per cent has been given on any item. At first sight like they should produce the same results because they follow the format in the examples above, but the results you get are actually different in this case.

select DISTINCT customerID
  from  orders o
 join   [order details] od
  on    o.OrderID = od.OrderID
  where discount > 0.02

select customerID
  from  orders o
  where exists (
      select *
      from   [order details] od
      where  o.OrderID = od.OrderID
      and    discount > 0.02

These examples do not match up because it is OrderID that defines the relationship between the two tables, not the customer name. The second query will return multiple customer names--one for each order placed by the customer. Try adding the OrderID column into the SELECT list to see this.

MS SQL Archives

Latest Forum Threads
MS SQL Forum
Topic By Replies Updated
SQL 2005: SSIS: Error using SQL Server credentials poverty 3 August 17th, 07:43 AM
Need help changing table contents nkawtg 1 August 17th, 03:02 AM
SQL Server Memory confifuration bhosalenarayan 2 August 14th, 05:33 AM
SQL Server Primary Key and a Unique Key katty.jonh 2 July 25th, 10:36 AM