Eliminating Duplicate Rows from MySQL Result Sets | Database Journal

Eliminating Duplicate Rows from MySQL Result Sets

Nov 12, 2010
6 minute read

The go to solution for removing duplicate rows from your result sets is to include the distinct keyword in your select statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique. Did you know that the group by clause can also be used to remove duplicates? If not, read on to find out what the main differences are between them and which to use to produce a desired result.

The Distinct and Distinctrow Keywords

The distinct keyword comes directly after the SELECT in the query statement and replaces the optional all keyword, which is the default. Distinctrow is an alias for distinct and produces the exact same results:

    SELECT [ALL | DISTINCT | DISTINCTROW ]
    select_expr
    [FROM table_references
    [WHERE where_condition]

To illustrate how it works, let’s select some data from the following table, which contains a list of fruits and their colors:

namecolor
applered
applegreen
appleyellow
bananayellow
bananagreen
grapered
grapewhite

The following query will retrieve all the fruit names from the table and list them in alphabetical order:

SELECT name
FROM fruits;

Without the color information, we have multiples of each fruit type:

name
apple
apple
apple
banana
banana
grape
grape

Now let’s try the query again with the distinct keyword:

SELECT DISTINCT name
FROM fruits;

As expected, we now have only one instance of each fruit type:

name
apple
banana
grape

If only it were always that easy! A quick Internet search on the phrase “sql eliminating duplicates” shows that there’s more to removing duplicate values than inserting the distinct keyword into your SELECT statements.

When are Duplicate Rows Not Duplicate Rows

One problem that the distinct keyword does nothing to solve is that sometimes removing duplicates creates misleading results. Observe the following scenario:

The client wants to generate a list of their employees to generate some statistics. Here’s some SQL to do that:

SELECT name,
       gender,       
       salary
FROM employees
ORDER BY name;

Strangely, this produces duplicate rows for “Kristen Ruegg”:

Namegendersalary
Allan Smithiem4900
Barbara Breitenmoserf(NULL)
Jon Simpsonm4500
Kirsten Rueggf5600
Kristen Rueggf5600
Peter Jonsonm5200
Ralph Tellerm5100

The client responds that they don’t want duplicates, so the developer adds the trusty distinct keyword to the SELECT statement. This produces the desired results, except for one small detail: There are two employees with the same name! Adding the distinct keyword created incorrect results by removing a valid row. Including the unique emp_id_number to the field list confirms that there are indeed two Kristen Rueggs:

SELECT name,
       gender,       
       salary,
       emp_id_number
FROM employees
ORDER BY name;

Here’s the data in question showing the unique emp_id_numbers:

namegendersalaryemp_id_number
Kirsten Rueggf56003462
Kristen Rueggf56002223

The moral of the story is this: When using the distinct keyword, be sure that you aren’t inadvertently removing valid data!

Advertisement

Comparing Distinct to Group By

Using distinct is logically equivalent to using group by on all selected columns with no aggregate function. For such a query, group by just produces a list of distinct grouping values. When displaying and grouping by a single column, the query produces the distinct values in that column. However, if you display and group by multiple columns, the query produces the distinct combinations of values in each column. For example, the following query produces the same set of rows as our first SELECT distinct did:

SELECT name 
FROM fruits 
GROUP BY name;

Similarly, the following statement produces the same results as our SELECT distinct did on the employees table:

SELECT name,
       gender,       
       salary 
FROM employees
GROUP BY name;

A difference between distinct and group by is that group by causes row sorting. Hence:

SELECT name,
       gender,       
       salary 
FROM employees
GROUP BY name;

…is the same as:

SELECT DISTINCT name,
                gender,       
                salary 
FROM employees
ORDER BY name;

Counting Duplicates

Distinct can be used with the COUNT() function to count how many distinct values a column contains. COUNT(distinct expression) counts the number of distinct (unique) non-NULL values of the given expression. The expression can be a column name to count the number of distinct non-NULL values in the column.

Here’s the full employee table data:

iddept_idgendernamesalaryemp_id_number
12mJon Simpson45001234
24fBarbara Breitenmoser(NULL)9999
33fKirsten Ruegg56003462
41mRalph Teller51006543
52mPeter Jonson52009747
62mAllan Smithie49006853
74fKirsten Ruegg56002223
83fKirsten Ruegg44002765

Applying the Count distinct function on the name field produces six unique names:

SELECT Count(DISTINCT name)
FROM employees;
Count(DISTINCT name)
6

It’s also possible to give a list of expressions separated by commas. In this case, COUNT() returns the number of distinct combinations of values that contain no NULL values. The following query counts the number of distinct rows for which neither the name nor salary is NULL:

SELECT Count (DISTINCT name, salary)
FROM employees;
Count(DISTINCT name, salary)
6

You can also group counts of duplicates per group using a bit of math in conjunction with the group by clause. Here’s a query to count duplicated names for each department:

SELECT dept_id, 
       COUNT(*) - COUNT(DISTINCT name) AS 'duplicate names'
FROM   employees 
GROUP BY dept_id;
dept_idduplicate names
10
20
31
40

These queries help you characterize the extent of duplicates, but don’t show you which values are duplicated. To see which names are duplicated in the employees table, use a summary query that displays the non-unique values along with the counts:

   SELECT dept_id, 
          name, 
          count(name) as name_count
   FROM   employees 
   GROUP BY name, 
            dept_id;
dept_idnamename_count
2Allan Smithie1
4Barbara Breitenmoser1
2Jon Simpson1
3Kirsten Ruegg2
4Kirsten Ruegg1
2Peter Jonson1
1Ralph Teller1

Since we’re only interested in duplicates, we can filter out everything else using the HAVING clause. It’s like a WHERE clause, except that it’s used with group by to narrow down the results:

SELECT dept_id, 
       name, 
       count(name) as name_count
FROM   employees 
GROUP BY name, 
       dept_id
HAVING name_count > 1;

Now we can see which names are duplicated, as well as how many there are:

dept_idnamename_count
3Kirsten Ruegg2
Advertisement

Displaying Per-Group Minimum or Maximum Values in Duplicated Rows

As we saw in the last example, the group by clause causes aggregate functions to be applied for each unique value in the field list. You should be aware that columns that are not in the group by field list do not necessarily belong to the same row as the aggregated values! An example is definitely in order here. The following query displays the highest salary for each department:

SELECT dept_id,
       name,
       gender,       
       max(salary) as max_salary 
FROM   employees
GROUP BY dept_id;

The intention is to also display information about the individual who earns the highest salary. However, that is not what is returned here:

dept_idnamegendermax_salary
1Ralph Tellerm5100
2Jon Simpsonm5200
3Kirsten Rueggf5600
4Barbara Breitenmoserf5600

The problem is that the salary is the only aggregated field because the Max() aggregate function is applied to it. Consequently, the first name and gender values encountered for each group by field are what are displayed. Looking at the table, you’ll see that, while Ralph Teller is the only member of department 1, Jon Simpson only earned $4500. Peter Jonson is really the owner of that distinction, but the query engine selected the first name and gender that it came across having a dept_id of 2.

The solution is to join the GROUP_BY results with the original table using the grouped fields. In this case, we only have one field, and that is the salary:

SELECT emp2.dept_id, 
       emp1.name, 
       emp1.gender, 
       emp2.max_salary
FROM (
  SELECT dept_id,       
         Max(salary) as max_salary 
  FROM   employees 
  GROUP BY dept_id
) as emp2 JOIN employees as emp1 ON emp1.salary = emp2.max_salary
GROUP BY dept_id;

Now the name and gender fields belong to the earner of the greatest salary:

dept_idnamegendermax_salary
1Ralph Tellerm5100
2Peter Jonsonm5200
3Kirsten Rueggf5600
4Kirsten Rueggf5600

There are other techniques that were not covered, such as the use of temporary tables and dynamic SQL. Here is more in-depth information on removing duplicate records. This article discusses the group by and HAVING clauses in more detail.

» See All Articles by Columnist Rob Gravelle

Robert Gravelle

Rob Gravelle resides in Ottawa, Canada, and has been an IT guru for over 20 years. In that time, Rob has built systems for intelligence-related organizations such as Canada Border Services and various commercial businesses. In his spare time, Rob has become an accomplished music artist with several CDs and digital releases to his credit.

Database Journal Logo

DatabaseJournal.com publishes relevant, up-to-date and pragmatic articles on the use of database hardware and management tools and serves as a forum for professional knowledge about proprietary, open source and cloud-based databases--foundational technology for all IT systems. We publish insightful articles about new products, best practices and trends; readers help each other out on various database questions and problems. Database management systems (DBMS) and database security processes are also key areas of focus at DatabaseJournal.com.

Property of TechnologyAdvice. © 2026 TechnologyAdvice. All Rights Reserved

Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.