Dealing with MS SQL Tables that contain Duplicate Rows
July 16, 2003
Every so often, you might have to deal with tables that contain duplicate rows. In one case, you might only need to identify the duplicate rows. In other cases, you might need to remove the duplicate rows. This article will show you some different techniques for dealing with duplicate rows.
Your interpretation of a duplicate row might mean something different then the meaning to someone else. A duplicate row could mean that the data that makes up the key for each row is duplicated on more than one row. Another definition might be every column in a row matches with another row in the same table. I will show you how to find the duplicate rows in both of these cases.
Identifying Duplicate Key Values
One of the first things you will need consider when dealing with duplicates is how to identify the rows that contain duplicate values. To identify duplicates, you need to determine what is considered a duplicate record.
For the purpose of my first example, I will be using the sales table in the pubs database. The sales table has the following columns: stor_id, ord_num, ord_date, qty, payterms, title_id. For this example, let's say a duplicate record is any record that has the same stor_id and ord_num. To identify duplicates sales records I would use the following code:
select stor_id, ord_num , count(*) as 'Num of Dups' from pubs.dbo.sales group by stor_id, ord_num having count(*) > 1
When this code is run I get the following results:
stor_id ord_num Num of Dups ------- -------------------- ----------- 7067 P2121 3 7131 P3087a 4
In this example, I used the HAVING clause in conjunction with the GROUP BY to identify the number of duplicate records. In this case, I only showed the key values (stor_id, and ord_num) associated with the duplicate records along with the number of duplicates. Now in reality these are not duplicate records if you looked at all the columns in the table, but since I said we were only looking at stor_id and ord_num, these are considered duplicates.
Displaying Rows for Records that have Duplicate Key Values
Now my previous example only displayed the duplicate key values. I assume some of the time you may want to display all the duplicate records associated with duplicate key values. Here is some code that will display all of the rows for each duplicate key value.
select cast(a.stor_id as char(4)) as stor_id, cast(a.ord_num as char(6)) as ord_num, convert(char(8),a.ord_date,110) as ord_date, a.qty, a.payterms, a.title_id from pubs.dbo.sales a join (select stor_id, ord_num from pubs.dbo.sales group by stor_id, ord_num having count(*) > 1) b on a.stor_id = b.stor_id and a.ord_num = b.ord_num order by a.stor_id, a.ord_num
For this example, I took my previous TSQL command that returned the duplicate keys (stor_id and ord_num) and joined it with the pubs.dbo.sales table. The join criteria allow TSQL to only return rows in the pubs.dbo.sales table that match the duplicate key values. The following output is displayed in the Query Analyzer results pane when I run this query:
stor_id ord_num ord_date qty payterms title_id ------- ------- -------- ------ ------------ -------- 7067 P2121 06-15-19 40 Net 30 TC3218 7067 P2121 06-15-19 20 Net 30 TC4203 7067 P2121 06-15-19 20 Net 30 TC7777 7131 P3087a 05-29-19 20 Net 60 PS1372 7131 P3087a 05-29-19 25 Net 60 PS2106 7131 P3087a 05-29-19 15 Net 60 PS3333 7131 P3087a 05-29-19 25 Net 60 PS7777