Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MS SQL

Posted Jul 16, 2003

Dealing with MS SQL Tables that contain Duplicate Rows - Page 3

By Gregory A. Larsen

Deleting All But the Most Current Record

In this example, I will be showing you some code that will delete the oldest record. Let's say you have an application that has a business rule that requires that you don't keep more than a single transaction for each customer in the cust_trans table. When the original application was built, this was not a business rule, but due to the volume of customer transactions this single transaction rule is now required. Rather than go back to every program that appends customer transactions to the cust_tran table, it was decided that a nightly process would be run to remove the duplicate transactions for each customer, so that only the most recent transaction remains in the table.

Now say your current cust_tran table has the following records.

cust_id     tran_date        prod_id     qty         unit_price 
----------- ---------------- ----------- ----------- ---------- 
1           2003-07-01 00:00 98          1           9.99
2           2003-07-01 00:00 22          2           19.99
3           2003-07-01 00:00 18          1           5.99
1           2003-07-02 00:00 7           1           89.99
2           2003-07-01 01:01 15          1           2.99
3           2003-07-02 00:00 1           10          .99
4           2003-07-01 00:00 18          4           5.99
5           2003-07-02 00:00 98          1           9.99

The tran_date column will be used to determine which record is the most recent entry in the table for each by cust_id. In this example cust_id 1, 2, and 3 each have multiple records. To remove the duplicate records for cust_Id 1, 2, and 3 the following logic can be used.

delete from a
from cust_tran a join 
     (select cust_id, max(tran_date) max_tran_date from cust_tran 
             group by cust_id 
             having count(*) > 1) b
      on a.cust_id = b.cust_id and a.tran_date < b.max_tran_date

This code uses a sub-query and a join to identify those records that need to be deleted. The sub-query identifies the key values for most recently entered records for customers that have duplicate records. The rows returned from this sub-query are then joined with the records in the cust_tran table for each cust_tran that has a tran_date less than the most recently entered transaction for that cust_id. Upon completion of this delete statement, there will be no more than a single cust_tran record for each customer, even if the customer had more than two duplicate records to start with.

Deleting All the True Duplicates

As stated early, deleting true duplicates is the process of removing all but one of the duplicate records. No decision will be needed to determine which one to delete, since all records contain exactly the same information. I will show you a couple of ways to accomplish deleting true duplicates.

In the first delete example, I will add an identity column to my table containing duplicates. The identity column will allow me to assign a unique integer value to each row, thus allowing me to distinguish one duplicate row from the next, and therefore allowing me to delete all but one of the duplicates. Here is an example on how I used the identity column to remove duplicates from our earlier vehicles table example.

alter table vehicles 
  add seq_num int identity
go
delete from a
from vehicles a join 
     (select make, model, color, max(seq_num) max_seq_num from vehicles 
             group by make, model, color
             having count(*) > 1) b
      on a.make = b.make and 
         a.model = b.model and
         a.color = b.color and 
         a.seq_num < b.max_seq_num
go 
alter table vehicles
 drop column seq_num

Now another method of deleting true duplicates is to create a temporary table that will contain all the original table rows. Next delete all the rows in the original table and re-populate it, using an INSERT INTO with a SELECT DISTINCT statement. Here is how I would remove the true duplicates from the vehicles table using this method.

create table #vehicles (make  char(10), 
                        model char(10), 
                        color char(10))
insert into #vehicles select * from vehicles
truncate table vehicles
insert into vehicles select distinct * from #vehicles
drop table #vehicles

Making each record unique by adding another column to the duplicate key

In our vehicle file example, we have duplicate records because our duplicate key, which consists of make, model and color, does not uniquely identify a single record. Four of the records have just two unique key values. If we alter the table to add a dupnum column and then populate it with a numerical key value from 1 to N for each key value, where N is the number of duplicates per key value, then we will be able to uniquely identify each record. Here is the code necessary to add the dupnum column and populate it with the proper dupnum value for each key value.

alter table vehicles add seq_num int identity
alter table vehicles add dupnum int	
go
 
update vehicles 
set dupnum = rowno 
from vehicles d, 
     (select (select count(*) 
        	from vehicles a
        	where a.make = b.make and
                      a.model=b.model and 
                      a.color=b.color and 
                      a.seq_num <= b.seq_num
			) as rowno, 
          b.model,b.make,b.color,seq_num
       
       from vehicles b) c 
  where c.make = d.make and c.model = d.model and c.color = d.color and c.seq_num=d.seq_num
go
alter table vehicles drop column seq_num

If you review this code, you will notice that I first add an identity column called seq_num to my vehicles table. This will allow me to identify one duplicate column from another. In addition, I added a new column called dupnum to the vehicles table. The dupnum column is updated by determining the number of values for each key. If there are no duplications for the key this column is updated with a 1. If there are duplicate records for a given key value then this column is populated with a 1 for the first duplicate record, 2 for the second duplicate record, 3 for the third duplicate record and so on. To calculate the sequential number values, I used a correlated sub-query that determines the number (count(*)) of records that have a seq_num less than or equal to the current record being processed that has the same key value.

Conclusion

Duplicate records do come up occasionally and knowing how to find and deal with the duplicates is important. The examples I have shown for dealing with duplicates are simple ones. Most likely, your environment will have duplicate situations that are a little more complex. I hope that these examples can be tailored to your environment to help you resolve most of your duplicate record situations.

» See All Articles by Columnist Gregory A. Larsen



MS SQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MS SQL Forum
Topic By Replies Updated
SQL 2005: SSIS: Error using SQL Server credentials poverty 3 August 17th, 07:43 AM
Need help changing table contents nkawtg 1 August 17th, 03:02 AM
SQL Server Memory confifuration bhosalenarayan 2 August 14th, 05:33 AM
SQL Server Primary Key and a Unique Key katty.jonh 2 July 25th, 10:36 AM


















Thanks for your registration, follow us on our social networks to keep up-to-date