Deleting All But the Most
Current Record
In this example, I will be
showing you some code that will delete the oldest record. Let's say you have
an application that has a business rule that requires that you don't keep more
than a single transaction for each customer in the cust_trans table. When the
original application was built, this was not a business rule, but due to the volume
of customer transactions this single transaction rule is now required. Rather
than go back to every program that appends customer transactions to the cust_tran
table, it was decided that a nightly process would be run to remove the
duplicate transactions for each customer, so that only the most recent
transaction remains in the table.
Now say your current cust_tran
table has the following records.
cust_id tran_date prod_id qty unit_price
----------- ---------------- ----------- ----------- ----------
1 2003-07-01 00:00 98 1 9.99
2 2003-07-01 00:00 22 2 19.99
3 2003-07-01 00:00 18 1 5.99
1 2003-07-02 00:00 7 1 89.99
2 2003-07-01 01:01 15 1 2.99
3 2003-07-02 00:00 1 10 .99
4 2003-07-01 00:00 18 4 5.99
5 2003-07-02 00:00 98 1 9.99
The tran_date column will be
used to determine which record is the most recent entry in the table for each by
cust_id. In this example cust_id 1, 2, and 3 each have multiple records. To
remove the duplicate records for cust_Id 1, 2, and 3 the following logic can be
used.
delete from a
from cust_tran a join
(select cust_id, max(tran_date) max_tran_date from cust_tran
group by cust_id
having count(*) > 1) b
on a.cust_id = b.cust_id and a.tran_date < b.max_tran_date
This code uses a sub-query
and a join to identify those records that need to be deleted. The sub-query
identifies the key values for most recently entered records for customers that
have duplicate records. The rows returned from this sub-query are then joined
with the records in the cust_tran table for each cust_tran that has a tran_date
less than the most recently entered transaction for that cust_id. Upon
completion of this delete statement, there will be no more than a single cust_tran
record for each customer, even if the customer had more than two duplicate
records to start with.
Deleting All the True
Duplicates
As stated early, deleting
true duplicates is the process of removing all but one of the duplicate
records. No decision will be needed to determine which one to delete, since
all records contain exactly the same information. I will show you a couple of
ways to accomplish deleting true duplicates.
In the first delete example,
I will add an identity column to my table containing duplicates. The identity
column will allow me to assign a unique integer value to each row, thus allowing
me to distinguish one duplicate row from the next, and therefore allowing me to
delete all but one of the duplicates. Here is an example on how I used the identity
column to remove duplicates from our earlier vehicles table example.
alter table vehicles
add seq_num int identity
go
delete from a
from vehicles a join
(select make, model, color, max(seq_num) max_seq_num from vehicles
group by make, model, color
having count(*) > 1) b
on a.make = b.make and
a.model = b.model and
a.color = b.color and
a.seq_num < b.max_seq_num
go
alter table vehicles
drop column seq_num
Now another method of
deleting true duplicates is to create a temporary table that will contain all
the original table rows. Next delete all the rows in the original table and
re-populate it, using an INSERT INTO with a SELECT DISTINCT statement. Here
is how I would remove the true duplicates from the vehicles table using this
method.
create table #vehicles (make char(10),
model char(10),
color char(10))
insert into #vehicles select * from vehicles
truncate table vehicles
insert into vehicles select distinct * from #vehicles
drop table #vehicles
Making each record unique
by adding another column to the duplicate key
In our vehicle file example,
we have duplicate records because our duplicate key, which consists of make,
model and color, does not uniquely identify a single record. Four of the
records have just two unique key values. If we alter the table to add a dupnum
column and then populate it with a numerical key value from 1 to N for each key
value, where N is the number of duplicates per key value, then we will be able
to uniquely identify each record. Here is the code necessary to add the dupnum
column and populate it with the proper dupnum value for each key value.
alter table vehicles add seq_num int identity
alter table vehicles add dupnum int
go
update vehicles
set dupnum = rowno
from vehicles d,
(select (select count(*)
from vehicles a
where a.make = b.make and
a.model=b.model and
a.color=b.color and
a.seq_num <= b.seq_num
) as rowno,
b.model,b.make,b.color,seq_num
from vehicles b) c
where c.make = d.make and c.model = d.model and c.color = d.color and c.seq_num=d.seq_num
go
alter table vehicles drop column seq_num
If you review this code, you
will notice that I first add an identity column called seq_num to my vehicles
table. This will allow me to identify one duplicate column from another. In
addition, I added a new column called dupnum to the vehicles table. The dupnum
column is updated by determining the number of values for each key. If there
are no duplications for the key this column is updated with a 1. If there are
duplicate records for a given key value then this column is populated with a 1
for the first duplicate record, 2 for the second duplicate record, 3 for the
third duplicate record and so on. To calculate the sequential number values, I
used a correlated sub-query that determines the number (count(*)) of records
that have a seq_num less than or equal to the current record being processed
that has the same key value.
Conclusion
Duplicate records do come up
occasionally and knowing how to find and deal with the duplicates is important.
The examples I have shown for dealing with duplicates are simple ones. Most likely,
your environment will have duplicate situations that are a little more
complex. I hope that these examples can be tailored to your environment to help
you resolve most of your duplicate record situations.
»
See All Articles by Columnist Gregory A. Larsen