Identifying truly
duplicate Rows
My definition of a truly
duplicate row is a row where the column values of the row have exactly the same
values for each column with another record in the same table. To identify true
duplicate records you will need to have every column in the table in a group by
clause. Let me run through an example that shows true duplicates.
Say you have an auto sales
company. The sales people always want to know how many vehicles you have on
your lot that are exactly the same. Basically, duplicate records in your
vehicles file. Let's say your vehicles file looks like this:
make model color
---------- ---------- ----------
Ford Ranger Red
Ford Ranger Green
Toyota Altima Red
Toyota Altima Green
Toyota Altima Red
Toyota Altima Blue
Toyota Altima Blue
Plymouth Voyager Blue
Plymouth Voyager White
Plymouth Voyager Green
Plymouth Voyager Red
Plymouth Voyager Gold
To find the duplicate
records I could use the following code. This code is a little more complicated
than it needs to be for this example. This complexity is because I built this
code to be dynamic, so it will work on any table you specify. This code queries
the information_schema.columns view to return the column names for all columns
in the specified table. The code uses the column names to dynamically build
the select statement. Writing the code this way allows the code to be used for
finding true duplicates in any table. By changing the value of the @TABLE
variable this code will find true duplicates in whichever table you specify.
declare @cmd varchar(4000)
declare @table varchar(100)
declare @curr_col varchar(100)
declare @old_col varchar(100)
declare @column_names varchar(4000)
-- Set the table to look for duplicates
set @table = 'vehicles'
set @curr_col = ''
-- Get name of first column
select top 1 @curr_col=column_name
from information_schema.columns
where table_name = @table order by column_name
set @column_names = @curr_col
set @old_col = @curr_col
-- Get name of second column
select top 1 @curr_col=column_name
from information_schema.columns
where table_name = @table
and
column_name > @old_col
order by column_name
-- Process all columns
while @curr_col <> @old_col
begin
set @column_names = rtrim(@column_names) + ',' + rtrim(@curr_col)
set @old_col = @curr_col
-- Get next column
select top 1 @curr_col=column_name
from information_schema.columns
where table_name = @table
and
column_name > @old_col
order by column_name
end
-- build the command to search for duplicates
set @cmd = 'select * from ' + rtrim(@table) +
' group by ' + rtrim(@column_names) +
' having count(*) > 1'
-- Find duplicates
exec (@cmd)
drop table vehicles
The results of the query
show the sale people that following vehicles are identical:
make model color
---------- ---------- ----------
Toyota Altima Blue
Toyota Altima Red
What to do with Duplicate
Records
Now that you know there are
duplicates in your table, the question is what to do with them. Depending on
what caused the duplicates might indicate what will be done with them. In my
vehicles example, the business logic allowed for duplicate records. Although
for other situations possible duplicates should not be allowed, and you will
need to decide what to do with the duplicates. I will take you through a few
examples on what to do with duplicated records to eliminate the duplicates.
In order to delete duplicate
records you need to determine which duplicate to delete. Now in the case where
we have true duplicates, like my vehicle example above, you really don't have
to determine which record to delete, since all duplicate rows are exactly the
same. In my other example where the extent of duplication is only associated
with the duplicate key values, you will need to determine which one to keep. Do
you keep the first one, the last one? How do you identify the first and last
one if there is no datetime field helping you determine which one is first and
which one is last? Remember ordinal position in SQL Server can change, as
records are deleted and added. Let's review a few examples that might help you
when deleting duplicates in your environment.