Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS Access
SQL Scripts & Samples
» Database Forum
» Slideshows
Free Newsletters:

News Via RSS Feed

Database Journal |DBA Support |SQLCourse |SQLCourse2

Featured Database Articles


Posted Jul 17, 2003

Dealing with MS SQL Tables that contain Duplicate Rows - Page 2

By Gregory A. Larsen

Identifying truly duplicate Rows

My definition of a truly duplicate row is a row where the column values of the row have exactly the same values for each column with another record in the same table. To identify true duplicate records you will need to have every column in the table in a group by clause. Let me run through an example that shows true duplicates.

Say you have an auto sales company. The sales people always want to know how many vehicles you have on your lot that are exactly the same. Basically, duplicate records in your vehicles file. Let's say your vehicles file looks like this:

make       model      color      
---------- ---------- ---------- 
Ford       Ranger     Red       
Ford       Ranger     Green     
Toyota     Altima     Red       
Toyota     Altima     Green     
Toyota     Altima     Red       
Toyota     Altima     Blue      
Toyota     Altima     Blue      
Plymouth   Voyager    Blue      
Plymouth   Voyager    White     
Plymouth   Voyager    Green     
Plymouth   Voyager    Red       
Plymouth   Voyager    Gold      

To find the duplicate records I could use the following code. This code is a little more complicated than it needs to be for this example. This complexity is because I built this code to be dynamic, so it will work on any table you specify. This code queries the information_schema.columns view to return the column names for all columns in the specified table. The code uses the column names to dynamically build the select statement. Writing the code this way allows the code to be used for finding true duplicates in any table. By changing the value of the @TABLE variable this code will find true duplicates in whichever table you specify.

declare @cmd varchar(4000)
declare @table varchar(100)
declare @curr_col varchar(100)
declare @old_col varchar(100)
declare @column_names varchar(4000)
-- Set the table to look for duplicates
set @table = 'vehicles'
set @curr_col = ''
-- Get name of first column 
select top 1 @curr_col=column_name   
  from information_schema.columns 
  where table_name = @table order by column_name
set @column_names = @curr_col
set @old_col = @curr_col
-- Get name of second column
select top 1 @curr_col=column_name 
  from information_schema.columns 
  where table_name = @table 
        column_name > @old_col 
  order by column_name
-- Process all columns
while @curr_col <> @old_col
  set @column_names = rtrim(@column_names) + ',' + rtrim(@curr_col)
  set @old_col = @curr_col
  -- Get next column 
  select top 1 @curr_col=column_name 
    from information_schema.columns 
    where table_name = @table 
          column_name > @old_col 
    order by column_name
-- build the command to search for duplicates
set @cmd = 'select * from ' + rtrim(@table) +
           ' group by ' + rtrim(@column_names) + 
           ' having count(*) > 1'
-- Find duplicates
exec (@cmd)
drop table vehicles

The results of the query show the sale people that following vehicles are identical:

make       model      color      
---------- ---------- ---------- 
Toyota     Altima     Blue      
Toyota     Altima     Red  

What to do with Duplicate Records

Now that you know there are duplicates in your table, the question is what to do with them. Depending on what caused the duplicates might indicate what will be done with them. In my vehicles example, the business logic allowed for duplicate records. Although for other situations possible duplicates should not be allowed, and you will need to decide what to do with the duplicates. I will take you through a few examples on what to do with duplicated records to eliminate the duplicates.

In order to delete duplicate records you need to determine which duplicate to delete. Now in the case where we have true duplicates, like my vehicle example above, you really don't have to determine which record to delete, since all duplicate rows are exactly the same. In my other example where the extent of duplication is only associated with the duplicate key values, you will need to determine which one to keep. Do you keep the first one, the last one? How do you identify the first and last one if there is no datetime field helping you determine which one is first and which one is last? Remember ordinal position in SQL Server can change, as records are deleted and added. Let's review a few examples that might help you when deleting duplicates in your environment.

MS SQL Archives

Latest Forum Threads
MS SQL Forum
Topic By Replies Updated
SQL 2005: SSIS: Error using SQL Server credentials poverty 3 August 17th, 07:43 AM
Need help changing table contents nkawtg 1 August 17th, 03:02 AM
SQL Server Memory confifuration bhosalenarayan 2 August 14th, 05:33 AM
SQL Server Primary Key and a Unique Key katty.jonh 2 July 25th, 10:36 AM