Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Slideshows
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MySQL

Posted November 10, 2016

Listing Duplicate Values by Group

By Rob Gravelle

A Case Study of Cities

Without question, one of the most common tasks performed by Database Administrators (DBAs) is identifying and weeding out duplicate values in tables. The trouble is, what constitutes a duplicate is not universal across datasets.  Despite the inordinate number of queries written by other DBAs to locate duplicate values in their database tables, the real challenge is in locating a useable SQL statement to go by.  Even if you do come across a similar query, it can be a challenge to adapt it to your particular situation.  Therefore, I’d like to present a few solutions that took me a while to find how-tos on.  With any luck, they’ll save you some time down the road.

The Test Data

As in the Selecting the Top N Results by Group in MySQL tutorial, we’ll continue to use the cities table of the World Database.  The accompanying link will bring you to the Other MySQL Documentation page, where you’ll find the World Database under the “Example Databases” heading.  It’s a great test table because it contains very raw data, including lots of nulls, duplicates, typos, and other unsanitized entries!  The downside is that it is VERY large.  Therefore, it is imperative that you create indexes on any fields that you plan on searching.  Otherwise, you may wait a very long time for your queries to complete.

The World Database cities table contains the country_code, city, accent_city, region_code, population, latitude, longitude, and id as follows:

ad, aixas,              Aixàs,              06, 0,     42.483334, 1.466667, 1
ad, aixirivali,         Aixirivali,         06, 0,     42.466667, 1.500000, 2
ad, aixirivall,         Aixirivall,         06, 0,     42.466667, 1.500000, 3
ad, aixirvall,          Aixirvall,          06, 0,     42.466667, 1.500000, 4
ad, aixovall,           Aixovall,           06, 0,     42.466667, 1.483333, 5
ad, andorra,            Andorra,            07, 0,     42.500000, 1.516667, 6
ad, andorra la vella,   Andorra la Vella,   07, 20430, 42.500000, 1.516667, 7
ad, andorra-vieille,    Andorra-Vieille,    07, 0,     42.500000, 1.516667, 8
ad, andorre,            Andorre,            07, 0,     42.500000, 1.516667, 9
ad, andorre-la-vieille, Andorre-la-Vieille, 07, 0,     42.500000, 1.516667, 10 

etc...

A Basic Select Statement

An easy way to find duplicate values in a table is to group the results according to the field that you’re looking at.  In practical terms, this is accomplished using the MySQL GROUP BY clause in conjunction with HAVING.  By filtering the field values to those that have a Count that is greater than one, our results will be limited to values that are duplicated:

SELECT my_column, COUNT(*) as count
FROM my_table
GROUP BY my_column
HAVING COUNT(*) > 1;

Applying the above statement to the World Database cities table, we would find duplicate cities as follows:

SELECT city, COUNT(*) as duplicate_cities
FROM cities 
GROUP BY city
HAVING duplicate_cities > 1 
AND city != '';  -- don’t include blank values

Here are the first several rows of the result set:

country_code, city,       duplicate_cities
al,           a,          92
dk,           aas,        11
af,           ab barik,   14
ir,           ab garm,    13
af,           ab-e barik, 11
ir,           ab-e garm,  12
af,           aba,        25
ph,           abaca,      16
az,           abad,       14
bj,           abala,      14
bn,           abang,      11
af,           abay        17
ar,           abbott,     11
az,           abdal,      11
dz,           abdi,       11
ci,           abe,        13
ao,           abel,       15
ag,           aberdeen,   29
etc...

Right off the top, there should be something nagging at you that there seems to be an extraordinary amount of redundancy in this table.

Searching for “Aberdeen” returns some telling information.

>select * from cities where city = "aberdeen";

country_code, city,     accent_city, region_code, population, latitude,   longitude,   id
 ag,           aberdeen, Aberdeen,    04,          0,          17.066668,  -61.816666,  89288
 au,           aberdeen, Aberdeen,    02,          0,          -32.165878, 150.890030,  150888
 ca,           aberdeen, Aberdeen,    11,          0,          52.316666,  -106.283333, 346352
 gb,           aberdeen, Aberdeen,    T5,          183791,     57.133331,  -2.100000,   986530
 gb,           aberdeen, Aberdeen,    T6,          0,          57.166668, - 2.666667,   986531
 etc...

Sure enough, there are 29 instances of it in the table.  However, the other fields reveal that these are different Aberdeens, within different regions and countries.  The lesson here is that it’s vitally important to identify what makes a value unique.  It’s not just the ID.

Grouping by Another Column

For a more detailed output, it’s often useful to group by an aggregate’s parent column, such as an employee’s section or company, a specific fruit’s food group, or, as in this case, a city’s parent country (country_code). 

SELECT country_code,

       COUNT(city) AS city_duplicates_for_country
FROM cities
GROUP BY country_code 
HAVING city_duplicates > 1;

This will count the number of duplicate cities per country:

country_code,
city_duplicates_for_country
-------------------------------------------------------
kr,           1222
pk,           3900
eg,           2121
az,           1301
dk,           443
etc...
la,           79
hu,           855
br,           2373
st,           11
id,           3822
cn,           2350
td,           1175
lb,           1342
sy,           2430
etc...

Generating a More Detailed Result Set

While this updated statement does tell us more about the distribution of duplicates amongst countries, it does not tell us which cities are duplicated.  To obtain that information, we need to add the city field to the field list AS WELL AS the GROUP BY clause:

SELECT
country_code, 
       city,
       COUNT(city) AS city_duplicates_for_country
FROM cities
GROUP BY country_code, city
HAVING city_duplicates > 1;

Note that omitting the city from the GROUP BY clause will produce misleading results, as the duplicate count will still apply to the entire country even though a specific city will be listed.  This occurs because, unlike most other relational databases, MySQL permits the mixing of aggregate and non-aggregate values in a single SELECT.  In order to do that, it displays the first city for that country, along with the aggregated data (the count).  While there may indeed be some practical applications for this behaviour, this is not one of them!

In the next result set, you can be sure that there are not 1222 duplicates for “aaba kebire” in one country.  In fact, because MySQL chooses the first city for that country, it will likely not be a duplicate at all!

country_code,
city,         city_duplicates_for_country
-------------------------------------------------------
kr,           a,            1222
pk,           a,            3900
eg,           a,            2121
az,           a,            1301
dk,           a,            443
etc...
la,           a alao,       79
hu,           a-akna,       855
br,           a. alvarenga, 2373
st,           a. andrade,   11
id,           aa,           3822
cn,           aa,           2350
td,           aaba,         1175
lb,           aaba,         1342
sy,           aaba kebire,  2430
etc...

Compare that to the correct data:

country_code,
city,            city_duplicates_for_country
----------------------------------------------------------
al,           a,               4
az,           a,               12
cn,           a,               2
lt,           a,               6
se,           a,               2
mk,           a,               5
no,           a,               3
pk,           a,               2
ru,           a,               52
us,           a country place, 2
pt,           a dos francos,   2
no,           a i afjord,      2
etc...

The second result set contains very different cities and counts.

Conclusion

As we saw here today, there’s a lot more to finding duplicates than searching for multiple instances of a column value.  It is often necessary to break down counts by two-or-more columns in order to determine the number of true duplicates. With regards to cities, think of the Simpsons: they live in Springfield, but which Springfield???

See all articles by Rob Gravelle



MySQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM