Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Tips Database Forum Rss Feed

» Database Journal Home
» Database Articles
» Database Tutorials
MS Access
SQL Scripts & Samples
» Database Forum
» Slideshows
Free Newsletters:

News Via RSS Feed

Rss Feed

Database Journal |DBA Support |SQLCourse |SQLCourse2

Featured Database Articles


Posted September 8, 2016

Selecting the Top N Results by Group in MySQL

By Rob Gravelle

The GROUP BY clause is ideal for breaking up aggregate functions into groups. However, it’s not so great for displaying multiple rows within each group that meet a given criteria.  A common requirement is to fetch the top N rows of each category, for example, largest cities for each country.  A quick Google search will affirm its popularity.  Problem is, when one looks at typical forum responses, very few answers are satisfactory, or utilize non-MySQL vendor-specific solutions.  For those reasons, I’ve compiled a couple of answers that do hit the mark.  Hopefully they’ll benefit you as well.

The Problem

Take a look at the following sample of city data, pulled from the World Database:

country_code    city                 population id
ad              Andorra la Vella     20430      7
ad              Canillo              3292       21
ad              Encamp               11224      33
ad              La Massana           7211       50
ad              Les Escaldes         15854      54 
pa              Achutupo             1753       2053433
pa              Agua Buena           1069       2053450
pa              Aguadulce            17238      2053470
pa              Ailigand             1652       2053491
pa              Alanje               1361       2053494 

I wanted to produce a data set containing the Top 10 cities for the US and Canada based on population.  Grouping by country_code is easy enough; all that’s required is an ORDER BY clause:

SELECT city, population, country_code 
FROM   cities 
WHERE (country_code = 'us' OR country_code = 'ca')
ORDER BY country_code, population DESC;

The problem with this statement is that it does nothing to limit the number of rows returned for each country code.  On the other hand, the LIMIT clause would apply to the entire result set, thereby giving unfair preference to the first country – Canada.

Almost, but not Quite...

We can remedy the above issue by first selecting the largest cities, without regards to country, and limit those results to the top N rows.  The outer SELECT then sorts by country_code and population (in descending order).

SELECT city, population, country_code 
  SELECT city, population, country_code 
  FROM   cities 
  WHERE (country_code = 'us' OR country_code = 'ca')
  ORDER BY population DESC
  Limit 20
) largest_cities
ORDER BY country_code, population DESC;

city            population     country_code
Toronto         4612187         ca
Montreal        3268513         ca
Vancouver       1837970         ca
Calgary         968475          ca
Ottawa          874433          ca
Edmonton        822319          ca
New York        8107916         us
Los Angeles     3877129         us
Chicago         2841952         us
Houston         2027712         us
Philadelphia    1453268         us
Phoenix         1428509         us
San Diego       1287050         us
San Antonio     1256810         us
Dallas          1211704         us
San Jose        897460          us
Detroit         884941          us
Jacksonville    797557          us
Indianapolis    773283          us
Columbus        736836          us

While a step in the right direction, the drawback of these results is that they skew towards the more populous country – the US – with only six of twenty rows being allocated to Canada.

One Solution that Works: UNION with LIMIT

The issue that we’re having with LIMIT is that it applies to the entire result set.  In that case, what if we were to amalgamate one or more result sets using a UNION?  That would allow us to LIMIT each country’s cities to the Top N:

  SELECT city, population, country_code
  FROM   cities 
  WHERE  country_code = 'ca'
  ORDER BY population DESC
  LIMIT 10
  SELECT city, population, country_code
  FROM   cities 
  WHERE  country_code = 'us'
  ORDER BY population DESC
  LIMIT 10

Sure enough, we now have a list of each country’s top 10 cities according to population:

city            population     country_code


Toronto         4612187         ca
Montreal        3268513         ca
Vancouver       1837970         ca
Calgary         968475          ca
Ottawa          874433          ca
Edmonton        822319          ca
Hamilton        653637          ca
Quebec          645623          ca
Winnipeg        632069          ca
Kitchener       409111          ca
New York        8107916         us
Los Angeles     3877129         us
Chicago         2841952         us
Houston         2027712         us
Philadelphia    1453268         us
Phoenix         1428509         us
San Diego       1287050         us
San Antonio     1256810         us
Dallas          1211704         us
San Jose        897460          us

Note that the limiting can either be symmetrical so that each city has the same number of rows, OR asymmetrical, so that different cities receive different numbers of rows, such as 10/20.

Of course, this is not a silver bullet.  It doesn’t take much imagination to envision the nightmare of applying this statement to hundreds of cities!  We need something that has built-in scalability so that it works with two or two hundred countries. 

A Better Solution: Ranking

SQL Server has the ROW_NUMBER() function, which returns the sequential number of a row within a partition of a result set, starting at 1 for the first row in each partition:

    OVER ( [ PARTITION BY value_expression , ... [ n ] ] order_by_clause ) D

Using the above function, we could assign a sequence number to each city so that we could select those that are ranked above a given number:

SELECT city, population, country_code
   SELECT city, population, country_code,
          ROW_NUMBER() over(partition by countries.country_code 
                            order by     countries.country_code, 
                                         cities.population DESC) RowNum
   FROM countries 
     INNER JOIN cities city on countries.country_code = cities.country_code
 ) tbl
 WHERE RowNum <= 10;

Now that’s IF we were using SQL Server.  Currently MySQL does not support the ROW_NUMBER() function, but as a workaround we can use MySQL session variables.  Session variables do not require declaration, and can be used in a query to perform calculations and to store intermediate results. 

The following code is executed for each row.  If @current_country is the same, we increment the @country_rank, otherwise, we re-initialize it to 1.  For the first row @current_country is NULL, so the rank is set to 1 on the first iteration.  The current country_code is then stored in the @current_country variable for the next evaluation:

@country_rank    := IF(@current_country = country_code, @country_rank + 1, 1)
 @current_country := country_code

For correct ranking, we need to include the “ORDER BY country_code, population DESC” clause.

Once we’ve assigned a rank number to each city within its country, we can retrieve the required range of 1 to 10:

SELECT city, population, country_code
   SELECT city, population, country_code,
   @country_rank := IF(@current_country = country_code, 
                         @country_rank + 1, 
                      ) AS country_rank,
   @current_country := country_code
   FROM cities
   WHERE country_code = 'us' OR country_code = 'ca'
   ORDER BY country_code, population DESC
 ) ranked
 WHERE country_rank <= 10;

Like the previous statement, this solution also supports asymmetrical limiting by altering the WHERE clause:

WHERE country_code = 'ca' AND country_rank <= 10
 OR    country_code = 'us' AND country_rank <= 20;

Another benefit of this approach is that we can now retrieve cities by their ranking.  For example, we can fetch the fifth largest city in Canada by setting the country_rank in the WHERE filter to 5:

WHERE country_code = 'ca' AND country_rank = 5


If there’s one lesson that I’ve learned from working with MySQL over the years it’s that when it comes to queries, if there’s a will, there’s usually a way.  Do you have another solution to offer?  By all means, put it in a comment.  I’d love to see what you come up with!

See all articles by Rob Gravelle

MySQL Archives

Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM