The HAVING and GROUP BY SQL clauses

Introduction

This month’s topic is one
that often baffles inexperienced SQL coders. We will look at the GROUP BY
clause, and then the difference between conditions placed in the WHERE clause,
and the HAVING clause. I have used MySQL to test all of the examples, but most
of them should work without any problems on other DBMS’ too. We are going to be
using the following test data:

CREATE TABLE `writer` (
  `poet` varchar(50) default NULL,
  `anthology` varchar(40) default NULL,
  `copies_in_stock` tinyint(4) 
  		default NULL
);
 
INSERT INTO `writer` VALUES 
('Mongane Wally Serote','Tstetlo',3),
('Mongane Wally Serote',
	'No Baby Must Weep',8),
('Mongane Wally Serote',
	'A Tough Tale',2),
('Douglas Livingstone',
	 'The Skull in the Mud',21),
('Douglas Livingstone',
	'A Littoral Zone',2);

Note that
this data structure is not normalized, and is not recommended for a production
system – it is here as a quick and easy way to show you the concepts. If you cannot
see what is wrong with it, you need to read more about Database
Normalization
!

GROUP BY

Let’s start with the
GROUP BY clause. You can use this to group values from a column, and, if you
wish, perform calculations on that column. Let’s look at a simple example:

mysql> SELECT poet,SUM(copies_in_stock) FROM writer GROUP BY poet;
+----------------------+----------------------+
| poet                 | SUM(copies_in_stock) |
+----------------------+----------------------+
| Douglas Livingstone  |                   23 |
| Mongane Wally Serote |                   13 |
+----------------------+----------------------+

Here, there are many
records containing the same poet data (which is the reason why it is not
normalized). The GROUP BY function puts all of them together, allowing you to
perform a calculation, such as the SUM() used to find the total number of copies_in_stock.
Here is another example:


mysql> SELECT poet,copies_in_stock FROM writer GROUP BY poet;
+———————-+—————–+
| poet | copies_in_stock |
+———————-+—————–+
| Douglas Livingstone | 21 |
| Mongane Wally Serote | 3 |
+———————-+—————–+

Just what are the values
in the copies_in_stock column referring to? What do they mean? Perhaps
looking at the full list of records may shed some light.


mysql> SELECT * FROM writer;
+———————-+———————-+—————–+
| poet | anthology | copies_in_stock |
+———————-+———————-+—————–+
| Mongane Wally Serote | Tstetlo | 3 |
| Mongane Wally Serote | No Baby Must Weep | 8 |
| Mongane Wally Serote | A Tough Tale | 2 |
| Douglas Livingstone | The Skull in the Mud | 21 |
| Douglas Livingstone | A Littoral Zone | 2 |
+———————-+———————-+—————–+

The results are being
grouped by the field poet, but there is no sensible way to return a
value from the copies_in_stock field. Performing a mathematical
function, such as SUM() or AVG() would make sense, but what exactly is this
query asking of the database? I don’t know, and neither does MySQL, so it
simply returns the first result it finds (other DBMS’ may return different
results). The result returned means nothing, and may not be consistent, as it
depends upon the order the DBMS happens to have placed and found the record, so
there is no point in running a query like this.

Here is another example,
showing a number of common functions:


mysql> SELECT poet,
MAX(copies_in_stock) max,
MIN(copies_in_stock) min,
AVG(copies_in_stock) avg,
SUM(copies_in_stock) sum
FROM writer GROUP BY poet;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 2 | 11.5000 | 23 |
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+———+——+

The column titles have
been given aliases to make the output more readable (note that aliases more
often have the keyword AS in front of them, so this identical query may
be more familiar to some):


mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer GROUP BY poet;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 2 | 11.5000 | 23 |
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+———+——+

The HAVING clause

Beginners are commonly
confused about the difference between the WHERE and HAVING clauses. They are
both conditions, and look similar after all. To start with, let’s have a look
at these four queries:


mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer WHERE copies_in_stock > 5 GROUP BY poet;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 21 | 21.0000 | 21 |
| Mongane Wally Serote | 8 | 8 | 8.0000 | 8 |
+———————-+——+——+———+——+

mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer GROUP BY poet HAVING copies_in_stock > 5;
ERROR 1054 (42S22): Unknown column ‘copies_in_stock’ in ‘having clause’

The first query only
returns results where there are more than five copies in stock. Since each poet
only has one title that abides by this condition, the DBMS applies the
mathematical calculations to just that one record. Consequently, the results of
the MAX(), MIN(), AVG() and SUM() calculations are all the same. The second
query attempts to do the same, but using a HAVING clause instead of a WHERE
clause. It fails, because MySQL expects a column mentioned in a HAVING clause
to be present in the list of columns (see below for a discussion on changes in
this behavior with MySQL 5.0.2). This could include a function, so the
following query works:


mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer GROUP BY poet HAVING max > 5;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 2 | 11.5000 | 23 |
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+———+——+

Note that the results in
this case are identical, but this would not necessarily always be the case. Let
us add another record to illustrate, and then run the same two queries again:


mysql> INSERT INTO writer VALUES(‘Douglas Livingstone’,
‘Giovanni Jacopo Meditates on the High-IQ Haiku’,9);

mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer GROUP BY poet HAVING max > 5;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 2 | 10.6667 | 32 |
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+———+——+
2 rows in set (0.00 sec)

mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer WHERE copies_in_stock > 5 GROUP BY poet;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 9 | 15.0000 | 30 |
| Mongane Wally Serote | 8 | 8 | 8.0000 | 8 |
+———————-+——+——+———+——+

If you are unclear about
why this is the case, take some time to reflect on the differences. The first
query (with the condition in the HAVING clause) applies the condition after all
records have been returned. Since both of the records have a max greater
than 5, there is effectively no limiting of the results. The results are the
same as they would be without any conditions, as below:


mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer GROUP BY poet;
+———————-+——+——+———+——+
| poet | max | min | avg | sum |
+———————-+——+——+———+——+
| Douglas Livingstone | 21 | 2 | 10.6667 | 32 |
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+———+——+

The second query, with
the condition in the WHERE clause, applies the condition before starting to
calculate the results of the functions. Therefore, for Douglas Livingstone,
only two results match the condition, and the functions then go to work on this
limited set of results. Have a look at the following results, which illustrate
how the WHERE clause determines the output.


mysql> SELECT * FROM writer WHERE copies_in_stock > 5;
+———————-+——————————————+—————–+
| poet | anthology | copies_in_stock |
+———————-+——————————————+—————–+
| Mongane Wally Serote | No Baby Must Weep | 8 |
| Douglas Livingstone | The Skull in the Mud | 21 |
| Douglas Livingstone | Giovanni Jacopo Meditates on the High-IQ | 9 |
+———————-+——————————————+—————–+

The AVG(), for example,
of the two Douglas Livingstone results is 15, and the MIN() is 9, matching with
the results in the earlier query.

Here are two more examples,
showing the difference between HAVING and WHERE:


mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer GROUP BY poet HAVING poet > ‘E’;
+———————-+——+——+——–+——+
| poet | max | min | avg | sum |
+———————-+——+——+——–+——+
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+——–+——+
1 row in set (0.00 sec)

mysql> SELECT poet,
MAX(copies_in_stock) AS max,
MIN(copies_in_stock) AS min,
AVG(copies_in_stock) AS avg,
SUM(copies_in_stock) AS sum
FROM writer WHERE poet > ‘E’ GROUP BY poet;
+———————-+——+——+——–+——+
| poet | max | min | avg | sum |
+———————-+——+——+——–+——+
| Mongane Wally Serote | 8 | 2 | 4.3333 | 13 |
+———————-+——+——+——–+——+

Note that they both return
the same results. The poet column appears in the list of columns, so
poets with a name starting after ‘E’ are returned. In this case, it makes no
difference to the results which way you apply the condition – the same records
are always excluded. However, the positioning of the condition does make a
difference to the optimization. Since the WHERE clause is carried out first,
while the HAVING clause is carried out last, after all optimizations, it
usually makes more sense to place a condition in the WHERE clause, and save the
HAVING clause for conditions that are applied to fields,

Changes to the HAVING clause in MYSQL
5.0.2

Older versions of MySQL
allowed the HAVING clause to refer to any field listed after the SELECT
statement. The SQL standard requires the HAVING clause to also accept fields
mentioned in the GROUP BY column, even if they are not mentioned in the SELECT
expression. For example, the following query works in MYSQL 5.0.2 and beyond,
but not in earlier versions:

mysql> SELECT SUM(copies_in_stock) sum GROUP BY poet HAVING poet > 'E';

Understanding the HAVING
and GROUP BY clauses is a good start to SQL mastery. Nothing beats using it in practice
though, so get coding, and good luck!

»


See All Articles by Columnist
Ian Gilfillan

Ian Gilfillan
Ian Gilfillan
Ian Gilfillan lives in Cape Town, South Africa. He is the author of the book 'Mastering MySQL 4', published by Sybex, and has been working with MySQL since 1997. These days he develops mainly in PHP and MySQL, although confesses to starting out with BASIC and COBOL, way back when, and still has a soft spot for Perl. He developed South Africa's first online grocery store, and has developed and taught internet development and other technical courses for various institutions. He has majors in Programming and Information Systems, as well as English and Philosophy. For 5 years he was Lead Developer and IT Manager for Independent Online, South Africa's premier news portal. However, he has now 'retired' from fulltime work, and is hoping that his next book will be more in the style of William Blake and Allen Ginsberg.

Latest Articles