The HAVING and GROUP BY SQL clauses

February 8, 2005

Introduction

This month's topic is one that often baffles inexperienced SQL coders. We will look at the GROUP BY clause, and then the difference between conditions placed in the WHERE clause, and the HAVING clause. I have used MySQL to test all of the examples, but most of them should work without any problems on other DBMS' too. We are going to be using the following test data:

CREATE TABLE `writer` (
  `poet` varchar(50) default NULL,
  `anthology` varchar(40) default NULL,
  `copies_in_stock` tinyint(4) 
  		default NULL
);
 
INSERT INTO `writer` VALUES 
('Mongane Wally Serote','Tstetlo',3),
('Mongane Wally Serote',
	'No Baby Must Weep',8),
('Mongane Wally Serote',
	'A Tough Tale',2),
('Douglas Livingstone',
	 'The Skull in the Mud',21),
('Douglas Livingstone',
	'A Littoral Zone',2);

Note that this data structure is not normalized, and is not recommended for a production system - it is here as a quick and easy way to show you the concepts. If you cannot see what is wrong with it, you need to read more about Database Normalization!

GROUP BY

Let's start with the GROUP BY clause. You can use this to group values from a column, and, if you wish, perform calculations on that column. Let's look at a simple example:

mysql> SELECT poet,SUM(copies_in_stock) FROM writer GROUP BY poet;
+----------------------+----------------------+
| poet                 | SUM(copies_in_stock) |
+----------------------+----------------------+
| Douglas Livingstone  |                   23 |
| Mongane Wally Serote |                   13 |
+----------------------+----------------------+

Here, there are many records containing the same poet data (which is the reason why it is not normalized). The GROUP BY function puts all of them together, allowing you to perform a calculation, such as the SUM() used to find the total number of copies_in_stock. Here is another example:

mysql> SELECT poet,copies_in_stock FROM writer GROUP BY poet;
+----------------------+-----------------+
| poet                 | copies_in_stock |
+----------------------+-----------------+
| Douglas Livingstone  |              21 |
| Mongane Wally Serote |               3 |
+----------------------+-----------------+

Just what are the values in the copies_in_stock column referring to? What do they mean? Perhaps looking at the full list of records may shed some light.

mysql> SELECT * FROM writer;
+----------------------+----------------------+-----------------+
| poet                 | anthology            | copies_in_stock |
+----------------------+----------------------+-----------------+
| Mongane Wally Serote | Tstetlo              |               3 |
| Mongane Wally Serote | No Baby Must Weep    |               8 |
| Mongane Wally Serote | A Tough Tale         |               2 |
| Douglas Livingstone  | The Skull in the Mud |              21 |
| Douglas Livingstone  | A Littoral Zone      |               2 |
+----------------------+----------------------+-----------------+

The results are being grouped by the field poet, but there is no sensible way to return a value from the copies_in_stock field. Performing a mathematical function, such as SUM() or AVG() would make sense, but what exactly is this query asking of the database? I don't know, and neither does MySQL, so it simply returns the first result it finds (other DBMS' may return different results). The result returned means nothing, and may not be consistent, as it depends upon the order the DBMS happens to have placed and found the record, so there is no point in running a query like this.

Here is another example, showing a number of common functions:

mysql> SELECT poet, 
MAX(copies_in_stock) max, 
MIN(copies_in_stock) min, 
AVG(copies_in_stock) avg, 
SUM(copies_in_stock) sum 
FROM writer GROUP BY poet;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |    2 | 11.5000 |   23 |
| Mongane Wally Serote |    8 |    2 |  4.3333 |   13 |
+----------------------+------+------+---------+------+

The column titles have been given aliases to make the output more readable (note that aliases more often have the keyword AS in front of them, so this identical query may be more familiar to some):

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer GROUP BY poet;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |    2 | 11.5000 |   23 |
| Mongane Wally Serote |    8 |    2 |  4.3333 |   13 |
+----------------------+------+------+---------+------+

The HAVING clause

Beginners are commonly confused about the difference between the WHERE and HAVING clauses. They are both conditions, and look similar after all. To start with, let's have a look at these four queries:

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer WHERE copies_in_stock > 5 GROUP BY poet;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |   21 | 21.0000 |   21 |
| Mongane Wally Serote |    8 |    8 |  8.0000 |    8 |
+----------------------+------+------+---------+------+

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer GROUP BY poet HAVING copies_in_stock > 5;
ERROR 1054 (42S22): Unknown column 'copies_in_stock' in 'having clause'

The first query only returns results where there are more than five copies in stock. Since each poet only has one title that abides by this condition, the DBMS applies the mathematical calculations to just that one record. Consequently, the results of the MAX(), MIN(), AVG() and SUM() calculations are all the same. The second query attempts to do the same, but using a HAVING clause instead of a WHERE clause. It fails, because MySQL expects a column mentioned in a HAVING clause to be present in the list of columns (see below for a discussion on changes in this behavior with MySQL 5.0.2). This could include a function, so the following query works:

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer GROUP BY poet HAVING max > 5;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |    2 | 11.5000 |   23 |
| Mongane Wally Serote |    8 |    2 |  4.3333 |   13 |
+----------------------+------+------+---------+------+

Note that the results in this case are identical, but this would not necessarily always be the case. Let us add another record to illustrate, and then run the same two queries again:

mysql> INSERT INTO writer VALUES('Douglas Livingstone',
	'Giovanni Jacopo Meditates on the High-IQ Haiku',9);

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer GROUP BY poet HAVING max > 5;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |    2 | 10.6667 |   32 |
| Mongane Wally Serote |    8 |    2 |  4.3333 |   13 |
+----------------------+------+------+---------+------+
2 rows in set (0.00 sec)

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer WHERE copies_in_stock > 5 GROUP BY poet;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |    9 | 15.0000 |   30 |
| Mongane Wally Serote |    8 |    8 |  8.0000 |    8 |
+----------------------+------+------+---------+------+

If you are unclear about why this is the case, take some time to reflect on the differences. The first query (with the condition in the HAVING clause) applies the condition after all records have been returned. Since both of the records have a max greater than 5, there is effectively no limiting of the results. The results are the same as they would be without any conditions, as below:

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer GROUP BY poet;
+----------------------+------+------+---------+------+
| poet                 | max  | min  | avg     | sum  |
+----------------------+------+------+---------+------+
| Douglas Livingstone  |   21 |    2 | 10.6667 |   32 |
| Mongane Wally Serote |    8 |    2 |  4.3333 |   13 |
+----------------------+------+------+---------+------+

The second query, with the condition in the WHERE clause, applies the condition before starting to calculate the results of the functions. Therefore, for Douglas Livingstone, only two results match the condition, and the functions then go to work on this limited set of results. Have a look at the following results, which illustrate how the WHERE clause determines the output.

mysql> SELECT * FROM writer WHERE copies_in_stock > 5;
+----------------------+------------------------------------------+-----------------+
| poet                 | anthology                                | copies_in_stock |
+----------------------+------------------------------------------+-----------------+
| Mongane Wally Serote | No Baby Must Weep                        |               8 |
| Douglas Livingstone  | The Skull in the Mud                     |              21 |
| Douglas Livingstone  | Giovanni Jacopo Meditates on the High-IQ |               9 |
+----------------------+------------------------------------------+-----------------+

The AVG(), for example, of the two Douglas Livingstone results is 15, and the MIN() is 9, matching with the results in the earlier query.

Here are two more examples, showing the difference between HAVING and WHERE:

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer GROUP BY poet HAVING poet > 'E';
+----------------------+------+------+--------+------+
| poet                 | max  | min  | avg    | sum  |
+----------------------+------+------+--------+------+
| Mongane Wally Serote |    8 |    2 | 4.3333 |   13 |
+----------------------+------+------+--------+------+
1 row in set (0.00 sec)

mysql> SELECT poet, 
MAX(copies_in_stock) AS max, 
MIN(copies_in_stock) AS min, 
AVG(copies_in_stock) AS avg, 
SUM(copies_in_stock) AS sum 
FROM writer WHERE poet > 'E' GROUP BY poet;
+----------------------+------+------+--------+------+
| poet                 | max  | min  | avg    | sum  |
+----------------------+------+------+--------+------+
| Mongane Wally Serote |    8 |    2 | 4.3333 |   13 |
+----------------------+------+------+--------+------+

Note that they both return the same results. The poet column appears in the list of columns, so poets with a name starting after 'E' are returned. In this case, it makes no difference to the results which way you apply the condition - the same records are always excluded. However, the positioning of the condition does make a difference to the optimization. Since the WHERE clause is carried out first, while the HAVING clause is carried out last, after all optimizations, it usually makes more sense to place a condition in the WHERE clause, and save the HAVING clause for conditions that are applied to fields,

Changes to the HAVING clause in MYSQL 5.0.2

Older versions of MySQL allowed the HAVING clause to refer to any field listed after the SELECT statement. The SQL standard requires the HAVING clause to also accept fields mentioned in the GROUP BY column, even if they are not mentioned in the SELECT expression. For example, the following query works in MYSQL 5.0.2 and beyond, but not in earlier versions:

mysql> SELECT SUM(copies_in_stock) sum GROUP BY poet HAVING poet > 'E';

Understanding the HAVING and GROUP BY clauses is a good start to SQL mastery. Nothing beats using it in practice though, so get coding, and good luck!

» See All Articles by Columnist Ian Gilfillan








The Network for Technology Professionals

Search:

About Internet.com

Legal Notices, Licensing, Permissions, Privacy Policy.
Advertise | Newsletters | E-mail Offers