Learn how you too can stamp out common database errors from an SQL Expert’s playbook and learn a lot about relational databases along the way. ‘Ambiguous Groups’ is extracted from ‘SQL Antipatterns’, published by the Pragmatic Bookshelf.
Title: SQLAntipatterns: Avoiding the Pitfalls of Database Programming
Author: Bill Karwin
Publisher: The Pragmatic Bookshelf
ISBN-10: 1-934356-55-7
ISBN-13: 978-1-934356-55-5
Printed on acid-free paper.
P1.0 printing, May 2010
Version: 2010-7-8
Ambiguous Groups
Suppose your boss needs to know which projects in the bugs
database are still active and which projects have been abandoned. One report he
asks you to generate is the latest bug reported per product. You write a query
using the MySQL database to calculate the greatest value in the date_reported
column per group of bugs sharing a given product_id. The report looks like
this:
product_name |
date_reported |
bug_id |
|
Open RoundFile |
2010-06-01 |
1234 |
|
Visual TurboBuilder |
2010-02-16 |
3456 |
|
ReConsider |
2010-01-01 |
5678 |
|
Your boss is a detail-oriented person, and he spends some time
looking up each bug listed in the report. He notices that the row listed as the
most recent for “Open RoundFile” shows a bug_id that isn’t the latest bug. The
full data shows the discrepancy:
product_name |
date_reported |
bug_id |
|
Open RoundFile |
2009-12-19 |
1234 |
This bug_id… |
Open RoundFile |
2010-06-01 |
2248 |
doesn’t match this |
Visual TurboBuilder |
2010-02-16 |
3456 |
|
Visual TurboBuilder |
2010-02-10 |
4077 |
|
Visual TurboBuilder |
2010-02-16 |
5150 |
|
ReConsider |
2010-01-01 |
5678 |
|
ReConsider |
2009-11-09 |
8063 |
|
How can you explain this problem? Why does it affect one product
but not the others? How can you get the desired report?
Objective: Get Row with Greatest Value per Group
Most programmers who learn SQL get to the stage of using GROUP BY
in a query, applying some aggregate function to groups of rows, and getting a
result with one row per group. This is a powerful feature that makes it easy to
get a wide variety of complex reports using relatively little code.
For example, a query to get the latest bug reported for each
product in the bugs database looks like this:
SELECT product_id, MAX(date_reported) AS latest
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
A natural extension to this query is to request the ID of the
specific bug with the latest date reported:
SELECT product_id, MAX(date_reported) AS latest, bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
However, this query results in either an error or an unreliable
answer. This is a common source of confusion for programmers using SQL. The
objective is to run a query that not only reports the greatest value in a group
(or the least value or the average value) but also includes other attributes of
the row where that value is found.
Antipattern: Reference Nongrouped Columns
The root cause of this antipattern is simple, and it reveals a
common misconception that many programmers have about how grouping queries work
in SQL.
The Single-Value Rule
The rows in each group are those rows with the same value in the
column or columns you name after GROUP BY. For example, in the following query,
there is one row group for each distinct value in product_id.
SELECT product_id, MAX(date_reported) AS latest
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
Every column in the select-list of a query must have a single
value row per row group. This is called the Single-Value Rule. Columns
named in the GROUP BY clause are guaranteed to be exactly one value per group,
no matter how many rows the group matches.
The MAX( ) expression is also guaranteed to result in a single
value for each group: the highest value found in the argument of MAX( ) over
all the rows in the group.
However, the database server can’t be so sure about any other
column named in the select-list. It can’t always guarantee that the same value
occurs on every row in a group for those other columns.
SELECT product_id, MAX(date_reported) AS latest, bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
In this example, there are many distinct values for bug_id for a
given product_id, because the BugsProducts table associates multiple bugs to a
given product. In a grouping query that reduces to a single row per product,
there’s no way to represent all the values of bug_id.
Since there is no guarantee of a single value per group in the
“extra” columns, the database assumes that they violate the Single-Value Rule.
Most brands of database report an error if you try to run any query that tries
to return a column other than those columns named in the GROUP BY clause or as
arguments to aggregate functions.
MySQL and SQLite have different behavior from other brands of
database, which we’ll explore in Legitimate Uses of the Antipattern.
Do-What-I-Mean Queries
The common misconception that programmers have is that SQL can
guess which bug_id you want in the report, based on the fact that MAX( ) is
used in another column. Most people assume that if the query fetches the
greatest value, then other columns named will naturally take their value from
the same row where that greatest value occurs.
Unfortunately, SQL can’t make this inference in several cases:
-
If two bugs have the exact same value for date_reported and that
is the greatest value in the group, which value of bug_id should the query
report? -
If you query for two different aggregate functions, for example
MAX( ) and MIN( ), these probably correspond to two different rows in the
group. Which
bug_id should the query return for this group?
SELECT product_id, MAX(date_reported) AS latest,
MIN(date_reported) AS earliest, bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
- If none of the rows in the table matches the value returned by
the aggregate function, what is the value of bug_id? This is commonly true for
the functions AVG( ), COUNT( ), and SUM( ).
SELECT product_id, SUM(hours) AS total_project_estimate, bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
These are examples of why the Single-Value Rule is important. Not
every query that fails to follow this rule would produce an ambiguous result,
but many do. It would be clever if the database could tell an ambiguous query
from an unambiguous one and produce an error only when the data contains ambiguity.
But that would not be good for application reliability; it would mean that the
same query might be valid or invalid, depending on the state of data.
How to Recognize the Antipattern
In most brands of database, writing a query that violates the
Single-Value Rule should elicit an error immediately as you prepare the query.
The following are examples of error messages given by some brands
of database:
-
Firebird 2.1:
Invalid expression in the select list (not contained in either an aggregate function or the GROUP BY clause)
-
IBM DB2 9.5:
An expression starting with "BUG_ID" specified in a SELECT clause, HAVING clause, or ORDER BY clause is not specified in the GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER BY clause with a column function and no GROUP BY clause is specified.
-
Microsoft SQL Server 2008:
Column 'Bugs.bug_id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
-
MySQL 5.1, after setting the ONLY_FULL_GROUP SQL mode to disallow
ambiguous queries.'bugs.b.bug_id' isn't in GROUP BY
-
Oracle 10.2:
not a GROUP BY expression
-
PostgreSQL 8.3:
column "bp.bug_id" must appear in the GROUP BY clause or be used in an aggregate function
In SQLite and in MySQL, ambiguous columns may contain unexpected
and unreliable values. In MySQL, the value returned is from the first
row in the group, where first corresponds to physical storage. SQLite gives the
opposite result: the value is from the last row in the group. In both
cases, the behavior is not documented, and these databases aren’t obligated to
work the same in future versions. It’s your responsibility to notice these
cases and to design your queries to avoid ambiguity.
Legitimate Uses of the Antipattern
As we’ve seen, MySQL and SQLite can’t guarantee a reliable result
for a column that doesn’t fit the Single-Value Rule. There are cases when you
can take advantage of the fact that these databases enforce the rule less
strictly than other brands.
SELECT b.reported_by, a.account_name
FROM Bugs b JOIN Accounts a ON (b.reported_by = a.account_id)
GROUP BY b.reported_by;
In the previous query, the account_name column technically
violates the Single-Value Rule, since it’s named neither in the GROUP BY clause
nor in an aggregate function. Nevertheless, there is only one value possible
for account_name in each group; the groups are based on Bugs.reported_by, which
is a foreign key to the Accounts table. Therefore, the groups correspond
one-to-one with rows in the Accounts table.
In other words, if you know the value of reported_by, then you
know the value of account_name unambiguously, like if you had queried by the
primary key of the Accounts table.
This kind of unambiguous relationship is called a functional
dependency. The most common example of this is between the primary key of a
table and the table’s attributes: account_name is a functional dependency of
its primary key, account_id. If you group a query by a table’s primary key
column(s), then the groups correspond to a single row of that table, and
therefore all other columns of the same table must have a single value per
group.
Bugs.reported_by has a similar relationship with the dependent
attributes of the Accounts table, because it references the primary key of the
Accounts table. When the query groups by the reported_by column, which is a
foreign key, the attributes of the Accounts table are functionally dependent,
and the query result contains no ambiguity.
However, most brands of database still return an error. Not only
is this the behavior required by the SQL standard, but it’s not too expensive
to figure out functional dependencies on the fly.[1] But if you use
MySQL or SQLite and you’re careful to query only functionally dependent
columns, you can use this kind of grouping query and still avoid problems of
ambiguity.