Solution: Use Columns Unambiguously
The sections that follow describe several ways you can resolve
this antipattern and write unambiguous queries.
Query Only Functionally Dependent Columns
The most straightforward solution is to eliminate ambiguous
columns from the query.
SELECT product_id, MAX(date_reported) AS latest
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
The query reveals the date of the latest bug per product, even
though it doesnt report the bug_id corresponding to that latest bug. Sometimes
this is enough, so dont overlook a simple solution.
Using a Correlated Subquery
A correlated subquery contains a reference to the outer query and
so produces different results for each row of the outer query. We can use this
to find the latest bug per product by running a subquery to search for bugs
with the same product and a greater date. When the subquery finds none, the bug
in the outer query is the latest.
SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id)
WHERE NOT EXISTS
(SELECT * FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
WHERE bp1.product_id = bp2.product_id
AND b1.date_reported < b2.date_reported);
Use this solution as a simple solution that is readable and easy
to code. However, keep in mind that this solution isnt likely to be the best
for performance, because correlated subqueries are executed once for each row
of the outer query.
Using a Derived Table
You can use a subquery as a derived table, producing an
interim result that contains only the product_id and the corresponding greatest
bug report date for each product. Then use this result to join against the
tables so that the query result contains only bugs with the latest date per
product.
SELECT m.product_id, m.latest, b1.bug_id
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id)
JOIN (SELECT bp2.product_id, MAX(b2.date_reported) AS latest
FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
GROUP BY bp2.product_id) m
ON (bp1.product_id = m.product_id
AND b1.date_reported = m.latest);
|
product_id
|
latest
|
bug_id
|
|
1
|
2010-06-01
|
2248
|
|
2
|
2010-02-16
|
3456
|
|
2
|
2010-02-16
|
5150
|
|
3
|
2010-01-01
|
5678
|
Notice that you can get multiple rows per product if the latest
date returned by the subquery matches multiple rows. If you need to ensure a
single row per product_id, you can use another grouping function in the outer
query:
SELECT m.product_id, m.latest, MAX(b1.bug_id) AS latest_bug_id
FROM Bugs b1
JOIN (SELECT product_id, MAX(date_reported) AS latest
FROM Bugs b2 JOIN BugsProducts USING (bug_id)
GROUP BY product_id) m
ON (b1.date_reported = m.latest)
GROUP BY m.product_id, m.latest;
|
product_id
|
latest
|
bug_id
|
|
1
|
2010-06-01
|
2248
|
|
2
|
2010-02-16
|
5150
|
|
3
|
2010-01-01
|
5678
|
Use the derived table solution as a more scalable alternative to
the correlated subquery. The derived table is noncorrelated, so most database
brands should be able to execute the subquery once. However, the database must
store the interim result set in a temporary table, so this still isnt the best
for performance.
Using a JOIN
You can create a join that tries to match against a set of rows
that may not exist. This type of join is called an outer join. Where the
matching rows dont exist, null is used for all columns in that nonexistent
row. So, where the query finds null, we know no such row was found.
SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id
FROM Bugs b1 JOIN BugsProducts bp1 ON (b1.bug_id = bp1.bug_id)
LEFT OUTER JOIN (Bugs AS b2 JOIN BugsProducts AS bp2 ON
(b2.bug_id = bp2.bug_id))
ON (bp1.product_id = bp2.product_id AND
(b1.date_reported < b2.date_reported OR
b1.date_reported = b2.date_reported
AND b1.bug_id < b2.bug_id))
WHERE b2.bug_id IS NULL;
|
product_id
|
latest
|
bug_id
|
|
1
|
2010-06-01
|
2248
|
|
2
|
2010-02-16
|
5150
|
|
3
|
2010-01-01
|
5678
|
It takes a few minutes of gazing at this query, and perhaps some
doodles on notepaper, for most people to see how it works. But once you do,
this technique can be an important tool.
Use the JOIN solution when the scalability of the query over
large sets of data is important. Although its a tougher concept to grasp and
therefore more difficult to maintain, it often scales better than a
subquery-based solution. Remember to measure the performance of several query
forms, instead of assuming that one performs better than the other.
Using an Aggregate Function for Extra Columns
You can make the extra column comply with the Single-Value Rule
by applying another aggregate function to it.
SELECT product_id, MAX(date_reported) AS latest,
MAX(bug_id) AS latest_bug_id
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
Use this solution only when you can rely on the latest bug_id
being the bug with the latest date, in other words, if bugs are guaranteed to
be reported in chronological order.
Concatenating All Values per Group
Finally, you can use another aggregate function on bug_id to
avoid violating the Single-Value Rule. MySQL and SQLite support a function
GROUP_CONCAT( ) that concatenates all the values in the group into one value.
By default, this is a comma-separated string.
SELECT product_id, MAX(date_reported) AS latest,
GROUP_CONCAT(bug_id) AS bug_id_list
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
|
product_id
|
latest
|
bug_id
|
|
1
|
2010-06-01
|
1234, 2248
|
|
2
|
2010-02-16
|
3456, 4077, 5150
|
|
3
|
2010-01-01
|
5678, 8063
|
This query doesnt reveal which bug_id corresponds to the latest
date; the bug_id_list includes all bug_id values in each group.
Another disadvantage of this solution is that it isnt standard
SQL. and other brands of database dont support this function. Some brands of
database support custom functions and custom aggregate functions. For example,
heres the solution for PostgreSQL:
CREATE AGGREGATE GROUP_ARRAY (
BASETYPE = ANYELEMENT,
SFUNC = ARRAY_APPEND,
STYPE = ANYARRAY, INITCOND = '{}'
);
SELECT product_id, MAX(date_reported) AS latest,
ARRAY_TO_STRING(GROUP_ARRAY(bug_id), ',' ) AS bug_id_list,
FROM Bugs JOIN BugsProducts USING (bug_id)
GROUP BY product_id;
Some other brands of database dont support custom functions, so
the solution may require writing a stored procedure to loop over an nongrouped
query result, concatenating values manually.
Use this solution when you expect the extra column to have a
single value per group but the column still violates the Single-Value Rule.
Title: SQLAntipatterns: Avoiding the Pitfalls of Database Programming
Author: Bill Karwin
Publisher: The Pragmatic Bookshelf
ISBN-10: 1-934356-55-7
ISBN-13: 978-1-934356-55-5
Printed on acid-free paper.
P1.0 printing, May 2010
Version: 2010-7-8