Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MySQL

Posted Jul 21, 2010

SQLAntipatterns: Avoiding the Pitfalls of Database Programming

By DatabaseJournal.com Staff

Learn how you too can stamp out common database errors from an SQL Expert's playbook and learn a lot about relational databases along the way. 'Ambiguous Groups' is extracted from 'SQL Antipatterns', published by the Pragmatic Bookshelf.

Title: SQLAntipatterns: Avoiding the Pitfalls of Database Programming
Author: Bill Karwin
Publisher: The Pragmatic Bookshelf
ISBN-10: 1-934356-55-7
ISBN-13: 978-1-934356-55-5
Printed on acid-free paper.
P1.0 printing, May 2010
Version: 2010-7-8

Ambiguous Groups

Suppose your boss needs to know which projects in the bugs database are still active and which projects have been abandoned. One report he asks you to generate is the latest bug reported per product. You write a query using the MySQL database to calculate the greatest value in the date_reported column per group of bugs sharing a given product_id. The report looks like this:

product_name

date_reported

bug_id

 

Open RoundFile

2010-06-01

1234

 

Visual TurboBuilder

2010-02-16

3456

 

ReConsider

2010-01-01

5678

 

Your boss is a detail-oriented person, and he spends some time looking up each bug listed in the report. He notices that the row listed as the most recent for “Open RoundFile” shows a bug_id that isn’t the latest bug. The full data shows the discrepancy:

product_name

date_reported

bug_id

 

Open RoundFile

2009-12-19

1234

This bug_id...

Open RoundFile

2010-06-01

2248

doesn’t match this date

Visual TurboBuilder

2010-02-16

3456

 

Visual TurboBuilder

2010-02-10

4077

 

Visual TurboBuilder

2010-02-16

5150

 

ReConsider

2010-01-01

5678

 

ReConsider

2009-11-09

8063

 

How can you explain this problem? Why does it affect one product but not the others? How can you get the desired report?

Objective: Get Row with Greatest Value per Group

Most programmers who learn SQL get to the stage of using GROUP BY in a query, applying some aggregate function to groups of rows, and getting a result with one row per group. This is a powerful feature that makes it easy to get a wide variety of complex reports using relatively little code.

For example, a query to get the latest bug reported for each product in the bugs database looks like this:


SELECT product_id, MAX(date_reported) AS latest 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

A natural extension to this query is to request the ID of the specific bug with the latest date reported:


SELECT product_id, MAX(date_reported) AS latest, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

However, this query results in either an error or an unreliable answer. This is a common source of confusion for programmers using SQL. The objective is to run a query that not only reports the greatest value in a group (or the least value or the average value) but also includes other attributes of the row where that value is found.

Antipattern: Reference Nongrouped Columns

The root cause of this antipattern is simple, and it reveals a common misconception that many programmers have about how grouping queries work in SQL.

The Single-Value Rule

The rows in each group are those rows with the same value in the column or columns you name after GROUP BY. For example, in the following query, there is one row group for each distinct value in product_id.



SELECT product_id, MAX(date_reported) AS latest 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

Every column in the select-list of a query must have a single value row per row group. This is called the Single-Value Rule. Columns named in the GROUP BY clause are guaranteed to be exactly one value per group, no matter how many rows the group matches.

The MAX( ) expression is also guaranteed to result in a single value for each group: the highest value found in the argument of MAX( ) over all the rows in the group.

However, the database server can’t be so sure about any other column named in the select-list. It can’t always guarantee that the same value occurs on every row in a group for those other columns.

Text Box: GROUP BY and DISTINCT
SQL supports a query modifier called DISTINCT that reduces the rows of the query result so that every row is unique. For example, the following query reports who reported bugs and which days they reported bugs, but only one row per date and person:
SELECT DISTINCT date_reported, reported_by FROM Bugs;
A grouping query can achieve the same result by omitting any aggregate function. The query result is reduced to one row for each distinct pair of values in the column named in the GROUP BY clause:
SELECT date_reported, reported_by FROM Bugs GROUP BY date_reported, reported_by;
Both queries produce the same result and should be optimized and executed similarly, so the difference in this example is only a matter of preference.

SELECT product_id, MAX(date_reported) AS latest, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

In this example, there are many distinct values for bug_id for a given product_id, because the BugsProducts table associates multiple bugs to a given product. In a grouping query that reduces to a single row per product, there’s no way to represent all the values of bug_id.

Since there is no guarantee of a single value per group in the “extra” columns, the database assumes that they violate the Single-Value Rule. Most brands of database report an error if you try to run any query that tries to return a column other than those columns named in the GROUP BY clause or as arguments to aggregate functions.

MySQL and SQLite have different behavior from other brands of database, which we’ll explore in Legitimate Uses of the Antipattern.

Do-What-I-Mean Queries

The common misconception that programmers have is that SQL can guess which bug_id you want in the report, based on the fact that MAX( ) is used in another column. Most people assume that if the query fetches the greatest value, then other columns named will naturally take their value from the same row where that greatest value occurs.

Unfortunately, SQL can’t make this inference in several cases:

  • If two bugs have the exact same value for date_reported and that is the greatest value in the group, which value of bug_id should the query report?
  • If you query for two different aggregate functions, for example MAX( ) and MIN( ), these probably correspond to two different rows in the group. Which bug_id should the query return for this group?

SELECT product_id, MAX(date_reported) AS latest,
   MIN(date_reported) AS earliest, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;
  • If none of the rows in the table matches the value returned by the aggregate function, what is the value of bug_id? This is commonly true for the functions AVG( ), COUNT( ), and SUM( ).

SELECT product_id, SUM(hours) AS total_project_estimate, bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

These are examples of why the Single-Value Rule is important. Not every query that fails to follow this rule would produce an ambiguous result, but many do. It would be clever if the database could tell an ambiguous query from an unambiguous one and produce an error only when the data contains ambiguity. But that would not be good for application reliability; it would mean that the same query might be valid or invalid, depending on the state of data.

How to Recognize the Antipattern

In most brands of database, writing a query that violates the Single-Value Rule should elicit an error immediately as you prepare the query.

The following are examples of error messages given by some brands of database:

  • Firebird 2.1:
    Invalid expression in the select list (not contained in either 
    an aggregate function or the GROUP BY clause)
    
  • IBM DB2 9.5:
    
    An expression starting with "BUG_ID" specified in a SELECT 
    clause, HAVING clause, or ORDER BY clause is not specified in 
    the GROUP BY clause or it is in a SELECT clause, HAVING clause, 
    or ORDER BY clause with a column function and no GROUP BY clause
    is specified.
  • Microsoft SQL Server 2008:
    Column 'Bugs.bug_id' is invalid in the
    select list because it is not contained in either an aggregate function or the
    GROUP BY clause.
    
  • MySQL 5.1, after setting the ONLY_FULL_GROUP SQL mode to disallow ambiguous queries.
    'bugs.b.bug_id' isn't in GROUP BY
  • Oracle 10.2:
    not a GROUP BY expression
  • PostgreSQL 8.3:
    column "bp.bug_id" must
    appear in the GROUP BY clause or be used in an aggregate function

In SQLite and in MySQL, ambiguous columns may contain unexpected and unreliable values. In MySQL, the value returned is from the first row in the group, where first corresponds to physical storage. SQLite gives the opposite result: the value is from the last row in the group. In both cases, the behavior is not documented, and these databases aren’t obligated to work the same in future versions. It’s your responsibility to notice these cases and to design your queries to avoid ambiguity.

Legitimate Uses of the Antipattern

As we’ve seen, MySQL and SQLite can’t guarantee a reliable result for a column that doesn’t fit the Single-Value Rule. There are cases when you can take advantage of the fact that these databases enforce the rule less strictly than other brands.


SELECT b.reported_by, a.account_name 
FROM Bugs b JOIN Accounts a ON (b.reported_by = a.account_id) 
GROUP BY b.reported_by;

In the previous query, the account_name column technically violates the Single-Value Rule, since it’s named neither in the GROUP BY clause nor in an aggregate function. Nevertheless, there is only one value possible for account_name in each group; the groups are based on Bugs.reported_by, which is a foreign key to the Accounts table. Therefore, the groups correspond one-to-one with rows in the Accounts table.

In other words, if you know the value of reported_by, then you know the value of account_name unambiguously, like if you had queried by the primary key of the Accounts table.

This kind of unambiguous relationship is called a functional dependency. The most common example of this is between the primary key of a table and the table’s attributes: account_name is a functional dependency of its primary key, account_id. If you group a query by a table’s primary key column(s), then the groups correspond to a single row of that table, and therefore all other columns of the same table must have a single value per group.

Bugs.reported_by has a similar relationship with the dependent attributes of the Accounts table, because it references the primary key of the Accounts table. When the query groups by the reported_by column, which is a foreign key, the attributes of the Accounts table are functionally dependent, and the query result contains no ambiguity.

However, most brands of database still return an error. Not only is this the behavior required by the SQL standard, but it’s not too expensive to figure out functional dependencies on the fly.[1] But if you use MySQL or SQLite and you’re careful to query only functionally dependent columns, you can use this kind of grouping query and still avoid problems of ambiguity.



MySQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM