Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MySQL

Posted Jul 21, 2010

SQLAntipatterns: Avoiding the Pitfalls of Database Programming - Page 2

By DatabaseJournal.com Staff

Solution: Use Columns Unambiguously

The sections that follow describe several ways you can resolve this antipattern and write unambiguous queries.

Query Only Functionally Dependent Columns

The most straightforward solution is to eliminate ambiguous columns from the query.


SELECT product_id, MAX(date_reported) AS latest 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

The query reveals the date of the latest bug per product, even though it doesn’t report the bug_id corresponding to that latest bug. Sometimes this is enough, so don’t overlook a simple solution.

Using a Correlated Subquery

A correlated subquery contains a reference to the outer query and so produces different results for each row of the outer query. We can use this to find the latest bug per product by running a subquery to search for bugs with the same product and a greater date. When the subquery finds none, the bug in the outer query is the latest.


SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id 
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id) 
WHERE NOT EXISTS
   (SELECT * FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
    WHERE bp1.product_id = bp2.product_id      
	  AND b1.date_reported < b2.date_reported);

Use this solution as a simple solution that is readable and easy to code. However, keep in mind that this solution isn’t likely to be the best for performance, because correlated subqueries are executed once for each row of the outer query.

Using a Derived Table

You can use a subquery as a derived table, producing an interim result that contains only the product_id and the corresponding greatest bug report date for each product. Then use this result to join against the tables so that the query result contains only bugs with the latest date per product.


SELECT m.product_id, m.latest, b1.bug_id 
FROM Bugs b1 JOIN BugsProducts bp1 USING (bug_id) 
JOIN (SELECT bp2.product_id, MAX(b2.date_reported) AS latest
       FROM Bugs b2 JOIN BugsProducts bp2 USING (bug_id)
	    GROUP BY bp2.product_id) m
   ON (bp1.product_id = m.product_id
       AND b1.date_reported = m.latest);

product_id

latest

bug_id

1

2010-06-01

2248

2

2010-02-16

3456

2

2010-02-16

5150

3

2010-01-01

5678

Notice that you can get multiple rows per product if the latest date returned by the subquery matches multiple rows. If you need to ensure a single row per product_id, you can use another grouping function in the outer query:


SELECT m.product_id, m.latest, MAX(b1.bug_id) AS latest_bug_id 
FROM Bugs b1  
JOIN (SELECT product_id, MAX(date_reported) AS latest
       FROM Bugs b2 JOIN BugsProducts USING (bug_id)       
	   GROUP BY product_id) m
   ON (b1.date_reported = m.latest) 
GROUP BY m.product_id, m.latest;

product_id

latest

bug_id

1

2010-06-01

2248

2

2010-02-16

5150

3

2010-01-01

5678

Use the derived table solution as a more scalable alternative to the correlated subquery. The derived table is noncorrelated, so most database brands should be able to execute the subquery once. However, the database must store the interim result set in a temporary table, so this still isn’t the best for performance.

Using a JOIN

You can create a join that tries to match against a set of rows that may not exist. This type of join is called an outer join. Where the matching rows don’t exist, null is used for all columns in that nonexistent row. So, where the query finds null, we know no such row was found.


SELECT bp1.product_id, b1.date_reported AS latest, b1.bug_id 
FROM Bugs b1 JOIN BugsProducts bp1 ON (b1.bug_id = bp1.bug_id) 
LEFT OUTER JOIN (Bugs AS b2 JOIN BugsProducts AS bp2 ON
     (b2.bug_id = bp2.bug_id))
   ON (bp1.product_id = bp2.product_id AND
      (b1.date_reported < b2.date_reported OR      
	  b1.date_reported = b2.date_reported
          AND b1.bug_id < b2.bug_id)) 
WHERE b2.bug_id IS NULL;

product_id

latest

bug_id

1

2010-06-01

2248

2

2010-02-16

5150

3

2010-01-01

5678

It takes a few minutes of gazing at this query, and perhaps some doodles on notepaper, for most people to see how it works. But once you do, this technique can be an important tool.

Use the JOIN solution when the scalability of the query over large sets of data is important. Although it’s a tougher concept to grasp and therefore more difficult to maintain, it often scales better than a subquery-based solution. Remember to measure the performance of several query forms, instead of assuming that one performs better than the other.

Using an Aggregate Function for Extra Columns

You can make the extra column comply with the Single-Value Rule by applying another aggregate function to it.


SELECT product_id, MAX(date_reported) AS latest,
   MAX(bug_id) AS latest_bug_id 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

Use this solution only when you can rely on the latest bug_id being the bug with the latest date, in other words, if bugs are guaranteed to be reported in chronological order.

Concatenating All Values per Group

Finally, you can use another aggregate function on bug_id to avoid violating the Single-Value Rule. MySQL and SQLite support a function GROUP_CONCAT( ) that concatenates all the values in the group into one value. By default, this is a comma-separated string.


SELECT product_id, MAX(date_reported) AS latest,
   GROUP_CONCAT(bug_id) AS bug_id_list 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

product_id

latest

bug_id

1

2010-06-01

1234, 2248

2

2010-02-16

3456, 4077, 5150

3

2010-01-01

5678, 8063

This query doesn’t reveal which bug_id corresponds to the latest date; the bug_id_list includes all bug_id values in each group.

Another disadvantage of this solution is that it isn’t standard SQL. and other brands of database don’t support this function. Some brands of database support custom functions and custom aggregate functions. For example, here’s the solution for PostgreSQL:


CREATE AGGREGATE GROUP_ARRAY (
   BASETYPE = ANYELEMENT,   
   SFUNC = ARRAY_APPEND,   
   STYPE = ANYARRAY,   INITCOND = '{}' 
);

SELECT product_id, MAX(date_reported) AS latest,
   ARRAY_TO_STRING(GROUP_ARRAY(bug_id), ',' ) AS bug_id_list, 
FROM Bugs JOIN BugsProducts USING (bug_id) 
GROUP BY product_id;

Some other brands of database don’t support custom functions, so the solution may require writing a stored procedure to loop over an nongrouped query result, concatenating values manually.

Use this solution when you expect the extra column to have a single value per group but the column still violates the Single-Value Rule.

Title: SQLAntipatterns: Avoiding the Pitfalls of Database Programming
Author: Bill Karwin
Publisher: The Pragmatic Bookshelf
ISBN-10: 1-934356-55-7
ISBN-13: 978-1-934356-55-5
Printed on acid-free paper.
P1.0 printing, May 2010
Version: 2010-7-8



MySQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MySQL Forum
Topic By Replies Updated
MySQL in high availability and transction secure banking application klamor 1 August 28th, 10:24 AM
MySQL rollback UAL225 1 August 28th, 10:15 AM
Browsing a DB file that uses MySql finleytech 1 July 26th, 10:51 AM
php cookie won't pass variable to next page ITdevGirl 0 June 22nd, 12:13 PM