Subqueries that can't be rewritten as
joins
Standard SQL does not
support joins in DELETE statements. If you wanted to delete all records in animal
that did not contain an associated food, you would be able to do it quite
simply with a subquery, as follows:
mysql> DELETE FROM animal
WHERE id NOT IN (SELECT animal_id
FROM animal_food
WHERE animal_id=animal.id);
Query OK, 2 rows affected (0.00 sec)
Only the two records
remain:
mysql> SELECT * FROM animal;
+----+---------------------+
| id | name |
+----+---------------------+
| 2 | Chacma Baboon |
| 4 | Small Grey Mongoose |
+----+---------------------+
However, the subquery
inside a DELETE statement syntax is not supported in MySQL 4.0. To achieve the
same result, you would need to pass the burden to your application. You would
first have to SELECT the records you want to delete, then place the results
into a variable, and use this variable to create a DELETE statement. Your
application logic would be as follows:
SELECT id FROM animal
WHERE NOT EXISTS (SELECT animal_id
FROM animal_food
WHERE animal.id=animal_id);
Create a string variable
called $results containing the comma-separated results from the former query
DELETE FROM animal WHERE ID IN ($results)
However, it is not only
DELETE statement subqueries that cannot be rewritten. Some SELECT subqueries
are also impossible to rewrite. Here is another example:
mysql> SELECT name FROM animal WHERE id=(SELECT MAX(id) FROM animal_food);
+---------------------+
| name |
+---------------------+
| Small Grey Mongoose |
+---------------------+
The MAX() function
prevents this from being written as a join. MySQL 4.1 is a necessity for doing
these sorts of queries.
Can subqueries be better?
Let's look at a more
complex example:
mysql> SELECT animal.id,name FROM animal
WHERE animal.id IN (1,2,3) AND EXISTS (SELECT NULL FROM animal_food
WHERE animal_food.id>2 AND animal.id=animal_food.id ) ORDER BY name;
+----+-----------+
| id | name |
+----+-----------+
| 3 | Porcupine |
+----+-----------+
This can also be
rewritten as a join, as follows:
mysql> SELECT DISTINCT animal.id,name FROM animal,animal_food
WHERE animal.id IN (1,2,3) AND animal_food.id > 2 AND animal.id=animal_food.id ORDER BY name;
+----+-----------+
| id | name |
+----+-----------+
| 3 | Porcupine |
+----+-----------+
Which is better? EXPLAIN
shows us the following:
Click for full EXPLAIN SELECT(3) code
With a subquery, MySQL
has to examine 4 rows. In addition, it does not use an index. Without the
subquery, MySQL needs to examine 3 rows and uses an index. However, it also
uses a temporary table, normally an indication of a sub-optimal query. The
question of which is better is then quite complex, and highly dependant on the
dataset, the tuning parameters (and the DBMS - do not assume MySQL optimizes
the same way as other DBMS'). Does the temporary table have to be created on
the disk or in memory? How large is the table that is going to be scanned
without an index? I suggest doing some thorough benchmarking in your
environment to see which suits you best.
Uncorrelated and Correlated subqueries
Let's further complicate
matters, and rewrite the correlated subquery as an uncorrelated subquery. The
main difference between the two kinds is that, in a correlated subquery, the
inner query is dependant upon the outer query, while, with an uncorrelated
query, the inner query does not depend upon the outer query. Instead, it runs
just once. Whether this is better or not again depends on your data set. If the
number of rows returned by the inner part of the uncorrelated query is relatively
large, it can counteract the benefit of only running once. The inner part of
the correlated query needs to be performed a number of times - again the
efficiency of this depends on your situation. The advice of some older texts to
always use a correlated query is flawed. Here is the query as an uncorrelated
query:
mysql> SELECT animal.id,name FROM animal
WHERE animal.id IN (1,2,3) AND animal.id
IN (SELECT animal_id FROM animal_food WHERE animal_food.id>2 ) ORDER BY name;
+----+---------------+
| id | name |
+----+---------------+
| 2 | Chacma Baboon |
+----+---------------+
And using EXPLAIN:

Click for full EXPLAIN SELECT (4) code
Sixteen rows need to be
examined. Even though there is no temporary table, in our limited data set this
is not the most efficient query. Nevertheless, results will vary. There is no
easy answer as to which is best. In this case "it depends,"
"sometimes one, sometimes the other" really is the best answer! So
benchmark your results in your environment, and you could see significant
improvements. Good luck!
»
See All Articles by Columnist Ian Gilfillan