There are many times as a database developer and administrator that you’ll want to know what entities are being excluded from a straight table join. For instance, it would be instructive for management to identify sales people who haven’t received any bonuses, as these might be deemed to be the underperformers. To provide the requestor with the statistics, you’ll have to determine which sales people are absent from the bonuses table. This is commonly referred to as a Non-matching query, although it is sometimes called a Subtract or Set Difference query as well. MySQL provides not one but three standard means of producing non-matching results. I will be presenting each here, along with another that you might not have considered.
Here are the two sample tables that we’ll execute our queries against:
employees table:
id |
dept_id |
gender |
name |
salary |
emp_id_number |
1 |
2 |
m |
Jon Simpson |
4500 |
1234 |
2 |
4 |
f |
Barbara Breitenmoser |
(NULL) |
9999 |
3 |
3 |
f |
Kirsten Ruegg |
5600 |
3462 |
4 |
1 |
m |
Ralph Teller |
5100 |
6543 |
5 |
2 |
m |
Peter Jonson |
5200 |
9747 |
6 |
2 |
m |
Allan Smithie |
4900 |
6853 |
7 |
4 |
f |
Mary Parker |
5600 |
2223 |
8 |
3 |
f |
Kirsten Ruegg |
4400 |
2765 |
bonuses table:
bonus_id |
emp_id |
amount |
award_date |
1 |
3 |
500 |
2009-10-28 |
2 |
5 |
1000 |
2009-01-03 |
3 |
4 |
250 |
2009-04-22 |
4 |
2 |
900 |
2010-01-14 |
5 |
1 |
450 |
2009-01-03 |
6 |
5 |
1000 |
2009-10-28 |
7 |
8 |
300 |
2007-10-01 |
8 |
8 |
300.5 |
2009-08-12 |
9 |
2 |
250 |
2009-03-16 |
10 |
2 |
500 |
2009-07-30 |
11 |
6 |
350 |
2009-02-12 |
12 |
5 |
300 |
2009-01-15 |
While there is only one instance of each employee in the employees table, there are potentially multiple instances of each in the bonuses table, depending on how many bonuses he or she has received. What we want to see is a list of names of employees’ IDs that do not appear in the bonuses table.
Solution 1: LEFT JOIN / IS NULL
One way to select values present in one table but missing in another is to use a combination of a Left Join with an “IS NULL” test. Here’s the syntax for that:
SELECT field list
FROM left_table
LEFT JOIN
right_table
ON right_table.id = left_table.id
WHERE right_table.id IS NULL
Plugging in our data produces the following Select query:
SELECT emp.name
FROM employees emp
LEFT JOIN
bonuses b
ON b.emp_id = emp.id
WHERE b.emp_id IS NULL;
Running the above statement confirms that Mary Parker is the only employee who has not received a bonus.
Solution 2: NOT IN
The second way to find non-matching records between tables is to use NOT in conjunction with the IN operator. The IN operator allows you to specify multiple values in a WHERE clause, much like a string of ORs chained together. It accepts either a comma-delimited list (value1, value2, etc…) or the results from another Select statement. The latter is what we will be using here.
The syntax for that form is as follows:
SELECT field list
FROM left_table
WHERE left_table.id NOT IN
(
SELECT right_table.id
FROM right_table
)
Here it is again with our specific criteria:
SELECT name
FROM employees
WHERE id NOT IN
(
SELECT emp_id
FROM bonuses
);
It also shows that Mary Parker is the only employee who has not received a bonus.
Solution 3: NOT EXISTS
The main difference between the IN and EXISTS predicate in subquery is the way in which the query gets executed.
- IN: The inner query is executed first and the list of values obtained in the result is used by the outer query. The inner query is executed only once.
- EXISTS: Each row from the outer query result is selected, and compared against the inner query. Hence, the inner query is executed as many times as there are rows in the outer query result.
Here is the syntax for the Not Exists Select statement:
SELECT field list
FROM left_table
WHERE NOT EXISTS
(
right_table.id
FROM right_table
WHERE right_table.id = left_table.id
)
We would plug in our criteria like so:
SELECT name
FROM employees emp
WHERE NOT EXISTS
(
SELECT emp_id
FROM bonuses b
WHERE b.emp_id = emp.id
);
Once again, poor Mary Parker is without a single bonus. This may not bode well for her next performance evaluation.
Bonus Solution: Group By with Union All
As promised, here is a fourth solution, using Group By with UNION ALL. The first step is to Union the two tables that we want to compare using Union ALL (the ALL retains duplicates). Then we group the result on the column that we want to compare. This will cause columns that appear in both inner SELECTS to have a count of two while unmatched records will only be counted once.
Here is the syntax for the Not Exists Select statement:
SELECT field list
FROM (
SELECT DISTINCT field list FROM left_table
UNION ALL
SELECT DISTINCT field list FROM right_table
) AS tbl
GROUP BY field list HAVING COUNT(*) = 1
Be sure to group by the ID field so that different employees with the same name won’t be treated as the same person:
SELECT name
FROM (
SELECT DISTINCT id, name FROM employees
UNION ALL
SELECT DISTINCT emp_id, ” FROM bonuses
) AS tbl
GROUP BY id HAVING COUNT(*) = 1;
You guessed it; Mary Parker is once again left out in the cold.
I have no specific statistics that show how fast this style performs but anecdotal data would suggest that it compares quite favourably.
Which Style Should I Use?
Because of how the MySQL optimizer handles EXISTS
, that method is about 30% less efficient than the others. If speed is your primary concern, then the LEFT JOIN / IS NULL and NOT IN styles are your best bets. Otherwise, you really can’t go wrong with whichever style suits you.