Use Derived Tables in Your MySQL Queries to Improve Database Performance
December 3, 2010
There has been much debate as to whether derived tables, views, or temporary tables are actually faster in terms of database performance. The fact is that there is no simple answer, as different queries require different optimizations. Rob Gravelle presents and evaluates some uses for derived tables that highlight their value in many applications.
In version 4.1, MySQL added support for derived tables (also referred to as inline views). In very basic terms, a derived table is a virtual table thats returned from a SELECT statement. This concept is similar to temporary tables, but using derived tables in your SELECT statements is much simpler because they dont require all the steps that temporary tables do. There has been much debate as to whether derived tables, views, or temporary tables are actually faster in terms of performance. Supporters of derived tables cite that they allow to you to replace the code required for managing a temporary table with a single command, which can be faster than the series it replaces. The fact is that there is no simple answer, as different queries require different optimizations; each has its place in the database administrators arsenal. Today, Im going to present and evaluate some uses for derived tables that highlight their value in many applications.
A lot of people use the terms derived tables and subqueries interchangeably. Not surprising, considering that a subquery is defined as any query that appears within another SQL command. These can be further broken down into stand-alone sub-queries, which are those that you can run independent of the command that contains it, and correlated, which relate to fields from the containing command. The most common use for subqueries is filtering data in the WHERE clause of a SQL command or in the column list. Heres a subquery that retrieves a list of all customers who placed orders in the specified year. The main query uses that list to find the reverse - those who didn't order in that year:
This statement fetches basic client information. A subquery performs a lookup on another table to retrieve the birth_country description:
SELECT itt.family_nme, itt.given_nme, itt.birth_dt, (SELECT english_desc FROM vcode_value WHERE code_value_id = itt.birth_country_cd) AS 'birth_country' FROM identity_tombstone_id AS itt;
The above statement can also be said to contain a correlated subquery because it refers back to the outer main query.
Compare the previous two queries to the following one:
SELECT u.last_name, a.state FROM User AS u, (SELECT user_id, state FROM Address) AS a WHERE u.id = a.user_id;
The above query selects directly from the derived table created by the subquery. Of course, in the simple example above, you could just as easily define the join without the derived table, but as well see in the next section, there are other, more complex criteria, which call for the use of one or more derived tables.
When to Use Derived Tables
From my experience, the best candidates for derived tables are queries that incorporate some sort of grouping. In other words, those that include a GROUP BY clause. Consider the following tables:
Now well write a SELECT statement to generate a report that tallies the total number of bonuses that each employee received last year:
SELECT e.name, e.salary, COUNT(b.bonus_id) AS TotalBonuses FROM employees e LEFT OUTER JOIN bonuses b ON e.id = b.emp_id WHERE YEAR(b.award_date) = 2009 GROUP BY e.id;
Our 2009 bonus report looks quite good, except that employees who did not receive any bonuses aren't being included (no Mary Parker). That may be problematic, because it would be useful to be able to identify employees who may be underperforming.
The reason that the above query doesnt return all of the employees has to do with the way that the WHERE clause interacts with the GROUP BY clause. Using a derived table, we can reduce the bonuses table to a snapshot of itself containing only data from the year 2009.
SELECT e.name, e.salary, COUNT(b.bonus_id) AS 'Total Bonuses' FROM employees e LEFT OUTER JOIN (SELECT emp_id, bonus_id FROM bonuses WHERE YEAR(award_date) = 2009) AS b ON e.id = b.emp_id GROUP BY e.id;
The elusive Mary Parker is now visible.
Another use for derived tables is to replace correlated subqueries with joins. The following statement is supposed to return a more detailed bonus report for the months of January, February, and March of last year.
select emp.name, emp.salary, (select sum(amount) from bonuses where emp_id = emp.id and award_date between '2009-1-1' and '2009-1-31') jan_09_points, (select sum(amount) from bonuses where emp_id = emp.id and award_date between '2009-2-1' and '2009-2-28') feb_09_points, (select sum(amount) from bonuses where emp_id = emp.id and award_date between '2009-3-1' and '2009-3-31') mar_09_points from employees emp order by emp.name;
Unfortunately, the writer of the query chose to use subqueries in the field list. That isnt a great idea because the subqueries will be executed for every row in the employees table! For instance, if we have 1000 employees, then this statement will execute 3001 queries: one query on the employees table plus one thousand sum queries on each of the subqueries on the bonuses table). A better way to write this query is to use derived tables.
select emp_id, emp_name, emp_salary, jan_09_points, feb_09_points, mar_09_points from ( select emp.id as emp_id, emp.name as emp_name, emp.salary as emp_salary, sum(CASE WHEN month(award_date) = 1 THEN b.amount END) AS 'jan_09_points', sum(CASE WHEN month(award_date) = 2 THEN b.amount END) AS 'feb_09_points', sum(CASE WHEN month(award_date) = 3 THEN b.amount END) AS 'mar_09_points' from employees emp LEFT OUTER JOIN (SELECT * FROM bonuses WHERE YEAR(award_date) = 2009) AS b ON emp.id = b.emp_id group by emp.id) as all_employees order by emp_name;
Ultimately, the decision of whether to use derived tables, views, or temporary tables in your queries and stored procedures rests with you. The best way to be confident in your choice is to experiment with several approaches and time them with different volumes. Derived tables just might turn out to be your best choice in many cases.