SELECT DISTINCT: A SQL Case StudyNovember 23, 2004 By Tom Copeland
There is an understanding in the database world that using a "SELECT DISTINCT" SQL query is not a good idea, because it is essentially getting duplicate rows out of the database and then discarding them. Usually it is better to rearrange the "WHERE" clause in the query to only get the rows you need. One of these cases came up in a 3-way join query in a PostgreSQL database I administer. This article gives a step-by-step "case study" of analyzing a query in PostgreSQL and how to ensure that your SQL rewrite is actually paying off. Hunting for DISTINCTsA After reading an
excellent article by Stephane Faroult on this topic, I went rummaging
around amongst the SQL queries inside the open source team collaboration tool GForge. Since I help administer a GForge server (RubyForge) that hosts over 400 projects and
1100 users, I am always happy to run across ways to help improve its SQL
efficiency. Hoping to apply my newfound knowledge, I used the UNIX tool $ egrep -r DISTINCT * common/search/ArtifactSearchQuery.class: $sql = 'SELECT DISTINCT ON (a.group_artifact_id,a.artifact_id) a.group_artifact_id,a.artifact_id,a.summary,a.open_date,users.realname ' common/search/ArtifactSearchQuery.class: $sql = 'SELECT DISTINCT ON (a.group_artifact_id,a.artifact_id) a.group_artifact_id, a.artifact_id ' [...many more results skipped...] A likely target turned up quickly in one of the hourly cron jobs. GForge has
a "mass mailing" component which is used for sending notices to
various groups of people (project administrators, site users, etc.) when
downtime is being scheduled or a new service is announced. The mass mailer
script was using a The query revealedHere is the SQL query that looked like a likely target. GForge aficionados will notice that I have selected just one of our several templatized variations on this query, and that for the purposes of this article I dropped the "paging" functionality that is part of the query since it is irrelevant here. At any rate, this query selects the administrators of all the active projects: SELECT DISTINCT u.user_id, u.user_name, u.realname, u.email, u.confirm_hash FROM users u, user_group ug, groups g WHERE u.status='A' AND u.user_id=ug.user_id AND ug.admin_flags='A' AND g.status='A' AND g.group_id=ug.group_id ORDER BY u.user_id; It looks a bit complicated since it is doing a three-way join, but it is
really fairly straightforward. GForge has a Following the standard SELECT u.user_id, u.user_name, u.realname, u.email FROM users u WHERE u.status = 'A' AND u.user_id IN ( SELECT ug.user_id FROM user_group ug, groups g WHERE ug.admin_flags='A' AND g.status = 'A' AND g.group_id = ug.group_id ) ORDER BY u.user_id; So now there is a subquery that gets the At this point, we want to ensure that the queries return equivalent result
sets. It is safe for us to run this on a live data set since this is a One way to test query equivalence is to pipe the output of both commands to
a file and use the UNIX $ ./distinct.sh > res1.txt $ ./subquery.sh > res2.txt $ diff res1.txt res2.txt No difference at all; that is what we want to see! Now we know that the queries are functionally equivalent and we can move on to the fun part - performance checking. Analyzing with EXPLAINWhen tuning queries like this, it is best to check the before and after
performance of each query to ensure you are actually getting the expected
performance gains. GForge runs on the excellent open source database PostgreSQL, which has a fine query analysis
tool called [tom@rubyforge tom]$ cat check.sh echo "DISTINCT" psql -U gforge gforge -c "explain analyze We can see that the new version is quite a bit faster than the old version. However,
was this just a fluke? Let's run it a few times to make sure. An easy way to do
this is to use the UNIX
The screen clears and each query is run once every second. Since the output is the same each time, the time in milliseconds stays in the same place on the screen, so we can watch it for a while to make sure the results are consistent. On RubyForge the query times did not vary by more than a millisecond or so for 50-60 repetitions, so it seems like a stable speed increase. Now the obvious thing to do is to contribute this change back to the GForge project so everyone can benefit. Best of all, since we analyzed the change, we can present a patch with solid evidence that it will in fact improve performance. ConclusionsIn summary, a SQL query that works can often be made more efficient. There
are general principles, which can be followed to attempt to increase query
performance, although each individual case should be checked to ensure it is
not an exception to the rule. Most databases have utilities to measure query
efficiency; with PostgreSQL, the CreditsThanks to Stephane Faroult, who wrote the article that prompted this search for low-hanging fruit. Thanks also to Merlin Moncure who reviewed this article and made many helpful suggestions. |