Tips for Managing Complex Queries
September 11, 2009
When I first took on the role of reports designer for our systems, my first instinct was to try fetching all of the required fields in one huge SQL statement. I soon learned that there are better and easier ways to structure a query that will minimize the complexity while raising our confidence in the data returned. Ill be sharing a few techniques here that will help simplify the writing of complex queries.
Part 1: Database Design
Structuring the database for easy querying should always be one of your primary design goals. Unfortunately, if your IT shop is anything like mine, data extraction seems to be the last thing on anybodys mind when the database is being designed. Rather, all the decisions are based on how to best normalize the data structures and facilitate saving the data from systems that interface with the database. Under those circumstances, its only after the database design has long been finalized and the tables have been populated with data that you realize the price that you are now paying for leaving data extraction until the last task. Consequently, query designers have to perform back flips to present the data in a meaningful way.
Here are a few database design suggestions for making querying a little easier:
Identify Every Entity by a Unique Numeric Code
Examine the following table designs and notice that each has a numeric key code as the first column. These codes act as a unique identifier for each row (distinct entity) in the table. While not as intuitive to humans as text, numbers are a lot quicker for the database to index and link related values together:
Employee Table emp_cd last_name first_name phone_number Dependent Table dependent_cd last_name first_name date_of_birth Employee Dependent Relationship Table emp_cd relationship_type_code dependent_cd Relationship Type Table relationship_type_cd english_desc english_short_desc
To define a relationship for the Employee and Dependent tables, we would need two additional ones: the Relationship table and one containing the Relationship Types. You could store the relationship information in either or both tables, but that may lead to problems. First, if you only include relationship in one of the tables, you wont be able to get back to the other one! To illustrate, if you were to add an emp_cd column to the Dependent table, you would be easily able to retrieve the guardians of the dependents, but you would be at a disadvantage to locate the dependents of a particular guardian as it would not contain any information about related dependants. The second issue is that including the foreign key in the table itself can cause a lot of duplication since the only different field would be the foreign key. When you consider that the Employee table might contain a dozen fields or more, that could quickly lead to Gigabytes of wasted storage space, not to mention an overburdened server! For this reason, its preferable to create a separate table containing only code values that define the relationship.
The Relationship Type table below is what is known as a code or lookup table. This is where common (shared ) information is stored. You could think of this table as a kind of data dictionary. Some of the advantages to using code tables include:
Here are the same four tables with some sample data:
Employee Table emp_cd last_name first_name phone_number 1 Smith Roger 555-1234 2 Blattel Mary 555-8877 3 Mercer Steven 555-9980 Employee Dependent Relationship Table emp_cd relationship_type_code dependent_cd 3 1 1 3 1 2 1 3 3 Dependent Table dependent_cd last_name first_name date_of_birth 1 Mercer Joey Feb 12, 1989 2 Mercer Cindy Aug 4, 1987 3 Smith Amy Dec 1, 1985 Relationship Type Table relationship_type_cd english_desc english_short_desc 1 father dad 2 mother mom 3 grand-father grandpa 4 grand-mother grandma
Consider Including Text Keys in Addition to Codes
After all that talk about using numeric codes, Im now going to encourage you to use text descriptor keys in your queries! While numeric codes improve database retrieval performance, they are equally detrimental to SQL development, as they hamper code readability. Take the following query:
select english_desc from vcode_table_data where code_table_name = 'routing_point' and code_value_name = 'routing_point_usa_port_of_entry'
Even without knowing anything about the database contents, we can easily ascertain that the above statement is retrieving the description for the USA Port of Entry from the Routing Point code table.
The descriptor fields described here should not be confused with data fields, which contain values. Never use a data field to link tables. Clients are notorious for changing the names of things and adding alternate descriptions. In the Relationship Type table above, the description fields look harmless enough. It may seem almost pointless to add another field that duplicates the father and mother values. But were you to bend the rules and use the english_desc field in your query to join tables instead, you can be assured that the following week, a client will decide to capitalize the descriptions and do so without your knowledge. Then, before you know it, another client will complain that reports for a high level politician are failing! Perhaps by the vivid details, you can tell that Ive been there before. It bears repeating that you should never use description fields to link tables, and by should I mean dont!
Part 2: Query Syntax
Use Table Aliases
The use of table aliases in queries is a hotly debated topic among database administrators, but most agree that they do make complex queries more readable. Which would you prefer to read: dozens of "TT."s or "dbo.tblTrabeculectomyTechniques"? I thought so!
Beyond the readability issue, there are times that you must use table aliases. The most common is where multiple referenced tables contain the same column:
SELECT * FROM inventory inv, part p WHERE inv.prod_id = 183 AND inv.prod_id = p.prod_id;
Another, less common, use for table aliases are powerful for complex queries that need to refer to the same table twice. A reason to access a table twice would be to find customers who share the same surname:
SELECT c1.cust_id, c2.cust_id FROM customer c1, customer c2 WHERE c1.surname = c2.surname AND c1.cust_id != c2.cust_id;
The last line is important and makes sure that were not pointing back to the same person.
Avoid Table Scans by Properly Utilizing Index Fields
Indexes are used to find rows with specific column values quickly. Without an index, MySQL has to scan every row in the table to find the relevant rows. If a table has 1,000 rows, using an index is at least 100 times faster than reading sequentially.
To properly utilize an indexed column, don't include it within an expression that refers to only part of the column value. Instead, use the entire column for comparison. For example, one way to select rows containing date values from the year 1986 and up is as follows:
SELECT * FROM t WHERE YEAR(d) >= 1986;
In this case, the value of YEAR(d) must be evaluated for every row in the table, so the index cannot be used. Instead, write the query like this:
SELECT * FROM t WHERE YEAR(d) >= '1986-01-01';
In the rewritten expression, the indexed column stands by itself on one side of the comparison and MySQL can apply the index to optimize the query.
I hope that the tips presented here today will steer you in the right direction. This list is by no means exhaustive. In fact, its only really the tip of the iceberg. Well be returning to this subject at a later date, to supplement this list.