When I first took on the role of reports
designer for our systems, my first instinct was to try fetching all of the
required fields in one huge SQL statement. I soon learned that there are
better and easier ways to structure a query that will minimize the complexity
while raising our confidence in the data returned. I’ll be sharing a few
techniques here that will help simplify the writing of complex queries.
Part 1: Database Design
Structuring the database for easy querying
should always be one of your primary design goals. Unfortunately, if your IT
shop is anything like mine, data extraction seems to be the last thing on
anybody’s mind when the database is being designed. Rather, all the decisions
are based on how to best normalize the data structures and facilitate saving
the data from systems that interface with the database. Under those
circumstances, it’s only after the database design has long been finalized and
the tables have been populated with data that you realize the price that you
are now paying for leaving data extraction until the last task. Consequently,
query designers have to perform back flips to present the data in a meaningful
way.
Here are a few database design suggestions
for making querying a little easier:
Identify Every Entity by a Unique Numeric Code
Examine the following table designs and
notice that each has a numeric key code as the first column. These codes act
as a unique identifier for each row (distinct entity) in the table. While not
as intuitive to humans as text, numbers are a lot quicker for the database to
index and link related values together:
Employee Table
emp_cd last_name first_name phone_numberDependent Table
dependent_cd last_name first_name date_of_birthEmployee Dependent Relationship Table
emp_cd relationship_type_code dependent_cdRelationship Type Table
relationship_type_cd english_desc english_short_desc
To define a relationship for the Employee and
Dependent tables, we would need two additional ones: the Relationship table and
one containing the Relationship Types. You
could store the relationship information in either or both tables, but that may lead
to problems. First, if you only include relationship in one of the
tables, you won’t be able to get back to the other one! To illustrate, if you
were to add an emp_cd column to the Dependent table, you would be easily able
to retrieve the guardians of the dependents, but you would be at a disadvantage
to locate the dependents of a particular guardian as it would not contain any
information about related dependants. The second issue is that including the
foreign key in the table itself can cause a lot of duplication since the only
different field would be the foreign key. When you consider that the Employee table
might contain a dozen fields or more, that could quickly lead to Gigabytes of
wasted storage space, not to mention an overburdened server! For this reason,
it’s preferable to create a separate table containing only code values that
define the relationship.
The Relationship Type table below is what is known as a code
or lookup table. This is where common (shared ) information is stored. You could think of this table as a kind of data
dictionary. Some of the advantages to using code tables include:
- allowing us to use numeric code values in the relationship table
- promoting easy maintainability
-
providing a centralized place to store common
data
Here are the same four tables with
some sample data:
Employee Table
emp_cd last_name first_name phone_number
1 Smith Roger 555-1234
2 Blattel Mary 555-8877
3 Mercer Steven 555-9980Employee Dependent Relationship Table
emp_cd relationship_type_code dependent_cd
3 1 1
3 1 2
1 3 3Dependent Table
dependent_cd last_name first_name date_of_birth
1 Mercer Joey Feb 12, 1989
2 Mercer Cindy Aug 4, 1987
3 Smith Amy Dec 1, 1985Relationship Type Table
relationship_type_cd english_desc english_short_desc
1 father dad
2 mother mom
3 grand-father grandpa
4 grand-mother grandma
Consider Including Text Keys in Addition to Codes
After all that talk about using numeric
codes, I’m now going to encourage you to use text descriptor keys in your
queries! While numeric codes improve database retrieval performance, they are equally
detrimental to SQL development, as they hamper code readability. Take the
following query:
select english_desc
from vcode_table_data
where code_table_name = ‘routing_point’
and code_value_name = ‘routing_point_usa_port_of_entry’
Even without knowing anything about the database contents,
we can easily ascertain that the above statement is retrieving the description
for the USA Port of Entry from the Routing Point code table.
The descriptor fields described here should not be confused
with data fields, which contain values. Never use a data field to link
tables. Clients are notorious for changing the names of things and adding
alternate descriptions. In the Relationship Type table above, the description
fields look harmless enough. It may seem almost pointless to add another field
that duplicates the “father” and “mother” values. But were you to bend the
rules and use the english_desc field in your query to join tables instead, you
can be assured that the following week, a client will decide to capitalize the
descriptions and do so without your knowledge. Then, before you know it,
another client will complain that reports for a high level politician are
failing! Perhaps by the vivid details, you can tell that I’ve been there before.
It bears repeating that you should never use description fields to link tables,
and by “should” I mean “don’t”!
Part 2: Query Syntax
Use Table Aliases
The use of table aliases in queries is a hotly debated topic
among database administrators, but most agree that they do make complex queries
more readable. Which would you prefer to read: dozens of "TT."s or
"dbo.tblTrabeculectomyTechniques"? I thought so!
Beyond the readability issue, there are times that you must
use table aliases. The most common is where multiple referenced tables contain
the same column:
SELECT *
FROM inventory inv,
part p
WHERE inv.prod_id = 183 AND
inv.prod_id = p.prod_id;
Another, less common, use for table aliases are powerful for
complex queries that need to refer to the same table twice. A reason to access
a table twice would be to find customers who share the same surname:
SELECT c1.cust_id,
c2.cust_id
FROM customer c1,
customer c2
WHERE c1.surname = c2.surname AND
c1.cust_id != c2.cust_id;
The last line is important and makes sure that we’re not
pointing back to the same person.
Avoid Table Scans by Properly Utilizing Index Fields
Indexes are used to find rows with specific column values
quickly. Without an index, MySQL has to scan every row in the table to find the
relevant rows. If a table has 1,000 rows, using an index is at least 100
times faster than reading sequentially.
To properly utilize an indexed column, don’t include it
within an expression that refers to only part of the column value. Instead, use
the entire column for comparison. For example, one way to select rows
containing date values from the year 1986 and up is as follows:
SELECT * FROM t WHERE YEAR(d) >= 1986;
In this case, the value of YEAR(d) must be evaluated for
every row in the table, so the index cannot be used. Instead, write the query
like this:
SELECT * FROM t WHERE YEAR(d) >= ‘1986-01-01’;
In the rewritten expression, the
indexed column stands by itself on one side of the comparison and MySQL can
apply the index to optimize the query.
I hope that the tips presented here today will steer you in
the right direction. This list is by no means exhaustive. In fact, it’s only
really the tip of the iceberg. We’ll be returning to this subject at a later
date, to supplement this list.