Tips for Simplifying Crosstab Query Statements
February 5, 2010
Rob Gravelle takes a complex SQL statement and shows you several ways to re-write it that both simplifies and shortens it, while still yielding the same results.
After the All About the Crosstab Query article was published, some astute readers pointed out that my SQL statement was more complex than it needed to be. Their suggestions sent me in search of a simplified statement that would yield the same results. There were a couple of ways to shorten and simplify the SQL in some cases, substantially. We will be reviewing the most effective solutions today, which include using the CASE statement within the Count() function and the specialized WITH ROLLUP GROUP BY Modifier.
The CASE Statement
The CASE statement is a type of conditional construct much like an IF ELSE. It evaluates a search condition and executes the corresponding SQL statement list, depending on which condition evaluates to true. If no search condition matches, the statement list in the ELSE clause is executed instead. Each statement list may consist of one or more statements. Multiple statements are enclosed between the BEGIN and END delimiters.
One Statement, Two Styles
There are two forms of CASE statement, depending on whether you want to evaluate an object - such as a column or variable - or the results of an expression. Here is the syntax for evaluating an object:
CASE case_value WHEN when_value THEN statement_list [WHEN when_value THEN statement_list] ... [ELSE statement_list] END CASE
Perhaps the simplest use for the CASE statement is to categorize row data into categories. In the following example, values in the term_number column are converted to their associated description so that the results are meaningful to the end client. Notice that there is no ELSE clause, because we never expect a value outside of the 1 to 4 range:
SELECT CASE term_number WHEN 1 THEN 'Fall' WHEN 2 THEN 'Winter' WHEN 3 THEN 'Spring' WHEN 4 THEN 'Summer' END AS 'Term' FROM class_schedule;
Another good time to use a CASE statement is to choose a column based on a variables value, such as a stored proc input parameter. Here, a language code is used to choose between an English and French description. The ELSE is used to good effect so that any code other than F will default to English:
SELECT CASE @language_cd WHEN 'F' THEN french_desc ELSE english_desc END AS 'type_desc' FROM vcode_value WHERE code_value_id = assistance_request_type_cd;
The other form of the CASE statement, which uses a search condition, is found most often because of its greater flexibility. Here is the syntax for it:
SELECT CASE WHEN search_condition THEN statement_list [WHEN search_condition THEN statement_list] ... [ELSE statement_list] END CASE
Some Examples using a Search Condition
One common use of the CASE statement is to handle Nulls or empty values. In this instance, a value of Not supplied is provided as a default value:
SELECT family_name, given_name, CASE WHEN state IS NULL OR state = '' THEN 'Not supplied' ELSE state END FROM clients
There are many times where either style can be utilized, but one is more readable and efficient to write than the other. For instance, we could easily replace the previous example using a search condition:
SELECT CASE WHEN @language_cd = 'F' THEN french_desc ELSE english_desc END AS 'type_desc' FROM vcode_value WHERE code_value_id = assistance_request_type_cd;
Both work equally well, but I personally find the search condition syntax to be slightly more readable.
If there is an easy rule of thumb that I can suggest for deciding between styles, it would be that the first one is best for evaluating the same object against multiple possible values, such as we did with the school terms. Hence:
SELECT CASE WHEN row_value = 'A' THEN 'VALUE A' WHEN row_value = 'B' THEN 'VALUE B' WHEN row_value = 'C' THEN 'VALUE C' WHEN row_value = 'N' THEN 'VALUE N' ... --is perhaps better expressed as SELECT CASE row_value WHEN 'A' THEN 'VALUE A' WHEN 'B' THEN 'VALUE B' WHEN 'C' THEN 'VALUE C' WHEN 'N' THEN 'VALUE N' ...
The latter version is definitely more succinct.
Supplying the Results of a CASE Statement to the Count Function
The Count() function is most commonly used with an asterisk (*), which tells it to count entire rows. Another usage is to pass in an individual field. Doing so will cause the function to count the number of non-NULL instances of that particular column. This is the key to using a CASE statement with Count(), as we can use it to select the FEE_NUMBER values where the REGION_CODE matches a value of 01 to 05. Here is the simplified SQL, without the final TOTALS row:
SELECT MONTHNAME(CREATION_DATE) AS 'Month', COUNT(CASE WHEN REGION_CODE ='01' THEN FEE_NUMBER END) AS 'REGION 1', COUNT(CASE WHEN REGION_CODE ='02' THEN FEE_NUMBER END) AS 'REGION 2', COUNT(CASE WHEN REGION_CODE ='03' THEN FEE_NUMBER END) AS 'REGION 3', COUNT(CASE WHEN REGION_CODE ='04' THEN FEE_NUMBER END) AS 'REGION 4', COUNT(CASE WHEN REGION_CODE ='05' THEN FEE_NUMBER END) AS 'REGION 5', COUNT(*) AS 'TOTAL' FROM TA_CASES CA WHERE YEAR(CREATION_DATE)=1998 GROUP BY MONTH(CREATION_DATE);
The WITH ROLLUP Modifier
Adding the WITH ROLLUP modifier to the GROUP BY clause will append an additional row to the result set which sums all columns. ROLLUP thus allows you to summarize data at multiple levels with a single query. The grand total super-aggregate lines can be identified by the Null values in the GROUPed BY fields. Hence the following query:
>SELECT year, SUM(profit) FROM sales GROUP BY year WITH ROLLUP;
...would produce something like the following:
+------+-------------+ | year | SUM(profit) | +------+-------------+ | 2000 | 4525 | | 2001 | 3010 | | NULL | 7535 | +------+-------------+
We can replace the Null with a more suitable value such as TOTAL, but it isnt easy, as the Null row is inserted late in query processing. Another challenge is presented by working with grouping on the output of date functions. For instance, grouping on the MONTHNAME() sorts the rows in alphabetical order, rather than chronological.
The answer to both these issues is a two pass approach. The first selects all of the fields that we want, plus the month number, for sorting. We can select from it by placing the code after the FROM of a second query. A CASE statement selects between the Month column and the TOTAL label:
SELECT CASE WHEN Month_Num IS NULL THEN 'TOTAL' ELSE Month END AS 'Month', REGION_1 AS 'REGION 1', REGION_2 AS 'REGION 2', REGION_3 AS 'REGION 3', REGION_4 AS 'REGION 4', REGION_5 AS 'REGION 5', TOTAL FROM (SELECT MONTH(CREATION_DATE) AS Month_Num, MONTHNAME(CREATION_DATE) AS Month, COUNT(CASE WHEN REGION_CODE ='01' THEN FEE_NUMBER END) AS REGION_1, COUNT(CASE WHEN REGION_CODE ='02' THEN FEE_NUMBER END) AS REGION_2, COUNT(CASE WHEN REGION_CODE ='03' THEN FEE_NUMBER END) AS REGION_3, COUNT(CASE WHEN REGION_CODE ='04' THEN FEE_NUMBER END) AS REGION_4, COUNT(CASE WHEN REGION_CODE ='05' THEN FEE_NUMBER END) AS REGION_5, COUNT(*) AS TOTAL FROM TA_CASES WHERE YEAR(CREATION_DATE)=1998 GROUP BY Month_Num WITH ROLLUP) AS CA;
The output of the above query is a complete crosstab, including chronological row sorting and totals:
Cutting down the size of the SQL will come in useful when generating the code within a stored proc. Remember that the query itself is really quite a simple one as it only hits one table, only pivots between two fields. As we shall soon see, not every crosstab query is as straightforward.