T-SQL Programming Part 11 – Using SQL Server’s GROUP BY clause to Summarize your Data

There are times when you want to have SQL Server return an aggregated result set, instead of a detailed result set. SQL Server has the GROUP BY clause that provides you a way to aggregate your SQL Server data. The GROUP BY clause allows you to group data on a single column, multiple columns, or even expressions. In this article I will be discussing how to use the GROUP by clause to summarize your data.

GROUP BY Using Single ColumnA

The GROUP BY clause is used to summarize data based on some grouping criteria. For the first example I am going to group data based on a single column from the AdventureWork2012 database table named Sales.SalesOrderDetail. This example and most of the rest of my examples use the AdventureWorks2012 database. If you want to follow along and run my code you can download the AdventureWorks2012 database from the following location: http://msftdbprodsamples.codeplex.com/releases/view/55330

Below is the code for my first example, which groups data based on the single column named CarrierTrackingNumber:

USE AdventureWorks2012;
GO
SELECT CarrierTrackingNumber
,SUM(LineTotal) AS SummarizedLineTotal
FROM AdventureWorks2012.Sales.SalesOrderDetail
GROUP BY CarrierTrackingNumber;

When I run this code I get 3,807 records returned. Here are the first 5 values returned from that larger record set:

CarrierTrackingNumber LineTotal
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
6E46-440A-B5 2133.170700
B501-448E-96 4085.409800
8551-4CDF-A1 72616.524200
B65C-4867-86 714.704300
99CE-4ADA-B1 16185.429200

In my example above, my GROUP BY clause controlled what column was used to aggregate the AdventureWorks2012.Sales.SalesOrderDetail data. In my example I summarize the data based on the CarrierTrackingNumber. When you group your data the only columns that are valid in the selection list are columns that can be aggregated, plus columns used on the GROUP BY clause. In my example I aggregated the LineTotal amount using the SUM function. For the aggregated value I set a column alias of SummarizedLineTotal.

If I wanted to bring back just a set of aggregated values based on CarrierTrackingNumber then I could constrain my query above with a WHERE clause, as I have done below:

USE AdventureWorks2012;
GO
SELECT CarrierTrackingNumber
,SUM(LineTotal) AS SummarizedLineTotal
FROM AdventureWorks2012.Sales.SalesOrderDetail
WHERE CarrierTrackingNumber = '48F2-4141-9A'
GROUP BY CarrierTrackingNumber;

Here I have added a WHERE constraint to my original query. I constrained my query to a single CarrierTrackingNumber value. When I run this code it brings back a single summarized LineTotal amount for the records that have a CarrierTrackingNumber value of ’48F2-4141-9A’. The WHERE constraint filters out rows prior to them being aggregated.

Grouping by Multiple Columns

There are times when you might want to group your data by more than a single column. Below is an example where I grouped some data by multiple columns.

SELECT D.ProductID
, H.OrderDate
, SUM(LineTotal) AS SummarizedLineTotal
FROM AdventureWorks2012.Sales.SalesOrderDetail D
JOIN AdventureWorks2012.Sales.SalesOrderHeader H
ON D.SalesOrderId = H.SalesOrderID
GROUP BY ProductID, OrderDate;

This query returned 26,878 rows. Here is the first part of the results from my query above.

ProductID OrderDate LineTotal
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
714 2008-05-21 00:00:00.000 99.980000
859 2007-11-03 00:00:00.000 48.980000
923 2007-11-23 00:00:00.000 14.970000
712 2007-12-22 00:00:00.000 62.930000
795 2007-10-14 00:00:00.000 2443.350000
950 2007-07-01 00:00:00.000 2462.304000
795 2007-11-06 00:00:00.000 2443.350000
877 2007-11-19 00:00:00.000 15.900000
713 2007-10-01 00:00:00.000 99.980000
860 2008-05-31 00:00:00.000 48.980000
961 2008-05-01 00:00:00.000 36242.120880

In this example the ProductID and OrderDate columns are used in the GROUP BY clause. SQL Server summarizes the LineTotal based on unique values of both ProductID and OrderDate, and produces the summarized column alias SummarizedLineTotal. If you review the output column values you can see when SQL Server groups the data is doesn’t return the result set in any particular order. If you need to have the aggregated values returned is sorted order you then need to have an ORDER BY clause, like I have done in the following code:

SELECT D.ProductID
, H.OrderDate
, SUM(LineTotal) AS SummarizedLineTotal
FROM AdventureWorks2012.Sales.SalesOrderDetail D
JOIN AdventureWorks2012.Sales.SalesOrderHeader H
ON D.SalesOrderId = H.SalesOrderID
GROUP BY ProductID, OrderDate
ORDER BY SummarizedLineTotal DESC;

Here I have ordered the results by the column alias SummorizedLineTotal. This column is the aggregated value that was derived by summing the LineTotal column based on the GROUP BY columns. I ordered the results in descending order based on the calculated SummorizedLineTotal value. If you run this code you will be able to see which ProductID and OrderDate had the highest summarized LineTotal amount.

Grouping Data that has NULL Values

There are times when you need to group data that has records that contains a NULL value. When you do this SQL Server summarize your data assuming all NULL values are the same. Let’s look at an example of this by review the following code:

CREATE TABLE NullGroupBy (OrderDate date, Amount Int);
INSERT INTO NullGroupBy values (NULL,100),
('10-30-2014',100),
('10-31-2014',100),
(NULL,100);
SELECT OrderDate, SUM(Amount) as TotalAmount
FROM NullGroupBy
GROUP BY OrderDate;
DROP TABLE NullGroupBy;
When I run this code I get the following output:
OrderDate TotalAmount
-- -- -- -- -- -- -- -- -- -- -
NULL 200
2014-10-30 100
2014-10-31 100

In my code here I first created and populated a table named NullGroupBy. In this table I placed four different rows. The first and last rows have a value of NULL from the OrderDate, and the other two columns have different OrderDate values. As you can see by reviewing the output above, SQL Server rolls-up the two rows that contain a NULL OrderDate into a single summarized row.

Using an Expression in the GROUP BY Clause

There are times when you don’t want to group by a specific column but on an expression. SQL Server also allows you to specify an expression for the columns in the GROUP BY clause, as I have down in the code below:

SELECT CONVERT(CHAR(7),H.OrderDate,120) AS [YYYY-MM]
, SUM(LineTotal) AS SummarizedLineTotal
FROM AdventureWorks2012.Sales.SalesOrderDetail D
JOIN AdventureWorks2012.Sales.SalesOrderHeader H
ON D.SalesOrderId = H.SalesOrderID
GROUP BY CONVERT(CHAR(7),H.OrderDate,120)
ORDER BY SummarizedLineTotal DESC;

This code groups the data by the year and month of the OrderDate. By using the expression CONVERT(CHAR(7),H.OrderDate,120) I told SQL Server to take the first seven characters of the OrderDate ODBC canonical date format, which is the YYYY-MM portion of the OrderDate. Being able to summarize my data based on this expression allowed me to determine the total SummarizeLineTotal value for a given year and month. By using an expression in the GROUP BY clause and ordering based on the LineTotal amount I can determine which year and month had the greatest and least SummarizeLineTotal amount.

Filtering Data Using HAVING Clause

Another valuable clause that can be used with the GROUP BY clause is the HAVING clause. Using the HAVING clause allows you to filter out rows that meet the HAVING clause expression value. When I used the WHERE clause above it filtered out rows prior to them being aggregated. The HAVING clause allows you to filter out the aggregated rows based on a criteria. To better understand review the code below:

SELECT D.ProductID
, H.OrderDate
, SUM(LineTotal) AS SummarizedLineTotal
FROM AdventureWorks2012.Sales.SalesOrderDetail D
JOIN AdventureWorks2012.Sales.SalesOrderHeader H
ON D.SalesOrderId = H.SalesOrderID
GROUP BY ProductID, OrderDate
HAVING SUM(LineTotal) > 200000
ORDER BY SummarizedLineTotal DESC;

In this code my HAVING clause specifies “SUM(LineTotal) > 200000”. This HAVING clause keeps only those aggregated LineTotal values (column alias SummarizedLineTotal) that summarize to be greater than 200,000. By using the HAVING clause my query returned only a single row that had an aggregated SummarizedLineTotal that was greater than 200,000. The HAVING clause allows SQL Server to return only those aggregated rows that meet to the HAVING clause criteria.

Summary

Many applications require that data be rolled up prior to being presented. Using the GROUP BY clause is the mechanism that SQL Server provides to summarize data. The GROUP BY clause allows you to also filter out summarized data by using the HAVING clause. Hopefully next time you need to summarize some data you’ll be able to handle it with ease.

See all articles by Greg Larsen

Gregory Larsen
Gregory Larsen
Gregory A. Larsen is a DBA at Washington State Department of Health (DOH). Greg is responsible for maintaining SQL Server and other database management software. Greg works with customers and developers to design and implement database changes, and solve database/application related problems. Greg builds homegrown solutions to simplify and streamline common database management tasks, such as capacity management.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles