There are times when you want to have SQL Server return an aggregated result set, instead of a detailed result set. SQL Server has the GROUP BY clause that provides you a way to aggregate your SQL Server data. The GROUP BY clause allows you to group data on a single column, multiple columns, or even expressions. In this article I will be discussing how to use the GROUP by clause to summarize your data.
GROUP BY Using Single ColumnA
The GROUP BY clause is used to summarize data based on some grouping criteria. For the first example I am going to group data based on a single column from the AdventureWork2012 database table named Sales.SalesOrderDetail. This example and most of the rest of my examples use the AdventureWorks2012 database. If you want to follow along and run my code you can download the AdventureWorks2012 database from the following location: http://msftdbprodsamples.codeplex.com/releases/view/55330
Below is the code for my first example, which groups data based on the single column named CarrierTrackingNumber:
USE AdventureWorks2012; GO SELECT CarrierTrackingNumber ,SUM(LineTotal) AS SummarizedLineTotal FROM AdventureWorks2012.Sales.SalesOrderDetail GROUP BY CarrierTrackingNumber;
When I run this code I get 3,807 records returned. Here are the first 5 values returned from that larger record set:
CarrierTrackingNumber LineTotal -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 6E46-440A-B5 2133.170700 B501-448E-96 4085.409800 8551-4CDF-A1 72616.524200 B65C-4867-86 714.704300 99CE-4ADA-B1 16185.429200
In my example above, my GROUP BY clause controlled what column was used to aggregate the AdventureWorks2012.Sales.SalesOrderDetail data. In my example I summarize the data based on the CarrierTrackingNumber. When you group your data the only columns that are valid in the selection list are columns that can be aggregated, plus columns used on the GROUP BY clause. In my example I aggregated the LineTotal amount using the SUM function. For the aggregated value I set a column alias of SummarizedLineTotal.
If I wanted to bring back just a set of aggregated values based on CarrierTrackingNumber then I could constrain my query above with a WHERE clause, as I have done below:
USE AdventureWorks2012; GO SELECT CarrierTrackingNumber ,SUM(LineTotal) AS SummarizedLineTotal FROM AdventureWorks2012.Sales.SalesOrderDetail WHERE CarrierTrackingNumber = '48F2-4141-9A' GROUP BY CarrierTrackingNumber;
Here I have added a WHERE constraint to my original query. I constrained my query to a single CarrierTrackingNumber value. When I run this code it brings back a single summarized LineTotal amount for the records that have a CarrierTrackingNumber value of ’48F2-4141-9A’. The WHERE constraint filters out rows prior to them being aggregated.
Grouping by Multiple Columns
There are times when you might want to group your data by more than a single column. Below is an example where I grouped some data by multiple columns.
SELECT D.ProductID , H.OrderDate , SUM(LineTotal) AS SummarizedLineTotal FROM AdventureWorks2012.Sales.SalesOrderDetail D JOIN AdventureWorks2012.Sales.SalesOrderHeader H ON D.SalesOrderId = H.SalesOrderID GROUP BY ProductID, OrderDate;
This query returned 26,878 rows. Here is the first part of the results from my query above.
ProductID OrderDate LineTotal -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- - 714 2008-05-21 00:00:00.000 99.980000 859 2007-11-03 00:00:00.000 48.980000 923 2007-11-23 00:00:00.000 14.970000 712 2007-12-22 00:00:00.000 62.930000 795 2007-10-14 00:00:00.000 2443.350000 950 2007-07-01 00:00:00.000 2462.304000 795 2007-11-06 00:00:00.000 2443.350000 877 2007-11-19 00:00:00.000 15.900000 713 2007-10-01 00:00:00.000 99.980000 860 2008-05-31 00:00:00.000 48.980000 961 2008-05-01 00:00:00.000 36242.120880
In this example the ProductID and OrderDate columns are used in the GROUP BY clause. SQL Server summarizes the LineTotal based on unique values of both ProductID and OrderDate, and produces the summarized column alias SummarizedLineTotal. If you review the output column values you can see when SQL Server groups the data is doesn’t return the result set in any particular order. If you need to have the aggregated values returned is sorted order you then need to have an ORDER BY clause, like I have done in the following code:
SELECT D.ProductID , H.OrderDate , SUM(LineTotal) AS SummarizedLineTotal FROM AdventureWorks2012.Sales.SalesOrderDetail D JOIN AdventureWorks2012.Sales.SalesOrderHeader H ON D.SalesOrderId = H.SalesOrderID GROUP BY ProductID, OrderDate ORDER BY SummarizedLineTotal DESC;
Here I have ordered the results by the column alias SummorizedLineTotal. This column is the aggregated value that was derived by summing the LineTotal column based on the GROUP BY columns. I ordered the results in descending order based on the calculated SummorizedLineTotal value. If you run this code you will be able to see which ProductID and OrderDate had the highest summarized LineTotal amount.
Grouping Data that has NULL Values
There are times when you need to group data that has records that contains a NULL value. When you do this SQL Server summarize your data assuming all NULL values are the same. Let’s look at an example of this by review the following code:
CREATE TABLE NullGroupBy (OrderDate date, Amount Int); INSERT INTO NullGroupBy values (NULL,100), ('10-30-2014',100), ('10-31-2014',100), (NULL,100); SELECT OrderDate, SUM(Amount) as TotalAmount FROM NullGroupBy GROUP BY OrderDate; DROP TABLE NullGroupBy; When I run this code I get the following output: OrderDate TotalAmount -- -- -- -- -- -- -- -- -- -- - NULL 200 2014-10-30 100 2014-10-31 100
In my code here I first created and populated a table named NullGroupBy. In this table I placed four different rows. The first and last rows have a value of NULL from the OrderDate, and the other two columns have different OrderDate values. As you can see by reviewing the output above, SQL Server rolls-up the two rows that contain a NULL OrderDate into a single summarized row.
Using an Expression in the GROUP BY Clause
There are times when you don’t want to group by a specific column but on an expression. SQL Server also allows you to specify an expression for the columns in the GROUP BY clause, as I have down in the code below:
SELECT CONVERT(CHAR(7),H.OrderDate,120) AS [YYYY-MM] , SUM(LineTotal) AS SummarizedLineTotal FROM AdventureWorks2012.Sales.SalesOrderDetail D JOIN AdventureWorks2012.Sales.SalesOrderHeader H ON D.SalesOrderId = H.SalesOrderID GROUP BY CONVERT(CHAR(7),H.OrderDate,120) ORDER BY SummarizedLineTotal DESC;
This code groups the data by the year and month of the OrderDate. By using the expression CONVERT(CHAR(7),H.OrderDate,120) I told SQL Server to take the first seven characters of the OrderDate ODBC canonical date format, which is the YYYY-MM portion of the OrderDate. Being able to summarize my data based on this expression allowed me to determine the total SummarizeLineTotal value for a given year and month. By using an expression in the GROUP BY clause and ordering based on the LineTotal amount I can determine which year and month had the greatest and least SummarizeLineTotal amount.
Filtering Data Using HAVING Clause
Another valuable clause that can be used with the GROUP BY clause is the HAVING clause. Using the HAVING clause allows you to filter out rows that meet the HAVING clause expression value. When I used the WHERE clause above it filtered out rows prior to them being aggregated. The HAVING clause allows you to filter out the aggregated rows based on a criteria. To better understand review the code below:
SELECT D.ProductID , H.OrderDate , SUM(LineTotal) AS SummarizedLineTotal FROM AdventureWorks2012.Sales.SalesOrderDetail D JOIN AdventureWorks2012.Sales.SalesOrderHeader H ON D.SalesOrderId = H.SalesOrderID GROUP BY ProductID, OrderDate HAVING SUM(LineTotal) > 200000 ORDER BY SummarizedLineTotal DESC;
In this code my HAVING clause specifies “SUM(LineTotal) > 200000”. This HAVING clause keeps only those aggregated LineTotal values (column alias SummarizedLineTotal) that summarize to be greater than 200,000. By using the HAVING clause my query returned only a single row that had an aggregated SummarizedLineTotal that was greater than 200,000. The HAVING clause allows SQL Server to return only those aggregated rows that meet to the HAVING clause criteria.
Many applications require that data be rolled up prior to being presented. Using the GROUP BY clause is the mechanism that SQL Server provides to summarize data. The GROUP BY clause allows you to also filter out summarized data by using the HAVING clause. Hopefully next time you need to summarize some data you’ll be able to handle it with ease.