SQL Server doesn’t just serve as a storage engine; it’s also a powerful analytical engine. With the new analytic enhancements in SQL Server Denali CTP3, SQL Server is taking data analysis in SQL Server to a new level. If you are a SQL Server developer who performs data analysis and research every day, you will certainly be excited to see the new analytic enhancements in SQL Server Denali CTP3.
The OVER clause that was introduced in SQL Server 2005 has been enhanced to support moving average or cumulative total. New statistical distribution functions have been introduced, such as CUME_DIST to calculate cumulative distribution. LEAD and LAG functions have also been added to provide column-based analysis. With the addition of the new Columnstore index, data analysis on a subset of columns, such as time-series analysis, on tables of hundreds of GBs or TBs, can be done much faster. I will address these enhancements in this series of articles.
To demonstrate the new features, I set up a database called Equity with a table EquityPrice. The table contains the historical stock prices from Jan 4, 2010 to Aug 5, 2011, of four stock tickers, AAPL, CSCO, HPQ and IBM (data source: Yahoo! Finance).
IF OBJECT_ID('dbo.EquityPrice', 'U') IS NOT NULL
DROP TABLE [dbo].[EquityPrice]
CREATE TABLE [dbo].[EquityPrice](
[Ticker] varchar(10) NOT NULL,
[TradeDate] date NOT NULL,
[Open] decimal(9,2) NULL,
[High] decimal(9,2) NULL,
[Low] decimal(9,2) NULL,
[Close] decimal(9,2) NULL,
[Volume] bigint NULL,
[AdjClose] decimal(9,2) NULL,
CONSTRAINT PK_EquityPrice Primary key ([Ticker], [TradeDate])
) ON [PRIMARY]
You can download the backup of the database here.
OVER clause enhancement - ROW or RANGE clause
In SQL Server 2008 and 2005, if you would like to calculate moving averages, such as a 20-day moving average closing price of a ticker symbol, the best way would be to use the ROW_NUMBER() function as below.
; WITH CTE_Price
( SELECT Ticker, TradeDate, [Close], ROW_NUMBER() OVER (PARTITION BY Ticker ORDER BY TradeDate) as rowNum
SELECT p.Ticker, p.TradeDate, AVG(pma.[Close]) as Close20MA
FROM CTE_Price p
JOIN CTE_Price pma
ON pma.Ticker = p.Ticker and pma.rowNum between p.rowNum - 19 and p.rowNum
GROUP BY p.Ticker, p.TradeDate
ORDER BY p.Ticker, p.TradeDate
The OVER clause partitions the table by ticker and sorts the rows in each partition in the ascending order of TradeDate. The ROW_NUMBER() function assigns a sequential number to each row within a partition, starting at 1. The result set represented by the common table expression CTE_Price is then joined with itself. Each row is matched with 20 rows whose row numbers are between the current row number minus 19 and the current row number. The average close price of the 20 rows gives you the 20-day moving average price. Here is the result.
If you look at the query execution plan as shown below, you can see the EqityPrice table is scanned (Clustered Index Scan operator) twice, partitioned (Segment operator) twice, and assigned row numbers (Sequence Project operator) twice. The two partitioned result sets are nested loop joined to get the rows in the appropriate row number range.
The enhanced OVER clause allows you to calculate the moving average directly without using the common-table expression and specifying Row_Number(). Here is the new query in Denali.
SELECT Ticker, TradeDate, AVG([Close]) OVER (PARTITION BY Ticker ORDER BY TradeDate ROWS BETWEEN 19 PRECEDING AND CURRENT ROW) as Close20MA
How much simpler the new query is! Let's find out how SQL Server smartly figures out which rows to average on. Here is the execution plan. (Due to the length of the plan, I have to break the plan into two screen shots).
As you can see above, the table EquityPrice is only scanned (Clustered Index Scan operator) once. How does SQL Server select the preceding 19 rows for each row? It actually uses Row_Number() under the hook! The first Segment operator is exactly the same as the Segment operator we saw in the previous execution plan. It assigns a sequential row number to each row based on the order of TradeDate in each ticker partition.
The following Window Spool operator creates a copy of the intermediate result set in a hidden work table. That is why the Actual Number of Rows doubles from 1608 to 3216.
The following Stream Aggregate operator calculates running aggregates from the first row to the current row. For example, for a row with a trade date 2010-04-15, the operator calculates the cumulative close price (CumulativeBottom1018), and cumulative row count (CumulativeBottom1016) from the first date of the year 2010-01-04 to 2010-04-15.
The intermediate result set with the cumulative totals is then sorted by Ticker and TradeDate (Sort operator), and partitioned again by Ticker (second Segment operator). The second Sequence Project operator assigns a row number (RowNumber1020) to each row in the intermediate result set.
The following Compute Scalar operator calculates another row number BottomRowNumber1021 as RowNumber1020 minus the size of our moving average window (20 rows).
The following Window Spool operator matches each row with a preceding row with a BottomRowNumber1020 value equal to its BottomRowNumber1021. For example, a row with BottomRowNumber1020 = 30 and BottomRowNumber1021 = 10 is matched with a row with BottomRowNumber1020 = 10. Because the first 20 rows have either negative or zero BottomRowNumber1021, they don't have a match. That is why the Actual Number of Rows = 1608 * 2 - 20 * 4 = 3136. For the row with BottomRowNumber1020 = 30, it has the cumulative close price of first 30 rows in a year. Its matching row with BottomRowNumber1020 = 10 has the cumulative close price of first 10 rows in a year. If you take the difference of the cumulative prices between the two rows, you have the 20-day cumulative price. If you divide the cumulative price by 20, then you get the 20-day moving average. That is how the following Stream Aggregate and Compute Scalar operators calculate the 20-day moving averages. The property of the Stream Aggregate operator shows four cumulative outputs. [CumulativeBottom1016] is the cumulative row count of the current row, [CumulativeTop1017] is the cumulative row count of the preceding row, [CumulativeBottom1018] is the cumulative close price of the current row, and [CumulativeTop1019] is the cumulative close price of the preceding row. The Compute Scalar operator takes these cumulative outputs, and calculates their difference in Expr1014 and Expr1015.
Here is the query result. Please note that even without an ORDER BY clause, the second query still returns the result set in the ascending order of Ticker and TradeDate because of the Sort operator in the plan.
Because the second query that uses the new ROW or RANGE clause in Denali only scans the EquityPrice table once, STATISTICS IO shows only 13 logical reads compared to 26 logical reads in the previous query. However, as you can see from the execution plan, the new ROW or RANGE clause still uses the Row_Number() function to find the rows to aggregate with.
Let me show you another example that calculates cumulative total.
Assume that you would like to calculate the cumulative monthly trading volume of each ticker. In SQL Server 2008 and 2005, you can write the following query.
SELECT p.Ticker, p.TradeDate, SUM(pra.Volume) as CumulativeMonthlyVolume
FROM EquityPrice p
JOIN EquityPrice pra
ON pra.Ticker = p.Ticker and pra.TradeDate >= DateAdd(MONTH, DateDiff(MONTH, 0, p.TradeDate), 0) and pra.TradeDate <= p.TradeDate
GROUP BY p.Ticker, p.TradeDate
ORDER BY p.Ticker, p.TradeDate
In Denali, the query is simplified to the following query that uses UNBOUNDED PRECEDING in the Row clause.
, SUM(Volume) OVER (PARTITION BY Ticker, DateAdd(MONTH, DateDiff(MONTH, 0, TradeDate), 0) ORDER BY TradeDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS CumulativeMonthlyVolume
Again, you can see the Sequence Project operator calculates the row number with the ROW_NUMBER() function.
Again, just like calculating the 20-day moving average, the following Window Spool operator creates a copy of the intermediate result set in a hidden work table, and doubles the Actual Number of Rows. The Stream Aggregate calculates cumulative totals for each row with the work table.
STATISTICS IO shows only 19 logical reads compared to 33 logical reads of the query in SQL Server 2008 and 2005.
The OVER clause in Denali has been enhanced to support moving average and cumulative total with the new ROW or RANGE clause. This article has shown you how to write queries to calculate moving average and cumulative total with the new syntax. Besides the simplicity of the new syntax, the query performance has also been improved significantly. In the next article, I will show you the new analytic functions.
Yan Pan has more than 9 years of experience working on various versions of SQL Server, from SQL Server 2000 to SQL Server 2008 R2, mainly in the financial industry. She is certified in MCITP SQL Server 2008/2005, MCDBA SQL Server 2000, and OCA Oracle 10g. She published a Wrox book titled "Microsoft SQL Server 2008 Administration with Windows PowerShell." Besides SQL Server and Windows PowerShell, she also performs financial data development and research, develops complex business intelligence solutions with Microsoft technologies, and programs in .NET and MATLAB. If you would like to read more articles by her, visit her website at http://yanpansql.com/.