Introduction
Azure SQL Data Warehouse is Microsoft’s enterprise-class, elastic, petabyte-scale data warehouse service that can scale according to organizational demands in just a few minutes. In my last article of the series, I discussed round-robin and hash-distributed tables, how partitioning works in SQL Data Warehouse, and the impact of choosing the right distribution key. I also talked about PolyBase and how you can leverage it to import data into, or export data out of, SQL Data Warehouse quickly and easily. In this article, I am going to cover performance optimization in SQL Data Warehouse and the different aspects that impact performance.
Importance of Statistics
Statistics play a significant role in query performance. In SQL Data Warehouse, the Query Optimizer uses statistics to create distributed query execution plans that improve query performance; this includes estimating the cost of both data movement and query operations.
Statistics are statistical information about the distribution of values in one or more columns of a table. The SQL Data Warehouse Query Optimizer uses this information to estimate the cardinality, or number of rows, in the query result, which enables it to create a high-quality distributed query execution plan. For example, based on this information the Query Optimizer might decide whether to use an index seek operator or a more resource-intensive index scan operator to provide optimal query performance.
In SQL Data Warehouse, statistics are maintained on the control node as well as on each of the compute nodes (distributions) to improve query performance by minimizing data movement in the distributed query execution plan.
- The control node keeps statistics for a distributed table by merging the statistics content from all the distributions and storing the consolidated statistics information centrally.
- Each distribution maintains statistics about the data it contains (its portion of the distributed table) and uses them to improve query performance for the queries that run on that specific distribution.
As of this writing, SQL Data Warehouse does not automatically create or update statistics, unlike SQL Server, so you need to set up a process that uses the CREATE STATISTICS and UPDATE STATISTICS commands (see the examples after this list) to create and update statistics, ensuring the query optimizer can use them for a distributed execution plan, including data movement operations. It’s recommended to create statistics on distributed tables for all the strategic columns that are used in:
- Join predicates
- WHERE clauses
- GROUP BY clauses
- ORDER BY clauses
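For reference, the underlying commands are straightforward. Here is a minimal sketch, assuming the [dbo].[FactInternetSalesReason] table used in the examples later in this article and a hypothetical [SalesReasonKey] join column:
--Create a single-column statistic with a full scan of the data
CREATE STATISTICS stat_FactInternetSalesReason_SalesReasonKey
ON [dbo].[FactInternetSalesReason] ([SalesReasonKey]) WITH FULLSCAN;
--Refresh all statistics on the table after significant data changes
UPDATE STATISTICS [dbo].[FactInternetSalesReason];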
Statistics become stale over time as significant changes are made to the data in a table. In that case, you need to set up a process to update statistics after significant changes are applied to the data. But please be careful and don't update statistics too often. Updating statistics on a table causes queries that reference that table to recompile, and unlike an index rebuild, which can be done for each specific partition, a statistics update is done for the complete table; where a table contains a huge volume of data, it might take a very long time. To identify when a statistics update is needed, there are a couple of Dynamic Management Views (DMVs) and catalog views in SQL Data Warehouse. You can query them to identify the percentage of mismatch between the row counts the statistics were built on and the current row counts, and if the percentage is beyond some threshold (like 5% or 10% - you can decide based on your workload) you can consider updating them.
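As a simple starting point (this sketch uses catalog views and the STATS_DATE function rather than a full row-count comparison, so treat the threshold logic as something you add for your own workload), the following query lists user-created statistics and when each was last updated:
--List user-created statistics and when each was last updated
SELECT sch.name AS [SchemaName], t.name AS [TableName], st.name AS [StatisticsName],
       STATS_DATE(st.object_id, st.stats_id) AS [LastUpdated]
FROM sys.stats st
INNER JOIN sys.tables t ON st.object_id = t.object_id
INNER JOIN sys.schemas sch ON t.schema_id = sch.schema_id
WHERE st.user_created = 1
ORDER BY [LastUpdated] ASC;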
To help you automate the process of managing statistics during automated data loads, I have created a small framework to manage (recreate or update) statistics on SQL Data Warehouse tables. You need to call the stored procedure below and pass the required parameters: schema name, table name, and the type of action (drop/create or update) you want to perform.
--To drop existing and create new statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'
--To update statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'
The above stored procedure references a metadata table ([dbo].[StatisticsMetadata]) to drop and create statistics on a table. In this metadata table, you can specify all the columns of the table for which you want statistics to be maintained. In the absence of metadata for a given table, the procedure creates statistics on all the columns of the specified table. This way, you can start by creating statistics for all the columns in the table and, as you learn your workload, add the strategic columns to the metadata table so that statistics are managed for those selected columns only (avoiding the additional overhead of maintaining statistics for columns that don’t impact query performance).
The script below creates the metadata and execution metadata tables for this framework.
CREATE TABLE [dbo].[StatisticsMetadata] (
[StatisticsMetadataId] int NULL,
[SchemaName] varchar(255),
[TableName] varchar(255),
[ColumnName] varchar(255),
[ScanType] char(10),
[IsActive] bit NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);
CREATE TABLE [dbo].[StatisticsExecutionMetadata] (
[SchemaName] varchar(255),
[TableName] varchar(255),
[ColumnName] varchar(255),
[ActionType] char(10),
[ScanType] char(10),
[CreatedOn] datetime NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);
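Once these tables exist, you register the strategic columns you want the framework to manage. A hypothetical example, assuming [dbo].[FactInternetSalesReason] has a [SalesReasonKey] column you want covered by a FULLSCAN statistic:
--Register a column in the metadata table so the framework maintains its statistic
INSERT INTO [dbo].[StatisticsMetadata] ([StatisticsMetadataId], [SchemaName], [TableName], [ColumnName], [ScanType], [IsActive])
VALUES (1, 'dbo', 'FactInternetSalesReason', 'SalesReasonKey', 'FULLSCAN', 1);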
The stored procedure logs the operations it performs in the execution metadata table ([dbo].[StatisticsExecutionMetadata]) in case you want to reference them for auditing purposes. This is very helpful, as it keeps track of the statistics maintained on a table as well as when a specific statistics operation was run.
The script below contains the code for the stored procedure used to manage statistics; feel free to modify it to suit your specific requirements.
CREATE PROC [dbo].[uspManageStatistics] @SchemaName [VARCHAR](255), @TableName [VARCHAR](255), @ActionType [CHAR](10) AS
BEGIN
/*****************************************************************************
PROCEDURE NAME: [uspManageStatistics]
DESCRIPTION: Manages the statistics for the specified table.
First, enter records in the [dbo].[StatisticsMetadata] table for the table and columns whose statistics
you need to manage, along with the scan type (FULLSCAN or SAMPLE). Then call this stored procedure to manage
the statistics - ActionType = CREATE drops and creates statistics:
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'
You can use ActionType = UPDATE to update table statistics instead of DROP\CREATE:
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'
****************************************************************************/
SET NOCOUNT ON
DECLARE @SQLString NVARCHAR(MAX);
DECLARE @StatisticsMetadataId INT, @ColumnName VARCHAR(255), @ScanType CHAR(10)
DECLARE @CreatedOn DATETIME = GETDATE()
IF @ActionType = 'UPDATE'
BEGIN
SET @ScanType = 'FULLSCAN'
SET @SQLString = 'UPDATE STATISTICS ' + @SchemaName + '.' + @TableName + ' WITH ' + @ScanType + ';'
--PRINT @SQLString
EXECUTE sp_executesql @SQLString
INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
END
ELSE IF @ActionType = 'CREATE'
BEGIN
--Dropping statistics
IF OBJECT_ID('tempdb..#DropStatsList') IS NOT NULL
BEGIN
DROP TABLE #DropStatsList
END
CREATE TABLE #DropStatsList
([DropStatsCommand] VARCHAR (MAX))
WITH (DISTRIBUTION = ROUND_ROBIN, LOCATION = USER_DB, HEAP);
--Build the list of DROP STATISTICS commands for existing user-created statistics on the table
INSERT INTO #DropStatsList ([DropStatsCommand])
SELECT DISTINCT 'DROP STATISTICS ' + @SchemaName + '.' + st.name + '.' + ss.name + ';'
FROM sys.tables st
INNER JOIN sys.schemas sch ON st.schema_id = sch.schema_id
INNER JOIN sys.stats ss ON st.object_id = ss.object_id
INNER JOIN sys.stats_columns sc ON sc.stats_id = ss.stats_id AND sc.object_id = ss.object_id
INNER JOIN sys.columns c ON c.object_id = st.object_id AND sc.column_id = c.column_id
WHERE ss.user_created = 1 AND sch.name = @SchemaName AND st.name = @TableName
--SELECT * FROM #DropStatsList
WHILE EXISTS (SELECT TOP 1 1 FROM #DropStatsList)
BEGIN
SET @SQLString = (SELECT TOP 1 [DropStatsCommand] FROM #DropStatsList)
--PRINT @SQLString
EXECUTE sp_executesql @SQLString
DELETE #DropStatsList WHERE [DropStatsCommand] = @SQLString
END
--Creating statistics
IF EXISTS (SELECT TOP 1 1 FROM [dbo].[StatisticsMetadata] WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName AND IsActive = 1)
BEGIN
IF OBJECT_ID('tempdb.dbo.#OperationItem') IS NOT NULL
DROP TABLE #OperationItem
CREATE TABLE #OperationItem
(
[StatisticsMetadataId] INT,
[ColumnName] [varchar](255),
[ScanType] [char](10),
[IsProcessed] [bit]
)WITH ( LOCATION = USER_DB, DISTRIBUTION = ROUND_ROBIN, HEAP );
INSERT INTO #OperationItem ([StatisticsMetadataId], [ColumnName], [ScanType], [IsProcessed])
SELECT [StatisticsMetadataId], [ColumnName], [ScanType], 0
FROM [dbo].[StatisticsMetadata]
WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName AND IsActive = 1
WHILE EXISTS (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
BEGIN
SET @StatisticsMetadataId = (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
SET @ColumnName = (SELECT TOP 1 [ColumnName] FROM #OperationItem WHERE [StatisticsMetadataId] = @StatisticsMetadataId)
SET @ScanType = (SELECT TOP 1 [ScanType] FROM #OperationItem WHERE [StatisticsMetadataId] = @StatisticsMetadataId)
IF @ScanType = 'FULLSCAN'
SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @ColumnName + '
ON ' + @SchemaName + '.'+ @TableName + ' (' + @ColumnName + ') WITH ' + @ScanType + ''
ELSE
SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @ColumnName + '
ON ' + @SchemaName + '.'+ @TableName + ' (' + @ColumnName + ')'
--PRINT @SQLString
EXECUTE sp_executesql @SQLString
UPDATE #OperationItem
SET [IsProcessed] = 1
WHERE [StatisticsMetadataId] = @StatisticsMetadataId
--INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
--VALUES (@SchemaName, @TableName, @ColumnName, @ActionType, @ScanType, @CreatedOn)
END
INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
END
ELSE
BEGIN
IF OBJECT_ID('tempdb..#CreateStatsList') IS NOT NULL
BEGIN
DROP TABLE #CreateStatsList
END
CREATE TABLE #CreateStatsList
([CreateStatsColumn] VARCHAR(255))
WITH (DISTRIBUTION = ROUND_ROBIN, LOCATION = USER_DB, HEAP);
INSERT INTO #CreateStatsList([CreateStatsColumn])
SELECT c.name
FROM sys.tables st
INNER JOIN sys.schemas sch on st.schema_id = sch.schema_id
INNER JOIN sys.columns c on C.object_id = st.object_id
WHERE sch.name = @SchemaName AND st.name = @TableName
--Exclude environment-specific audit columns that don't need statistics; adjust this list for your environment
AND c.name NOT IN ('EDW_ModifiedByID','EDW_CreatedByID','EDW_ModifiedOn','EDW_CreatedOn')
DECLARE @CreateStatsColumn AS VARCHAR(255)
--SELECT * FROM #CreateStatsList
SET @ScanType = 'FULLSCAN'
WHILE EXISTS (SELECT TOP 1 1 FROM #CreateStatsList)
BEGIN
SET @CreateStatsColumn = (SELECT TOP 1 [CreateStatsColumn] FROM #CreateStatsList)
IF @ScanType = 'FULLSCAN'
SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @CreateStatsColumn + '
ON ' + @SchemaName + '.'+ @TableName + ' (' + @CreateStatsColumn + ') WITH ' + @ScanType + ''
ELSE
SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @CreateStatsColumn + '
ON ' + @SchemaName + '.'+ @TableName + ' (' + @CreateStatsColumn + ')'
--PRINT @SQLString
EXECUTE sp_executesql @SQLString
DELETE #CreateStatsList WHERE [CreateStatsColumn] = @CreateStatsColumn
--INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
--VALUES (@SchemaName, @TableName, @CreateStatsColumn, @ActionType, @ScanType, @CreatedOn)
END
INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
END
END
SET NOCOUNT OFF
END
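After a load runs, you can query the execution metadata table to review what the procedure did; for example:
--Review the most recent statistics operations logged by the framework
SELECT TOP 20 [SchemaName], [TableName], [ActionType], [ScanType], [CreatedOn]
FROM [dbo].[StatisticsExecutionMetadata]
ORDER BY [CreatedOn] DESC;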
In case you have composite joins on multiple columns that have cross-column relationships and dependencies on each other, creating statistics on those multiple columns might improve the distributed query execution plan. A multi-column statistic contains cross-column correlation information, called densities, with respect to neighboring columns and helps the query optimizer choose the right physical operators (for example, a hash join over a nested loop join, as a nested loop join is often not optimal in data warehousing scenarios).
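A minimal sketch of a multi-column statistic, assuming two hypothetical columns of [dbo].[FactInternetSalesReason] that are frequently joined or filtered together:
--Multi-column statistic capturing cross-column density information
CREATE STATISTICS stat_dbo_FactInternetSalesReason_SalesOrderNumber_SalesOrderLineNumber
ON [dbo].[FactInternetSalesReason] ([SalesOrderNumber], [SalesOrderLineNumber]) WITH FULLSCAN;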