Introduction
Azure SQL Data Warehouse is Microsoft’s new enterprise-class, elastic, petabyte-scale data warehouse service that can scale according to organizational demands in just a few minutes. In the last article of this series, I discussed round-robin and hash-distributed tables, how partitioning works in SQL Data Warehouse, and the impact of choosing the right distribution key. I also talked about PolyBase and how you can leverage it to import data into or export data out of SQL Data Warehouse quickly and easily. In this article, I am going to cover performance optimization in SQL Data Warehouse and the different aspects that impact performance.
Importance of Statistics
Statistics play a significant role in query performance. In SQL Data Warehouse, the Query Optimizer uses statistics to create distributed query execution plans that improve query performance; this includes estimating the cost of both data movement and query operations.
Statistics refer to statistical information about the distribution of values in one or more columns of a table. The SQL Data Warehouse Query Optimizer uses this statistical information to estimate the cardinality, or number of rows, of the query result, which enables it to create a high-quality distributed query execution plan. For example, based on this statistical information the Query Optimizer might decide whether to use an index seek operator or a more resource-intensive index scan operator to provide optimal query performance.
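If you want to see the statistical information the optimizer works from, you can inspect an existing statistics object with DBCC SHOW_STATISTICS. The sketch below is only an illustration; the statistics name is a placeholder for one that actually exists on your table:

-- Inspect the header, density vector and histogram of an existing statistics object
-- (the statistics name below is a placeholder)
DBCC SHOW_STATISTICS ("dbo.FactInternetSalesReason", stat_SalesReasonKey);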
In SQL Data Warehouse, statistics are maintained on both the control node and each of the distributions (compute nodes) to improve query performance by minimizing data movement in the distributed query execution plan.
- The control node keeps statistics for distributed tables by merging the statistics content of a table from all the distributions and storing the consolidated statistics information locally.
- Each distribution maintains statistics about the data it contains (its portion of the distributed table) and uses them to improve query performance for queries that run on that specific distribution.
As of this writing, SQL Data Warehouse does not automatically create or update statistics, unlike SQL Server. You therefore need to set up a process that uses the CREATE STATISTICS and UPDATE STATISTICS commands to create and refresh statistics, so that the Query Optimizer can use them when building a distributed execution plan, including data movement operations. It’s recommended to create statistics on distributed tables for all strategic columns (a short example follows the list below) if they are used in:
- Join predicates
- WHERE clauses
- GROUP BY clauses
- ORDER BY clauses
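For example, creating and updating statistics looks like the following; the column name is just a placeholder for one of your own strategic columns:

-- Create single-column statistics on a strategic column (illustrative column name)
CREATE STATISTICS stat_FactInternetSalesReason_SalesReasonKey
ON [dbo].[FactInternetSalesReason] ([SalesReasonKey]) WITH FULLSCAN;

-- Refresh all statistics on the table after significant data changes
UPDATE STATISTICS [dbo].[FactInternetSalesReason];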
Statistics become stale over time as significant changes are made to the data in a table. In that case, you need to set up a process to update statistics after significant changes are applied to the data. But please be careful and don’t update statistics too often. Updating statistics on a table causes queries that reference that table to recompile, and unlike an index rebuild, which can be done for each specific partition, a statistics update is done for the complete table; for a table containing a huge volume of data it might take a very long time. To identify when a statistics update is needed, there are a couple of Dynamic Management Views (DMVs) and catalog views in SQL Data Warehouse. You can query them to identify the percentage of mismatch between the data the statistics were built on and the current data, and if the percentage is beyond some threshold (like 5% or 10%; you can decide on it based on your workload) you can consider updating the statistics.
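As a starting point, here is a minimal sketch using catalog views that lists the user-created statistics and when each was last updated; you can combine this with your load history or current row counts to decide which statistics have drifted past your threshold:

-- List user-created statistics and their last update date
SELECT sch.name AS [schema_name],
       tbl.name AS [table_name],
       st.name  AS [statistics_name],
       STATS_DATE(st.object_id, st.stats_id) AS [last_updated]
FROM sys.stats st
     INNER JOIN sys.tables tbl ON st.object_id = tbl.object_id
     INNER JOIN sys.schemas sch ON tbl.schema_id = sch.schema_id
WHERE st.user_created = 1
ORDER BY [last_updated];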
To help you automate the process of managing statistics during automated data loads, I have created a small framework to manage (recreate or update) statistics on SQL Data Warehouse tables. You need to call the stored procedure below and pass in the required parameters: schema name, table name, and the type of action you want to perform (drop/create or update).
--To drop existing and create new statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'

--To update statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'
The above stored procedure references a meta-data table ([dbo].[StatisticsMetadata]) to drop and create statistics on a table. In this meta-data table, you can specify all the columns of a table for which you want statistics to be maintained. In the absence of meta-data for a given table, the procedure creates statistics on all columns of the specified table. This way you can start by creating statistics on all the columns of a table and, as you learn your workload, add only the strategic columns to the meta-data table so that statistics are managed for those selected columns (avoiding the additional overhead of maintaining statistics for columns that don’t impact query performance).
The script below creates the meta-data and execution meta-data tables for this framework.
CREATE TABLE [dbo].[StatisticsMetadata]
(
    [StatisticsMetadataId] int NULL,
    [SchemaName] varchar(255),
    [TableName] varchar(255),
    [ColumnName] varchar(255),
    [ScanType] char(10),
    [IsActive] bit NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);

CREATE TABLE [dbo].[StatisticsExecutionMetadata]
(
    [SchemaName] varchar(255),
    [TableName] varchar(255),
    [ColumnName] varchar(255),
    [ActionType] char(10),
    [ScanType] char(10),
    [CreatedOn] datetime NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);
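To register strategic columns for a table, insert one row per column into the meta-data table. The rows below are purely illustrative; substitute your own table, columns, and scan types:

-- Illustrative rows: manage FULLSCAN statistics on two strategic columns
INSERT INTO [dbo].[StatisticsMetadata]
    ([StatisticsMetadataId], [SchemaName], [TableName], [ColumnName], [ScanType], [IsActive])
VALUES (1, 'dbo', 'FactInternetSalesReason', 'SalesReasonKey', 'FULLSCAN', 1);

INSERT INTO [dbo].[StatisticsMetadata]
    ([StatisticsMetadataId], [SchemaName], [TableName], [ColumnName], [ScanType], [IsActive])
VALUES (2, 'dbo', 'FactInternetSalesReason', 'SalesOrderNumber', 'FULLSCAN', 1);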
The stored procedure logs the operations it performs in the execution meta-data table ([dbo].[StatisticsExecutionMetadata]) in case you want to reference them for auditing purposes. This is very helpful as it keeps track of the statistics activity on a table as well as when a specific statistics operation was performed.
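For auditing, a simple query over the execution meta-data table, such as the one below, shows the most recent statistics operations recorded by the framework:

-- Review the most recent statistics operations
SELECT TOP 100 [SchemaName], [TableName], [ActionType], [ScanType], [CreatedOn]
FROM [dbo].[StatisticsExecutionMetadata]
ORDER BY [CreatedOn] DESC;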
The script below contains the code for the stored procedure used to manage statistics; please feel free to modify it to suit your specific requirements.
CREATE PROC [dbo].[uspManageStatistics]
    @SchemaName [VARCHAR](255),
    @TableName  [VARCHAR](255),
    @ActionType [CHAR](10)
AS
BEGIN
/*****************************************************************************
PROCEDURE NAME: [uspManageStatistics]
DESCRIPTION:    Manages the statistics for the specified table. First enter
                records in the [dbo].[StatisticsMetadata] table for the table
                and columns whose statistics you need to manage, along with
                the scan type (FULLSCAN or SAMPLE). Then call this stored
                procedure to manage the statistics.

                -- ActionType = CREATE drops and creates statistics:
                EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo',
                    @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'

                -- ActionType = UPDATE updates table statistics instead of drop/create:
                EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo',
                    @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'
*****************************************************************************/
    SET NOCOUNT ON

    DECLARE @SQLString NVARCHAR(MAX);
    DECLARE @StatisticsMetadataId INT, @ColumnName VARCHAR(255), @ScanType CHAR(10)
    DECLARE @CreatedOn DATETIME = GETDATE()

    IF @ActionType = 'UPDATE'
    BEGIN
        --Update all statistics on the table
        SET @ScanType = 'FULLSCAN'
        SET @SQLString = 'UPDATE STATISTICS ' + @SchemaName + '.' + @TableName
                       + ' WITH ' + @ScanType + ';'
        --PRINT @SQLString
        EXECUTE sp_executesql @SQLString

        INSERT INTO [dbo].[StatisticsExecutionMetadata]
            ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
        VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
    END
    ELSE IF @ActionType = 'CREATE'
    BEGIN
        --Dropping existing user-created statistics
        IF OBJECT_ID('tempdb..#DropStatsList') IS NOT NULL
        BEGIN
            DROP TABLE #DropStatsList
        END

        CREATE TABLE #DropStatsList ([DropStatsCommand] VARCHAR(MAX))
        WITH (DISTRIBUTION = ROUND_ROBIN, LOCATION = USER_DB, HEAP);

        INSERT INTO #DropStatsList ([DropStatsCommand])
        SELECT 'DROP STATISTICS ' + @SchemaName + '.' + st.name + '.' + ss.name + ';'
        FROM sys.tables st
            INNER JOIN sys.schemas sch ON st.schema_id = sch.schema_id
            INNER JOIN sys.stats ss ON st.object_id = ss.object_id
            INNER JOIN sys.stats_columns sc ON sc.stats_id = ss.stats_id AND sc.object_id = ss.object_id
            INNER JOIN sys.columns c ON c.object_id = st.object_id AND sc.column_id = c.column_id
        WHERE ss.user_created = 1
            AND sch.name = @SchemaName
            AND st.name = @TableName

        --SELECT * FROM #DropStatsList
        WHILE EXISTS (SELECT TOP 1 1 FROM #DropStatsList)
        BEGIN
            SET @SQLString = (SELECT TOP 1 [DropStatsCommand] FROM #DropStatsList)
            --PRINT @SQLString
            EXECUTE sp_executesql @SQLString
            DELETE #DropStatsList WHERE [DropStatsCommand] = @SQLString
        END

        --Creating statistics from the meta-data table, when meta-data exists for the table
        IF EXISTS (SELECT TOP 1 1 FROM [dbo].[StatisticsMetadata]
                   WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName AND IsActive = 1)
        BEGIN
            IF OBJECT_ID('tempdb..#OperationItem') IS NOT NULL
                DROP TABLE #OperationItem

            CREATE TABLE #OperationItem
            (
                [StatisticsMetadataId] INT,
                [ColumnName] [varchar](255),
                [ScanType] [char](10),
                [IsProcessed] [bit]
            )
            WITH (LOCATION = USER_DB, DISTRIBUTION = ROUND_ROBIN, HEAP);

            INSERT INTO #OperationItem ([StatisticsMetadataId], [ColumnName], [ScanType], [IsProcessed])
            SELECT [StatisticsMetadataId], [ColumnName], [ScanType], 0
            FROM [dbo].[StatisticsMetadata]
            WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName AND IsActive = 1

            WHILE EXISTS (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
            BEGIN
                SET @StatisticsMetadataId = (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
                SET @ColumnName = (SELECT TOP 1 [ColumnName] FROM #OperationItem WHERE [StatisticsMetadataId] = @StatisticsMetadataId)
                SET @ScanType = (SELECT TOP 1 [ScanType] FROM #OperationItem WHERE [StatisticsMetadataId] = @StatisticsMetadataId)

                IF @ScanType = 'FULLSCAN'
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @ColumnName
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @ColumnName + ') WITH ' + @ScanType
                ELSE
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @ColumnName
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @ColumnName + ')'

                --PRINT @SQLString
                EXECUTE sp_executesql @SQLString

                UPDATE #OperationItem SET [IsProcessed] = 1 WHERE [StatisticsMetadataId] = @StatisticsMetadataId

                --INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
                --VALUES (@SchemaName, @TableName, @ColumnName, @ActionType, @ScanType, @CreatedOn)
            END

            INSERT INTO [dbo].[StatisticsExecutionMetadata]
                ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
            VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
        END
        ELSE
        BEGIN
            --No meta-data found: create statistics on all columns of the table
            IF OBJECT_ID('tempdb..#CreateStatsList') IS NOT NULL
            BEGIN
                DROP TABLE #CreateStatsList
            END

            CREATE TABLE #CreateStatsList ([CreateStatsColumn] VARCHAR(255))
            WITH (DISTRIBUTION = ROUND_ROBIN, LOCATION = USER_DB, HEAP);

            INSERT INTO #CreateStatsList ([CreateStatsColumn])
            SELECT c.name
            FROM sys.tables st
                INNER JOIN sys.schemas sch ON st.schema_id = sch.schema_id
                INNER JOIN sys.columns c ON c.object_id = st.object_id
            WHERE sch.name = @SchemaName
                AND st.name = @TableName
                AND c.name NOT IN ('EDW_ModifiedByID', 'EDW_CreatedByID', 'EDW_ModifiedOn', 'EDW_CreatedOn')

            DECLARE @CreateStatsColumn AS VARCHAR(255)
            --SELECT * FROM #CreateStatsList
            SET @ScanType = 'FULLSCAN'

            WHILE EXISTS (SELECT TOP 1 1 FROM #CreateStatsList)
            BEGIN
                SET @CreateStatsColumn = (SELECT TOP 1 [CreateStatsColumn] FROM #CreateStatsList)

                IF @ScanType = 'FULLSCAN'
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @CreateStatsColumn
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @CreateStatsColumn + ') WITH ' + @ScanType
                ELSE
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @CreateStatsColumn
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @CreateStatsColumn + ')'

                --PRINT @SQLString
                EXECUTE sp_executesql @SQLString

                DELETE #CreateStatsList WHERE [CreateStatsColumn] = @CreateStatsColumn

                --INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
                --VALUES (@SchemaName, @TableName, @CreateStatsColumn, @ActionType, @ScanType, @CreatedOn)
            END

            INSERT INTO [dbo].[StatisticsExecutionMetadata]
                ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
            VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
        END
    END

    SET NOCOUNT OFF
END
In case you have composite joins on multiple columns that have cross-column relationships and dependencies on each other, creating statistics on those multiple columns might improve the distributed query execution plan. A multi-column statistic contains cross-column correlation information, called densities, across the combination of columns, and helps the Query Optimizer choose the right physical operators (for example, a hash join over a nested loop join, as a nested loop join is often not optimal in data warehousing scenarios).
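For instance, if two columns are frequently joined or filtered together, a multi-column statistic like the sketch below captures their correlation; the table and column names are only placeholders for your own:

-- Multi-column statistics on columns that are commonly joined or filtered together
CREATE STATISTICS stat_dbo_FactInternetSalesReason_SalesOrderNumber_SalesOrderLineNumber
ON [dbo].[FactInternetSalesReason] ([SalesOrderNumber], [SalesOrderLineNumber]) WITH FULLSCAN;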