Introduction
Azure SQL Data Warehouse is Microsoft’s new enterprise-class, elastic, petabyte-scale data warehouse service that can scale according to organizational demands in just a few minutes. In the last article of this series, I discussed round-robin and hash-distributed tables, how partitioning works in SQL Data Warehouse, and the impact of choosing the right distribution key. I also talked about PolyBase and how you can leverage it to import data into or export data out of SQL Data Warehouse quickly and easily. In this article, I am going to cover performance optimization in SQL Data Warehouse and the different aspects that impact performance.
Importance of Statistics
Statistics play a significant role in query performance. In SQL Data Warehouse, the Query Optimizer uses statistics to create distributed query execution plans that improve query performance; this includes estimating the cost of both data movement and query operations.
Statistics refer to statistical information about the distribution of values in one or more columns of a table. The SQL Data Warehouse Query Optimizer uses this statistical information to estimate the cardinality, or number of rows, of the query result, which enables it to create a high-quality distributed query execution plan. For example, based on this statistical information the Query Optimizer might decide whether to use an index seek operator or a more resource-intensive index scan operator to provide optimal query performance.
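If you want to see the statistical information the optimizer works from, you can inspect an existing statistics object with DBCC SHOW_STATISTICS. The sketch below is only an illustration; the statistics name is a placeholder for one that actually exists on your table:

-- Inspect the header, density vector and histogram of an existing statistics object
-- (the statistics name below is a placeholder)
DBCC SHOW_STATISTICS ("dbo.FactInternetSalesReason", stat_SalesReasonKey);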
In SQL Data Warehouse, statistics are maintained on both the control node and each of the distributions (compute nodes) to improve query performance by minimizing data movement in the distributed query execution plan.
- The control node keeps statistics for distributed tables by merging the statistics content of a table from all the distributions and storing the consolidated statistics information locally.
- Each distribution maintains statistics about the data it contains (its portion of the distributed table) and uses them to improve query performance for queries that run on that specific distribution.
As of this writing, SQL Data Warehouse does not automatically create or update statistics, unlike SQL Server. You therefore need to set up a process that uses the CREATE STATISTICS and UPDATE STATISTICS commands to create and refresh statistics, so that the Query Optimizer can use them when building a distributed execution plan, including data movement operations. It’s recommended to create statistics on distributed tables for all strategic columns (a short example follows the list below) if they are used in:
- Join predicates
- WHERE clauses
- GROUP BY clauses
- ORDER BY clauses
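For example, creating and updating statistics looks like the following; the column name is just a placeholder for one of your own strategic columns:

-- Create single-column statistics on a strategic column (illustrative column name)
CREATE STATISTICS stat_FactInternetSalesReason_SalesReasonKey
ON [dbo].[FactInternetSalesReason] ([SalesReasonKey]) WITH FULLSCAN;

-- Refresh all statistics on the table after significant data changes
UPDATE STATISTICS [dbo].[FactInternetSalesReason];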
Statistics become stale over time as significant changes are made to the data in a table. In that case, you need to set up a process to update statistics after significant changes are applied to the data. But please be careful and don’t update statistics too often. Updating statistics on a table causes queries that reference that table to recompile, and unlike an index rebuild, which can be done for each specific partition, a statistics update is done for the complete table; for a table containing a huge volume of data it might take a very long time. To identify when a statistics update is needed, there are a couple of Dynamic Management Views (DMVs) and catalog views in SQL Data Warehouse. You can query them to identify the percentage of mismatch between the data the statistics were built on and the current data, and if the percentage is beyond some threshold (like 5% or 10%; you can decide on it based on your workload) you can consider updating the statistics.
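As a starting point, here is a minimal sketch using catalog views that lists the user-created statistics and when each was last updated; you can combine this with your load history or current row counts to decide which statistics have drifted past your threshold:

-- List user-created statistics and their last update date
SELECT sch.name AS [schema_name],
       tbl.name AS [table_name],
       st.name  AS [statistics_name],
       STATS_DATE(st.object_id, st.stats_id) AS [last_updated]
FROM sys.stats st
     INNER JOIN sys.tables tbl ON st.object_id = tbl.object_id
     INNER JOIN sys.schemas sch ON tbl.schema_id = sch.schema_id
WHERE st.user_created = 1
ORDER BY [last_updated];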
To help you automate the process of managing statistics during automated data loads, I have created a small framework to manage (recreate or update) statistics on SQL Data Warehouse tables. You need to call the stored procedure below and pass in the required parameters: schema name, table name, and the type of action you want to perform (drop/create or update).
--To drop existing and create new statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'

--To update statistics
EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo', @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'
The above stored procedure references a meta-data table ([dbo].[StatisticsMetadata]) to drop and create statistics on a table. In this meta-data table, you can specify all the columns of a table for which you want statistics to be maintained. In the absence of meta-data for a given table, the procedure creates statistics on all columns of the specified table. This way you can start by creating statistics on all the columns of a table and, as you learn your workload, add only the strategic columns to the meta-data table so that statistics are managed for those selected columns (avoiding the additional overhead of maintaining statistics for columns that don’t impact query performance).
The script below creates the meta-data and execution meta-data tables for this framework.
CREATE TABLE [dbo].[StatisticsMetadata]
(
    [StatisticsMetadataId] int NULL,
    [SchemaName] varchar(255),
    [TableName] varchar(255),
    [ColumnName] varchar(255),
    [ScanType] char(10),
    [IsActive] bit NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);

CREATE TABLE [dbo].[StatisticsExecutionMetadata]
(
    [SchemaName] varchar(255),
    [TableName] varchar(255),
    [ColumnName] varchar(255),
    [ActionType] char(10),
    [ScanType] char(10),
    [CreatedOn] datetime NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN);
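To register strategic columns for a table, insert one row per column into the meta-data table. The rows below are purely illustrative; substitute your own table, columns, and scan types:

-- Illustrative rows: manage FULLSCAN statistics on two strategic columns
INSERT INTO [dbo].[StatisticsMetadata]
    ([StatisticsMetadataId], [SchemaName], [TableName], [ColumnName], [ScanType], [IsActive])
VALUES (1, 'dbo', 'FactInternetSalesReason', 'SalesReasonKey', 'FULLSCAN', 1);

INSERT INTO [dbo].[StatisticsMetadata]
    ([StatisticsMetadataId], [SchemaName], [TableName], [ColumnName], [ScanType], [IsActive])
VALUES (2, 'dbo', 'FactInternetSalesReason', 'SalesOrderNumber', 'FULLSCAN', 1);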
The stored procedure logs the operations it performs in the execution meta-data table ([dbo].[StatisticsExecutionMetadata]) in case you want to reference them for auditing purposes. This is very helpful as it keeps track of the statistics activity on a table as well as when a specific statistics operation was performed.
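For auditing, a simple query over the execution meta-data table, such as the one below, shows the most recent statistics operations recorded by the framework:

-- Review the most recent statistics operations
SELECT TOP 100 [SchemaName], [TableName], [ActionType], [ScanType], [CreatedOn]
FROM [dbo].[StatisticsExecutionMetadata]
ORDER BY [CreatedOn] DESC;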
The script below contains the code for the stored procedure used to manage statistics; please feel free to modify it to suit your specific requirements.
CREATE PROC [dbo].[uspManageStatistics]
    @SchemaName [VARCHAR](255),
    @TableName  [VARCHAR](255),
    @ActionType [CHAR](10)
AS
BEGIN
/*****************************************************************************
PROCEDURE NAME: [uspManageStatistics]
DESCRIPTION:    Manages the statistics for the specified table. First enter
                records in the [dbo].[StatisticsMetadata] table for the table
                and columns whose statistics you need to manage, along with
                the scan type (FULLSCAN or SAMPLE). Then call this stored
                procedure to manage the statistics.

                -- ActionType = CREATE drops and creates statistics:
                EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo',
                    @TableName = 'FactInternetSalesReason', @ActionType = 'CREATE'

                -- ActionType = UPDATE updates table statistics instead of drop/create:
                EXEC [dbo].[uspManageStatistics] @SchemaName = 'dbo',
                    @TableName = 'FactInternetSalesReason', @ActionType = 'UPDATE'
*****************************************************************************/
    SET NOCOUNT ON

    DECLARE @SQLString NVARCHAR(MAX);
    DECLARE @StatisticsMetadataId INT, @ColumnName VARCHAR(255), @ScanType CHAR(10)
    DECLARE @CreatedOn DATETIME = GETDATE()

    IF @ActionType = 'UPDATE'
    BEGIN
        --Update all statistics on the table
        SET @ScanType = 'FULLSCAN'
        SET @SQLString = 'UPDATE STATISTICS ' + @SchemaName + '.' + @TableName
                       + ' WITH ' + @ScanType + ';'
        --PRINT @SQLString
        EXECUTE sp_executesql @SQLString

        INSERT INTO [dbo].[StatisticsExecutionMetadata]
            ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
        VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
    END
    ELSE IF @ActionType = 'CREATE'
    BEGIN
        --Dropping existing user-created statistics
        IF OBJECT_ID('tempdb..#DropStatsList') IS NOT NULL
        BEGIN
            DROP TABLE #DropStatsList
        END

        CREATE TABLE #DropStatsList ([DropStatsCommand] VARCHAR(MAX))
        WITH (DISTRIBUTION = ROUND_ROBIN, LOCATION = USER_DB, HEAP);

        INSERT INTO #DropStatsList ([DropStatsCommand])
        SELECT 'DROP STATISTICS ' + @SchemaName + '.' + st.name + '.' + ss.name + ';'
        FROM sys.tables st
            INNER JOIN sys.schemas sch ON st.schema_id = sch.schema_id
            INNER JOIN sys.stats ss ON st.object_id = ss.object_id
            INNER JOIN sys.stats_columns sc ON sc.stats_id = ss.stats_id AND sc.object_id = ss.object_id
            INNER JOIN sys.columns c ON c.object_id = st.object_id AND sc.column_id = c.column_id
        WHERE ss.user_created = 1
            AND sch.name = @SchemaName
            AND st.name = @TableName

        --SELECT * FROM #DropStatsList
        WHILE EXISTS (SELECT TOP 1 1 FROM #DropStatsList)
        BEGIN
            SET @SQLString = (SELECT TOP 1 [DropStatsCommand] FROM #DropStatsList)
            --PRINT @SQLString
            EXECUTE sp_executesql @SQLString
            DELETE #DropStatsList WHERE [DropStatsCommand] = @SQLString
        END

        --Creating statistics from the meta-data table, when meta-data exists for the table
        IF EXISTS (SELECT TOP 1 1 FROM [dbo].[StatisticsMetadata]
                   WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName AND IsActive = 1)
        BEGIN
            IF OBJECT_ID('tempdb..#OperationItem') IS NOT NULL
                DROP TABLE #OperationItem

            CREATE TABLE #OperationItem
            (
                [StatisticsMetadataId] INT,
                [ColumnName] [varchar](255),
                [ScanType] [char](10),
                [IsProcessed] [bit]
            )
            WITH (LOCATION = USER_DB, DISTRIBUTION = ROUND_ROBIN, HEAP);

            INSERT INTO #OperationItem ([StatisticsMetadataId], [ColumnName], [ScanType], [IsProcessed])
            SELECT [StatisticsMetadataId], [ColumnName], [ScanType], 0
            FROM [dbo].[StatisticsMetadata]
            WHERE [SchemaName] = @SchemaName AND [TableName] = @TableName AND IsActive = 1

            WHILE EXISTS (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
            BEGIN
                SET @StatisticsMetadataId = (SELECT TOP 1 [StatisticsMetadataId] FROM #OperationItem WHERE [IsProcessed] = 0)
                SET @ColumnName = (SELECT TOP 1 [ColumnName] FROM #OperationItem WHERE [StatisticsMetadataId] = @StatisticsMetadataId)
                SET @ScanType = (SELECT TOP 1 [ScanType] FROM #OperationItem WHERE [StatisticsMetadataId] = @StatisticsMetadataId)

                IF @ScanType = 'FULLSCAN'
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @ColumnName
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @ColumnName + ') WITH ' + @ScanType
                ELSE
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @ColumnName
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @ColumnName + ')'

                --PRINT @SQLString
                EXECUTE sp_executesql @SQLString

                UPDATE #OperationItem SET [IsProcessed] = 1 WHERE [StatisticsMetadataId] = @StatisticsMetadataId

                --INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
                --VALUES (@SchemaName, @TableName, @ColumnName, @ActionType, @ScanType, @CreatedOn)
            END

            INSERT INTO [dbo].[StatisticsExecutionMetadata]
                ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
            VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
        END
        ELSE
        BEGIN
            --No meta-data found: create statistics on all columns of the table
            IF OBJECT_ID('tempdb..#CreateStatsList') IS NOT NULL
            BEGIN
                DROP TABLE #CreateStatsList
            END

            CREATE TABLE #CreateStatsList ([CreateStatsColumn] VARCHAR(255))
            WITH (DISTRIBUTION = ROUND_ROBIN, LOCATION = USER_DB, HEAP);

            INSERT INTO #CreateStatsList ([CreateStatsColumn])
            SELECT c.name
            FROM sys.tables st
                INNER JOIN sys.schemas sch ON st.schema_id = sch.schema_id
                INNER JOIN sys.columns c ON c.object_id = st.object_id
            WHERE sch.name = @SchemaName
                AND st.name = @TableName
                AND c.name NOT IN ('EDW_ModifiedByID', 'EDW_CreatedByID', 'EDW_ModifiedOn', 'EDW_CreatedOn')

            DECLARE @CreateStatsColumn AS VARCHAR(255)
            --SELECT * FROM #CreateStatsList
            SET @ScanType = 'FULLSCAN'

            WHILE EXISTS (SELECT TOP 1 1 FROM #CreateStatsList)
            BEGIN
                SET @CreateStatsColumn = (SELECT TOP 1 [CreateStatsColumn] FROM #CreateStatsList)

                IF @ScanType = 'FULLSCAN'
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @CreateStatsColumn
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @CreateStatsColumn + ') WITH ' + @ScanType
                ELSE
                    SET @SQLString = 'CREATE STATISTICS stat_' + @SchemaName + '_' + @TableName + '_' + @CreateStatsColumn
                                   + ' ON ' + @SchemaName + '.' + @TableName + ' (' + @CreateStatsColumn + ')'

                --PRINT @SQLString
                EXECUTE sp_executesql @SQLString

                DELETE #CreateStatsList WHERE [CreateStatsColumn] = @CreateStatsColumn

                --INSERT INTO [dbo].[StatisticsExecutionMetadata] ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
                --VALUES (@SchemaName, @TableName, @CreateStatsColumn, @ActionType, @ScanType, @CreatedOn)
            END

            INSERT INTO [dbo].[StatisticsExecutionMetadata]
                ([SchemaName], [TableName], [ColumnName], [ActionType], [ScanType], [CreatedOn])
            VALUES (@SchemaName, @TableName, NULL, @ActionType, @ScanType, @CreatedOn)
        END
    END

    SET NOCOUNT OFF
END
In case you have composite joins on multiple columns that have cross-column relationships and dependencies on each other, creating statistics on those multiple columns might improve the distributed query execution plan. A multi-column statistic contains cross-column correlation information, called densities, across the combination of columns, and helps the Query Optimizer choose the right physical operators (for example, a hash join over a nested loop join, as a nested loop join is often not optimal in data warehousing scenarios).
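For instance, if two columns are frequently joined or filtered together, a multi-column statistic like the sketch below captures their correlation; the table and column names are only placeholders for your own:

-- Multi-column statistics on columns that are commonly joined or filtered together
CREATE STATISTICS stat_dbo_FactInternetSalesReason_SalesOrderNumber_SalesOrderLineNumber
ON [dbo].[FactInternetSalesReason] ([SalesOrderNumber], [SalesOrderLineNumber]) WITH FULLSCAN;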