Getting Started with Azure SQL Data Warehouse - Part 5 - Page 2
By Arshad Ali
If you have experience working with SQL Server, you might have used Resource Governor to manage the workload in SQL Server to make resources available to all the requests and better utilize the resources. In a similar line, SQL Data Warehouse also has workload management capability built in. This capability allows you to assign an incoming request to have pre-configured resources (like memory and concurrency) so that they don’t get starved forever and SQL Data Warehouse better utilizes system resources to achieve the best performance for concurrent requests.
In SQL Data Warehouse, a resource class is a built-in server role that has pre-assigned limits for memory and concurrency and varies from different DWU tiers. Though irrespective of DWU tiers, there are four different resource classes and resources for these resource classes are mentioned in “Concurrency and workload management in SQL Data Warehouse” for each of the tiers. SQL Data Warehouse allocates resources to requests according to the resource class server role membership of the login that submits the requests.
SELECT r.nameAS [Member Name], m.nameAS [Resource Class]FROMsys.database_role_members rmJOINsys.database_principalsAS r ON rm.role_principal_id = r.principal_idJOINsys.database_principalsAS m ON rm.member_principal_id = m.principal_idWHERE r.nameIN('mediumrc','largerc','xlargerc');
Default or small resource class will be used if the login executing the query has not been assigned to any other resource classes. You need to assign a login to the higher resource class if the query it’s going to execute is very expensive. For example, you can consider using mediumrc (medium resource class) for CTAS operations that have large hash joins or SELECT operations that need more memory to avoid caching to disk or building, rebuilding, and reorganizing clustered columnstore indexes for smaller tables that have 10-15 columns, etc. Likewise you might consider using largerc (large resource class) if your query is more expensive than that, for example, if you are running very heavy CTAS operations that have huge hash joins, contain large aggregations (ORDER BY or GROUP BY clauses) or SELECT operations that require very large amounts of memory for operations like hash joins, aggregations (ORDER BY or GROUP BY clauses). Please be careful when using xlargerc (extra-large class) and use when a request requires extra-large resource consumption at run time.
Please note, the higher resource class you use for executing queries, the more resources it’s going to consume, limiting the maximum number of concurrent queries. Hence, I would suggest to you to go through “Concurrency and workload management in SQL Data Warehouse in detail and decide on a resource class based on DWU tier and concurrency need for your workload.
Although the Massively Parallel Processing (MPP) engine of SQL Data Warehouse does a fabulous job in executing your query in parallel across multiple nodes and provides faster response time, this section covers some of the controls that you have at hand to optimize the performance to meet your performance expectation.
Make sure you have the right collation setting for your SQL Data Warehouse database when you are creating it. Changing collation after database creation is not supported as of now.
SQL Data Warehouse does not support cross database queries. Hence, during your migration you can migrate tables from all your databases to a single SQL Data Warehouse database. We used this naming pattern for objects in SQL Data Warehouse and it worked fine in migration. <PreviousDatabaseName_PreviousSchemaName>.<TableName>.
As you can notice, the previous database and schema name now become a concatenated schema name in SQL Data Warehouse. This also requires you to change all references of these objects wherever applicable (like in views, stored procedures, etc).
Even though you have a distribution key on the same column for multiple tables, there might be possible data movement if the data type for this column is not exactly the same as the data type for the column on the other table. This happens because SQL Data Warehouse must do implicit type conversion before joins and hence you need to make sure you use distribution column in joins (or aggregations) with same data type to avoid implicit type conversion, eg. <int>=<bigint>.
Consider using smallrc for all select queries, use mediumrc or largerc for bigger data loads (especially loading directly to a CCI table or rebuilding CCI).
During data load or during maintenance you can scale your database to a higher DWU tier and once done scale it down. This way you can balance cost and performance.
As of this writing, SQL Data Warehouse supports a maximum 32 concurrent queries, in cases where you expect a large number of concurrent queries, consider using SQL Server Analysis Services (SSAS) or Azure Analysis Services as a caching layer and to support higher traffic.
SQL Data Warehouse allows you to specify a label for your query; don’t be shy to use it. Use it extensively as this will help you in troubleshooting. Make sure though you have defined a naming convention for these labels so that it’s not confusing to you or to others who are working on troubleshooting.
Though its applicable for any platform, particularly in SQL Data Warehouse it’s more important to use the smallest possible column size in your table definition. This is because Data Movement Service (DMS) uses fixed length as of maximum size (even the column data type is variable) when moving data across. This is very much like how data buffers work in SQL Server Integration Services (SSIS).
Likewise, whenever not needed don’t use NAVARCHAR or NCHAR, instead use VARCHAR or CHAR.
Whenever you face any performance issue, the first and foremost thing to check is the presence of the statistics and if those are up to date for all those involved tables. In several of my cases, only ensuring up to date statistics on involved tables improved the performance of the queries several times.
When there is a large data change in the table, you might consider re-creating or updating statistics on that table.
There are DMVs available to find out the difference between control and compute nodes and if the difference is say 5% or 10% (test it in your case to arrive on this number applicable in your specific case) then only update the statistics. This way you can optimize your load process and offload statistics update from regular data load process to maintenance process, which often runs in off-peak hours anyway.
You can start with creating statistics on all the columns of the table and gradually identify strategic columns used in joins, aggregation, filter and sorting and maintain statistics for those columns only. This way you can minimize the time it takes for re-creating or updating statistics on that table.
Clustered Columnstore index (CCI)
By default, every table gets created with CCI in SQL Data Warehouse; don’t change it unless you have a table with a considerably less number of rows.
When you have lots of updates or deletes on a table with CCI, over a period of time that table will have lots of ghost rows in the delete bitmap (a row is not deleted immediately from a table with CCI, it is marked as logically deleted and doesn’t reclaim physical storage from that row until the index is rebuilt) - sometime referred to as index fragmentation - this can cause performance issues for your queries and data load processes. To get rid of these unwanted rows or to remove fragmentation, you can either recreate the table with CTAS or rebuild the CCI.
To identify when an index rebuild is needed, there are couple of Dynamic Management Views (DMVs) or Catalog Views in SQL Data Warehouse. You can use these DMVs to identify the percentage of deleted rows and if the percentage is beyond some threshold (like 10% or 15% - you can decide based on your workload and performance needs) you can consider index rebuilding.
When you are rebuilding CCI, do it with a medium or large resource class, depending on the size of the table and DWU tier.
Data Load Process
During data refresh (ETL or ELT), normally we first bring data to a staging table and then transform and move to the final table (dimension or fact) on a defined interval. In this case, you can create the staging table as HEAP (as opposed to default CCI) and evaluate the performance. This should improve performance, as normally data refresh on interval will not have large data to get the benefit of CCI. Also, often you will be querying all the data (all columns) from the staging table to the final table.
When using partitioning, remember that SQL Data Warehouse already distributes data across 60 distributions and hence you should use it at a higher level to ensure each of these partitions get at least 1 million rows for better compression in CCI – leading to better performance.
In the case of SQL Server, you can switch-in a non-partitioned staging table to a partition of a partitioned table if the non-partitioned table has CHECK constraint defined. However, because there is no support for CHECK constraint in SQL Data Warehouse, you can only switch in from one partitioned table to another partitioned table (one partition at time) and hence make sure your staging table has the same partitioning strategy as your final table. Also, the distribution type and distribution key should match.
You can consider having narrower partitions for recent data (for example daily partitions in the current month) and wider partitions for older data (for example, monthly partitions for previous months in current year and yearly partitions in all earlier years) based on your query pattern.
For a large fact table, you can use partition switch-in to load data into it. Likewise, to delete or archive the data you can use partition switch-out. This way you can minimize the time and impact to other queries referencing this table.
During historical data load when you have a very large table to migrate, I would suggest moving data into chunks; this way your migration is more manageable and you will have better restart-ability from the point of failure in case of any failure. If you have a partitioned table, you can migrate data for each partition as one chunk (may run multiple processes to run them in parallel for quicker migration though). Once you have data imported for all the partitions in internal staging tables within SQL Data Warehouse, you can use partition switch-in to combine them together in your final table. When you use this technique, just make sure your internal staging tables (which hold data for each of the partitions) have exactly the same distribution type, distribution key and partitioning strategy as your final table.
If your source table is not partitioned or if you don’t want to create your final table as a partitioned table, then you can use CTAS to create your final table by combining data from all internal staging tables (with UNION ALL). But this process is going to be slow and might face memory out of exception. In that case, you can consider scaling your SQL Data Warehouse to a higher DWU tier and consider using largerc or xlargerc when running CTAS.
Currently, Polybase supports up to 1MB of row width, which will be sufficient in most of the cases. In case you have a very wide row, then you can consider doing vertical partitioning of your data into multiple tables and use the same distribution key to avoid any data movement when joining these tables.
If you have Unicode data, you can use Encoding = ‘UTF16’ in external file format to import the data into.
If your data has embedded lines or a line feed in it, currently you have two choices. You can replace those values before exporting data out from your source or use another file format like RCFILE.
As discussed in my earlier article, Polybase is designed to move data in parallel and hence it’s recommended to move data into multiple smaller files rather than moving it in a large single file. When you export data out from your source, export it into 60 or more files rather than exporting it in a single or smaller number of files.
In this article of the series, we looked the importance of statistics, how workload management works and how to go about performance optimization in SQL Data Warehouse and different aspects that impacts query performance.
Some of the products that appear on this site are from companies from which QuinStreet receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. QuinStreet does not include all companies or all types of products available in the marketplace.