Introduction
Azure SQL Data Warehouse is Microsoft’s new enterprise-class, elastic petabyte-scale, data warehouse service that can scale according to organizational demands in just few minutes. In my last article of the series, I discussed the importance of statistics, how to created them for better performance, workload management and performance optimization guidelines in SQL Data Warehouse. This article will cover the different ways to loading data into SQL Data Warehouse and how they work and perform.
Data Movement Methods
Azure SQL Data Warehouse provides different options to move data in and out; you can choose one or combination based on your expertise and data load performance requirements.
On a high level, these different methods can be categorized into two groups:
- Parallel Data Load – This method allows data load to leverage the power of the MPP (Massively Parallel Processing) engine of SQL Data Warehouse. With Polybase, data loads or extracts are done in parallel from multiple compute nodes to\from multiple data files in either Azure Blob Storage or Azure Data Lake Store. (As of this writing, these are two supported sources in Polybase of SQL Data Warehouse). That means, the higher the size of the SQL Data Warehouse, the better performance you can expect.
- Polybase
- Azure Data Factory with Polybase
- SQL Server Integration Services (SSIS) with control flow component, which internally leverages Polybase
- Single Gated Client – This method of data transfer involves control node. It means all incoming connections and data movement go through the control node. Though you can run multiple data load operations (all via control node) to achieve parallelism, it will be constrained on the resources of the control node. Of course, in this case scaling of SQL Data Warehouse with more compute nodes will not help much. You can consider these options as a fallback when Polybase cannot be used.
- BCP
- SQLBulkCopy
- SQL Server Integration Services (SSIS) with data flow component
- Azure Data Factory with default BULKINSERT rather than Polybase
Data Movement – Parallel
Polybase is the de-facto standard and recommended practice for loading data into SQL Data Warehouse. I talked about Polybase in greater detail in my last article.
Polybase
As discussed in my last article, PolyBase unifies data in relational data stores like Azure SQL Data Warehouse with non-relational data stores like Azure Blob storage, Azure Data Lake storage at the query level and enables seamless querying of data by using standard T-SQL query language without the requirement of additional manual processes, skills, or training as well as it allows moving data across.
If you recall from my article on SQL Data Warehouse architecture, with SQL Data Warehouse, at a minimum, you will have one control node (master) and one or more compute nodes (slaves) based on DWU tiers you have selected. The control node provides the endpoint for connecting to the SQL DW database. The control node contains an instance of SQL Database for storage of meta data (called shelldb and it doesn’t persist any user data) and an MPP engine for distributed query optimization, distributed processing and coordination. The control node manages and optimizes queries. When a request for query execution comes in, an MPP engine on the control node creates a distributed execution plan by breaking down the query into parallel processes and coordinates that processing across multiple compute nodes running in parallel. Apart from that, the control node also contains an instance of Data Management Service (DMS), which coordinates all the data movement with other nodes.
Each of the compute nodes, on the other hand, contain SQL Database and does all the heavy lifting of executing query processes (assigned to them by the control node). Each compute node works on its local data (in certain cases there might be some data movement across compute nodes to satisfy query requirements) and once it finishes processing the query, it returns its result to the control node. When the control node receives results from all the compute nodes, it then aggregates the data and returns the final result to the end-user who submitted the query for execution. DMS on compute nodes facilitates communication and data transportation across nodes in the SQL Data Warehouse.
DMS on compute nodes also includes Bridge implementation, which facilitates communication and data movement across external data sources (currently supported data sources are Azure Storage Blob and Azure Data Lake Store).
When you use Polybase, it bypasses the control node and actual data transfer happens directly between compute nodes and the data source. This means the more compute nodes you have, the better performance you are going to get. Because Polybase leverages MPP it’s a recommended practice for data movement.
Figure 1 – Data Movement with Polybase
Although Polybase is the recommended practice for data movement, you might have to choose other methods if the expected scenario is not supported in Polybase. For example:
- When you have a different number of columns in source and destination – though you might overcome this with a stage table in between.
- When you have data in a format not supported by Polybase, like fixed file format, JSON or XML, etc.
- When you have a column in your destination table of type VARCHAR(MAX), NVARCHAR(MAX) or VARBINARY(MAX).
- When you have data compressed with any other format not supported by Polybase, like WinZip, etc.
Please note, for Polybase to better parallelize a data read\load operation, make sure when you extract data from the source in the data file and compress it, you export data in multiples smaller files rather than one large single file. This is important because uncompressing the file is a single threaded operation. That means if you have data in a single large file, uncompressing might take longer than uncompressing the same amount of data in multiple files (multiple single threaded uncompressing running in parallel).
Azure Data Factory
Azure Data Factory allows you to create a workflow that can ingest data from both on-premises and cloud data stores, and transform or process data by using existing compute services such as Hadoop or Azure Batch, and publish the results to an on-premises or cloud data store for consumption. You can learn more about Azure Data Factory in “Introduction to Azure Data Factory”.
When you use Azure Data Factory, you can change the default value of “false” for the allowPolyBase property of SqlDWSink to “true” to indicate to use Polybase during data copy. You can also change different Polybase related settings with the polyBaseSettings property. You can learn more about Polybase with Azure Data Factory in “Use PolyBase to load data into Azure SQL Data Warehouse”.
"sink": { "type": "SqlDWSink", "allowPolyBase": true, "polyBaseSettings": { "rejectType": "percentage", "rejectValue": 10.0, "rejectSampleValue": 100, "useTypeDefault": true } }
SQL Server Integration Services (Control Flow Component)
SQL Server Integration Services (SSIS) is an ETL (Extract, Transform and Load) tool that allows you to extract and transform data from a wide variety of sources such as XML data files, flat files, and relational data sources, and then load the data into one or more destinations. You can learn more about SSIS in SQL Server Integration Services”.
SQL Server Integration Services (SSIS) includes several inbuilt components for doing most of the ETL operation though, if there is a need, it allows you to extend its capability by writing your own custom extensions or components.
SqlBulkCopy
If you are programmatically writing data to SQL Data Warehouse, you can use SqlBulkCopy class, which provides functionality like BCP. There are other ways to load data into a SQL Data Warehouse table (like INSERT statements), but SqlBulkCopy offers a significant performance advantage over them. It lets you efficiently bulk load a destination table with data from the source. The SqlBulkCopy class can be used to write data to SQL Data Warehouse tables. However, the data source is not limited to SQL Server or SQL Data Warehouse or SQL Database only; any data source can be used, as long as the data can be loaded to a DataTable instance or read with an IDataReader instance.
SQL Server Integration Services (Data Flow Component)
The Data Flow task is a native SSIS component, which encapsulates the data flow engine to move data from a variety of sources to a variety of destinations. SSIS includes several transformation components, which can used inside the SSIS package to transform, clean, and modify data as it is moved. Addition of a Data Flow task to a package control flow makes it possible for the package to extract, transform, and load data.
You can leverage Data Flow Task to extract data from a variety of sources, transform data in memory and load it in SQL Data Warehouse.
Typically, in cases of SQL Data Warehouse, it’s recommended to use ELT than to ETL process. This means you load data into staging tables in SQL Data Warehouse and then perform transformation in SQL Data Warehouse (which leverages the MPP engine for distributed parallel processing) and store the transformed data into the final table.
As an example, you can refer to “Load data from SQL Server into Azure SQL Data Warehouse (SSIS)”, which has details on loading data from SQL Server to SQL Data Warehouse with the SSIS data flow task component.
Azure Data Factory
Azure Data Factory uses SqlBulkCopy or BULKINSERT mechanism to load data in bulk into SQL Data Warehouse, although the data goes through the control node. If you want to change this default behavior and your data is in a supported format for Polybase you can change the settings in Azure Data Factory to use Polybase instead. Using PolyBase is an efficient way of loading large amounts of data into Azure SQL Data Warehouse with high throughput. You can see a large gain in the throughput by using PolyBase instead of the default BULKINSERT mechanism. You can refer to “Copy data to and from Azure SQL Data Warehouse using Azure Data Factory” to copy data to\from SQL Data Warehouse.
Conclusion
In this article, we looked at different ways of loading data into SQL Data Warehouse and how they work and perform.
Note, Polybase is the de-facto standard and recommended practice for loading data into SQL Data Warehouse as it bypasses control and loads data directly to compute nodes in parallel. In this scaling SQL Data Warehouse with higher DWU will yield a better load performance. In cases when you cannot use Polybase (directly or indirectly via Azure Data Factory or Azure SQL DW Upload Task in SSIS) for any reason then you should consider other methods.
Resources
Getting Started with Azure SQL Data Warehouse – Part 1
Getting Started with Azure SQL Data Warehouse – Part 2
Getting Started with Azure SQL Data Warehouse – Part 3
Getting Started with Azure SQL Data Warehouse – Part 4
Getting Started with Azure SQL Data Warehouse – Part 5