SQL Server 2005 Integration Services – Performance – Part 30

Throughout our coverage of SQL Server 2005 Integration Services, we have
been primarily concerned with newly introduced and improved features, spending
considerably less time discussing its performance advantages over Data
Transformation Services. However, it is important to point out that superiority
of this technology goes beyond functionality alone, providing numerous
efficiency benefits, which we will present in more detail in this article.

As we already have explained in the introductory installments of this
series, one of the fundamental changes underlying design and implementation of
SQL Server 2005 Integration Services is separation of Control Flow and Data
Flow. The former determines execution sequence of tasks, defining workflow of
the package. The primary purpose of the latter is management of the data
movement from its source to destination and transformations of its content
during transit. Some of the most relevant performance tuning methods are
directly related to this dual architecture and therefore should be discussed
within its context.

Control Flow execution is managed by the SSIS run-time engine, which offers the
ability to run its tasks in parallel (in addition to traditional serial
approach, which tends to be less efficient, since it forces tasks to execute in
synchronous manner, where a subsequent one needs to wait for its predecessor to
complete before it can proceed). The maximum number of tasks that can execute
simultaneously is controlled with the MaxConcurrentExecutables package
property. You can set it explicitly or rely on its default value of -1, which
derives the limit by adding 2 to the total number of processors (logical rather
than physical, giving you a slight edge when hyperthreading is turned on).
Note, however, that you should not assign this number strictly based on the
number of tasks and their respective arrangement, but also take into
consideration the characteristics of individual tasks (such as their resource
utilization or blocking nature). Similarly, within Data Flow, it is possible to
benefit from parallelism. The total number of threads created by the pipeline
engine is calculated based on several factors, such as the number of data sources,
destinations, and asynchronous
transformations
. While you can affect the outcome of these calculations by
assigning an arbitrary value to the EngineThreads property of each Data Flow
task (which, by default, is set to 5), this serves merely as a suggestion to
the engine scheduler, which might make additional adjustments as it deems
appropriate (for example, in cases where this would eliminate the potential for
deadlocks). Alternative method of influencing thread distribution involves
modifying the pipeline by introducing additional transformations, such as
Multicast (which copies its input across multiple outputs), spreading the load
across multiple transformations and destinations.

In addition to improvements in thread organization, Data Flow can be
optimized by tuning its memory utilization. Data processed by its engine is
stored in chunks of memory referred to as buffers, created as new records are
read via source components and allocated dynamically according to requirements
imposed by transformations. The volume of records in a buffer depends on a
variety of factors, such as the number and width of data columns, as well as
buffer size (more about this shortly). The total number of buffers is derived
based primarily on a total count of data records (extracted from their source
or generated throughout the pipeline) and blocking nature of Data Flow
components. Such components can be viewed as non-blocking, semi-blocking, or
blocking, depending on whether data rows are relayed from one transformation to
the next after each one, some, or all of them have been processed. The
non-blocking type can be further subdivided into streaming (such as Character
Map, Conditional Split, or Data Conversion, which use readily available
internal features and data) and row-based (such as Export Column, Import Column
or OLE DB Command, which are inherently slower since they need to rely on
external entities for the same functionality). In either case though,
processing is very efficient, with data rows maintained in the same set of
buffers throughout the pipeline, handled in-place by consecutive
transformations (without the need to allocate new buffers and copy partially
processed data to them). Semi-blocking components (such as Merge or Merge Join)
require that records expected by downstream transformation are temporarily
retained (for example, Merge and Merge Join must first locate matching keys
from data streams being merged before producing the outcome) resulting
typically in the difference between counts of input and output rows. This, in
turn, impacts memory usage (and tends to increase thread count), since new
buffers need to be created (and existing ones maintained) in order to
accommodate some of the rows processed by an upstream transformation. Fully
blocking Data Flow components (such as Aggregate or Sort) where the entire
input data set is needed before any output records are generated are the worst
performers, since it is necessary to create sufficient amount buffers (and
dedicate processing resources to perform their calculations) to hold the entire
data set.

As mentioned earlier, buffers are formed on an as needed basis, according to
the count and size of individual data rows (which is calculated by reviewing the
data types of columns involved in transformations) and following rules dictated
by several Data Flow-specific parameters, which include:

  • DefaultBufferSize – set by default (but configurable) to
    10,485,760 Bytes (equal to 10 MB), which specifies maximum amount of memory
    that can be used for a single memory buffer.

  • DefaultBufferMaxRows – set by default (but configurable) to
    10,000, which limits the number of rows that can be stored in a single buffer.
    If the result of multiplying this value by the estimated row size is larger
    than DefaultBufferSize, then the actual number of rows is automatically reduced
    to fit within the designated range.

  • MaxBufferSize – hard-coded to 100 MB (and not configurable) sets
    the upper limit on the DefaultBufferSize (you will receive an error message if
    you attempt to increase DefaultBufferSize beyond this value). For this reason,
    it also affects the maximum number of rows that can be loaded into a single
    buffer (even if a larger number would be allowed based on the value assigned to
    DefaultBufferMaxRows).

  • MinBufferSize – defined by operating system features (in
    particular, by its virtual memory allocation mechanism), determines the
    smallest possible value of DefaultBufferSize (and is typically equal to 65,536
    Bytes – or 64 kB). If the amount of memory calculated by multiplying DefaultBufferMaxRows
    by the estimated row size is lower than MinBufferSize, than the number of
    allowed rows is automatically increased to ensure that the buffer size is not smaller
    than this absolute minimum.

Buffer and thread characteristics are reflected by the structure of
execution trees, generated during package execution. Each such tree is
associated with a separate buffer and, potentially, a distinct thread. You can
determine their structure and layout by turning on logging for your Data Flow
tasks and monitoring the BufferSizeTuning, PipelineExecutionTrees, and PipelineExecutionPlan
events, (the outcome is displayed in the Log Events window and recorded using
the log provider you selected). In addition, you should also review the results
of the Progress tab, where you can identify any redundant columns that, while
included in the buffer, might not be needed by any downstream transformations
(relevant messages pointing to this issue are displayed in the form of
warnings). Eliminating them from the pipeline minimizes the value of estimated
row size, optimizing buffer utilization. This can be further augmented by
ensuring that appropriate data types are specified when loading content of flat
files. (You can use for this purpose the Suggest Types button in the Advanced
section Flat File Connection Manager Editor dialog box, which provides advice
regarding choices of smallest integer and real data types as well as identifies
Boolean columns based on a data sample of arbitrary size). You might also want
to adjust the EngineThreads parameter (mentioned before) in order to allow a sufficient
number of threads to cover all execution trees and source components.

In order to determine the efficiency of your tuning techniques, you might
want to monitor performance of Data Flow transformations and buffer utilization
with Windows built-in System Monitor. This utility (which you can invoke by selecting
Performance entry from the Administrative Tools menu or by simply typing
Perfmon in the Start->Run box) displays real-time statistics based on the
set of arbitrarily selected counters. The ones most relevant in the context of
our discussion include Buffer memory (total memory used by buffers), Buffers in
use (number of buffer objects) and Buffers spooled (which you should pay
particular attention to, since its increase indicates shortage of physical
memory, forcing the buffer content into the Swap file on the hard drive, which
has significance adverse impact on peformance). More details can be obtained by
looking at equivalent statistics for flat and private buffers, such as Flat
buffer memory, Flat buffers in use, Private buffer memory, and Private buffers
in use. (Flat buffers are used in the majority of situations as storage for
data passing through the pipeline, while private ones are required by few
blocking transformations, such as Sorts or Lookups to cache data needed for
their temporary calculations). In order to display any of them in the System
Monitor interface, select the Add counters… item in its context sensitive
menu (or click on the "+" sign in the toolbar), pick SQL Server:SSIS
Pipeline from the Performance objects drop-down list, and choose the relevant
entries from the listbox underneath. Keep in mind, however, that all of the
values are server-wide, which might make their analysis difficult, unless you
have full control over all packages running on the system and can schedule
their execution in such way that other tasks do not impact measurements.

»


See All Articles by Columnist
Marcin Policht

Marcin Policht
Marcin Policht
Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Latest Articles