In the previous installment of our series, we discussed various approaches
to debugging Control Flow components of SQL Server 2005 Integration Services.
While this satisfied our primary goal of facilitating real-time monitoring and
troubleshooting of the package workflow, it is important to note that the
majority of the techniques presented so far are not applicable when dealing
with data manipulation, which takes place within the Data Flow portion of SSIS
packages (which, unlike SQL Server 7.0/2000 Data Transformation Services,
separate the design and implementation of control and data flows). The purpose
of this article is to demonstrate debugging methods that are relevant within
the context of the SSIS Data Flow.
We will start with a quick overview of its characteristics, focusing on the ones
that can be used for debugging purposes. The Data Flow interface within SSIS
Designer of SQL Server 2005 Business Intelligence Development Studio features
four main types of components:
- source components – serve as the initial elements in the flow of data,
extracting it from a variety of sources, such as DataReader (consuming data
from a .NET provider via an ADO.NET-based connection manager), Excel (utilizing
the Excel connection manager to access data located in Microsoft Excel
worksheets), Flat File (deriving data from delimited or fixed-width text files
using the Flat File connection manager), Raw File (intended for data formatted
by the Raw File destination component and, therefore, requiring very limited
processing), OLE DB (offering access to various OLE DB compliant data sources
via the OLE DB connection manager), or XML (capable of reading XML-formatted
data files).
- destination components – function as the outputs in the flow of data,
delivering it to a variety of data store types. The list of possible
destinations is more extensive than that of sources (although it does not
include XML files), providing, for example, the ability to process data using
Analysis Services techniques (such as Data Mining Model Training, Dimension
Processing, or Partition Processing).
- transformation components – perform modifications of data during its
transition between source and destination components. Such changes might
involve, for example, combining data from two (or more) sources, reversing
this process by splitting it across multiple outputs, or performing aggregate
and sort operations. In general, transformations belong to one of two
categories – synchronous, which process each input row as it arrives,
producing exactly one output row for each input row (such as Derived Column or
Data Conversion), and asynchronous, which might need to accumulate input rows
before generating output, so the numbers of input and output rows can differ
(such as Aggregate or Sort).
- pipeline paths – connect the output of one Data Flow component (source or
transformation) to the input of another (transformation or destination). Even
though they might seem equivalent to precedence constraints, which provide an
interface between a pair of Control Flow components, pipeline paths do not
impose constraints, but simply channel the flow of data.
The Data Flow tab of the SSIS Designer shows the content of an individual
Data Flow task, which constitutes an execution unit within the Control Flow
(note that while an SSIS package has a single control flow, it can contain any
number of Data Flow tasks). Each of them is managed independently (you display
the one you intend to work with either by double clicking on its graphical
representation in the Control Flow area or by selecting the Edit item from its
context sensitive menu). To populate the Data Flow area, you simply drag and
drop items from the Toolbox (whose content changes as soon as you switch away
from the Control Flow tab). The list is divided into three categories –
Sources, Transformations, and Destinations – which we presented earlier in
this article. Within the second of these categories, there is the Script
Component, roughly equivalent to the Script Task, but with a fair share of
unique characteristics.
If you review the list of items available in the context sensitive menus of
any of these components, you will notice that the Edit Breakpoints… option is
missing. This is a clear indication that event-based breakpoints, so useful
for debugging the Control Flow, did not find their way into the implementation
of the Data Flow. This apparently intentional design decision is likely a
consequence of the fact that the majority of processing in this case is data
driven, which shifts the focus of debugging away from events. Breakpoints are
also ignored when executing Script Component code in debugging mode, even
though it is possible to create them by following the procedure we described
when discussing the Script Task (since both of them utilize Microsoft Visual
Studio for Applications for creating and modifying Visual Basic .NET code).
There are, however, several alternatives that compensate for the lack of this
functionality:
- color-coded component status indicators – are the simplest ones to review
and analyze and, in addition, are consistent with the ones used to designate
Control Flow task execution status, with white (indicating that the component
is idle, waiting to be invoked), yellow (designating the running phase), green
(marking successful completion), or red (signaling that errors were
encountered during execution).
- pipeline path data record counter – provides a quick and convenient way to
quantify the status of data processing by displaying the total number of data
records transferred along each pipeline path.
- Data Viewers – assigned to pipeline paths between two components, allow you
to view the content of records currently passing through them. (This way, you
can review data directly after it is extracted from its source, before and
after it is processed by a transformation component, or directly before it is
stored in its final destination.) Viewers are created from within the Data
Flow Path Editor window (which you can activate by selecting either the Data
Viewers or the Edit item from the context sensitive menu of the data flow path
that you intend to debug). Once the Editor is displayed, with its Data Viewers
section active, click on the Add… command button to display the tabbed
Configure Data Viewer window. On the General tab, you can select one of four
different viewing formats – Grid (with data displayed in a tabular format,
including all rows and arbitrarily chosen columns), Histogram (with all values
from a designated column displayed as a histogram – simplifying visual
analysis of their distribution), Scatter Plot (with data values from two
columns plotted on the X and Y axes, respectively – allowing you to compare
pairs of numerical values), or Column Chart (with counts of individual values
from a selected data column appearing as separate columns in a chart). The
label and content of the second tab of the Configure Data Viewer window change
according to the type of viewer currently selected on the General tab, giving
you the ability to designate a column (or columns) to be included in the grid,
histogram, plot, or chart. Obviously, your choice will largely depend on the
type of data being processed (using a scatter plot or histogram might not make
sense in many situations) – but note that you can assign multiple data viewers
to the same pipeline path if you desire to do so (using any variation of data
columns for each). After a Data Viewer is created, its visual indicator (a
pair of glasses with a grid behind it) appears next to the data path with
which it has been associated. Following the package launch (triggered using
the Start Debugging option from the Debug menu or by pressing the F5 function
key), once the data flow reaches the point where the Data Viewer has been
added, a window with the configured viewer type (or multiple windows, if
several different viewer types have been configured) is automatically
displayed within the Designer interface, one buffer at a time (the number of
rows per buffer depends on their size). From within the Data Viewer window,
you can click on the Continue button to load another buffer of data records,
use the Detach button to proceed without further interruptions, or copy data
to the Clipboard with the Copy Data button. Based on the statistics provided
in the status bar, you can determine the number of rows and buffers that have
passed through the data flow path.
- Progress pane of the Designer interface (which is automatically renamed to
"Execution Results" as soon as the execution is completed), the Output window
(which you can activate from the Windows submenu of the Debug menu of the SSIS
Designer interface), as well as other windows we described in our previous
article (such as Locals, Watch, Command Window, Immediate Window, or
Breakpoints) are also available and offer additional information about the
Data Flow execution. Note that the Data Flow task is referenced in messages
appearing in the Output and Progress windows as DTS.Pipeline. Interestingly,
the Breakpoints window lists all of the data viewers you have created.
- Row Count Data Flow Transformation component – provides the ability to
obtain the count of rows processed within the Data Flow. While this is
essentially the same information that is already available from the pipeline
path data record counter (as well as being included in Data Viewers), the Row
Count component offers two additional advantages. First of all, it stores the
row count in a variable (you define it using an already familiar method – the
Variables window within SQL Server 2005 Business Intelligence Development
Studio – and assign it to the Row Count component from the Component
Properties tab of its Advanced Editor interface) that can subsequently be used
within the Control Flow or other Data Flow tasks. (Note, however, that the
variable that stores the value calculated by the Row Count Data Flow
Transformation component gets assigned its value only after its Data Flow task
has completed – this general rule applies to all SSIS variables whose values
are affected by the execution of a Data Flow.) Furthermore, you can use the
methods described in our previous article to record it, or follow the steps
described in Microsoft Knowledge Base article KB 906560 to write it to the
Windows Application Log (see the Script Task sketch following this list). The
other benefit of the Row Count component is its ability to serve as a
temporary target for data during design and testing stages (despite the fact
that it is not listed in the Destination components section of the Toolbox).
- Error Output of Data Flow components – the majority of data flow components
(with a few exceptions among sources such as the DataReader, transformations
such as Import Column or OLE DB Command, and destinations such as SQL Server
Mobile) support the error output feature (represented graphically by the red
arrow pointing away from them), which allows customizing execution behavior
when an error is encountered. This might happen in the process of extracting
data from its source, transforming its content (including data conversion or
truncation), or loading it to its destination. Error Output takes into
consideration two types of situations involving data related issues –
truncation (resulting in partial data loss, which in some circumstances might
be expected or even desired) and errors (typically indicating a failure caused
by incorrect or missing data, or by a mismatched conversion operation). From
the Error Output section (or the Configure Error Output dialog box) of the
respective component's Editor interface, you can specify for each data column
what action should result in case an error or truncation takes place. Your
choices include Fail Component (the default option), Ignore Failure
(delivering data to the regular output of the transformation or source), or
Redirect Row (redirecting problematic rows to the error output – which can
either lead to one of the Data Flow destinations or be further processed by
one of the transformation components).
- Data Flow Transformation Script Component – the equivalent of the Script
Task in the Control Flow – gives you the ability to write Visual Basic .NET
code executing in the context of the Data Flow, serving as a custom source,
destination, or transformation component. While, as we mentioned before,
breakpoints inserted into it are ignored during execution, there are several
workarounds to this limitation (one of them is sketched below). In our next
article, we will describe in more detail the characteristics and functionality
of this component, as well as how its features can be leveraged to provide
more insightful debugging information.
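To illustrate the Row Count technique described above, here is a minimal
sketch of a Script Task (written, like the rest of this series, in Visual
Basic .NET) connected via a precedence constraint to the Data Flow task that
hosts the Row Count component. The variable name RowCount and the event log
source are our own placeholders – adjust them to match your package – and the
Application Log entry follows the general approach outlined in KB 906560.

    ' Sketch of the Main method of a Script Task placed after the Data Flow
    ' task. Assumes a package variable named "RowCount" (a placeholder) has
    ' been created in the Variables window, assigned to the Row Count
    ' component, and listed in this task's ReadOnlyVariables property.
    Public Sub Main()

        ' The variable receives its value once the Data Flow task completes.
        Dim rowCount As Integer = CType(Dts.Variables("RowCount").Value, Integer)

        ' Surface the value in the Progress and Output windows...
        Dim fireAgain As Boolean = True
        Dts.Events.FireInformation(0, "Row Count check", _
            "Rows processed by the Data Flow: " & rowCount.ToString(), _
            String.Empty, 0, fireAgain)

        ' ...and record it in the Windows Application Log.
        System.Diagnostics.EventLog.WriteEntry("SSIS Package", _
            "Rows processed by the Data Flow: " & rowCount.ToString(), _
            System.Diagnostics.EventLogEntryType.Information)

        Dts.TaskResult = Dts.Results.Success

    End Sub

Since, as noted above, the variable is assigned only after the Data Flow task
finishes, the Script Task must follow it in the Control Flow rather than
attempt to read the value while the Data Flow is still running.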
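Similarly, although breakpoints set within the Script Component are ignored,
one simple workaround is to raise information events from its code – their
text appears in the Progress and Output windows as rows are processed. The
sketch below assumes a transformation-type Script Component whose input
retains the default name (Input 0) and includes a CustomerID column; both are
hypothetical and should be replaced with your actual input and column names.

    ' The method name and the Input0Buffer type are generated by the designer
    ' from the input name; CustomerID is a hypothetical input column.
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

        ' Raise an information event for each processed row - a simple
        ' substitute for a breakpoint.
        Dim fireAgain As Boolean = True
        Me.ComponentMetaData.FireInformation(0, "Script Component", _
            "Processing row with CustomerID: " & Row.CustomerID.ToString(), _
            String.Empty, 0, fireAgain)

    End Sub

Keep in mind that firing an event for every row can slow down execution
considerably with larger data sets, so tracing code of this kind is best
removed (or made conditional) once debugging is complete.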