SQL Server 2005 – SQL Server Integration Services – Part 14 – Debugging Data Flow

In the previous installment of our series, we discussed various approaches
to debugging Control Flow components of SQL Server 2005 Integration Services.
While this satisfied our primary goal of facilitating real-time monitoring and
troubleshooting of the package workflow, it is important to note that majority
of the techniques presented so far are not applicable when dealing with data
manipulation, which takes place within the Data Flow portion of the SSIS
packages (which separate design and implementation of control and data flows –
unlike SQL Server 7.0/2000 Data Transformation Services). The purpose of this
article is to demonstrate debugging methods, which are relevant within the
context of SSIS Data Flow.

We will start with a quick overview of its characteristics, focusing on the ones
that can be used for debugging purposes. The Data Flow interface within SSIS
Designer of SQL Server 2005 Business Intelligence Development Studio features
four main types of components:

  • source components – serve as initial elements in the flow of
    data, extracting it from a variety of sources such as Data Reader (consuming
    data from .NET provider via ADO.NET-based connection manager), Excel (utilizing
    Excel connection manager to access data located in Microsoft Excel worksheets),
    Flat File (deriving data from delimited or fixed width text files using Flat
    File connection manager), Raw File (intended for data formatted by Raw File
    destination component and, therefore, requiring very limited processing), OLE
    DB (offering access to various OLE DB compliant data sources via OLE DB
    connection manager), or XML (capable of reading XML-formatted data files).

  • destination components – function as outputs in the flow of data,
    delivering it to a variety of data store types. List of possible destinations
    is more extensive than sources (although it does not include XML files),
    providing, for example, the ability to process data using Analysis Services
    techniques (such as Data Mining Model Training, Dimension Processing, or
    Partition Processing).

  • transformation components – perform modifications of data during
    its transition between source and destination components. Such changes might
    involve, for example, combining data from two (or more) sources, reversing this
    process by splitting it across multiple outputs, or aggregate and sort
    operations. In general, transformations belong to one of two categories –
    synchronous, for which modification of each individual input row results in
    exactly one output row (such as sorts) – and asynchronous, for which the number
    of input and output rows are different (such as aggregates).

  • pipeline paths – connect the output of one Data Flow component
    (source or transformation) to the input of another (transformation or
    destination). Even though they might seem equivalent to precedence constraints,
    which provide an interface between a pair of Control Flow components, pipeline
    paths do not impose constraints, but simply channel data flow.

The Data Flow tab of the SSIS Designer shows the content of an individual
Data Flow task, which constitutes an execution unit within the Control Flow
(note that while an SSIS package has a single control flow, it can have any
number of Data Tasks). Each of them is managed independently (you display the
one you intend to work with by either double clicking on its graphical
representation in the Control Flow area or by selecting Edit item from its
context sensitive menu). To populate the Data Flow area, you simply drag and
drop items from the Toolbox (which lists changes as soon as you switch away
from the Control Flow tab). The list is divided into three categories –
Sources, Transformations, and Destinations, which we presented earlier in this
article. Within the second of these categories there is a Script Component,
roughly equivalent to the Script Task, but with a fair share of unique
characteristics.

If you review list of items available in the context sensitive menus of any
of these components, you will notice that the Edit Breakpoints… option is not
available. This is a clear indication that event based breakpoints, so useful
for debugging Control Flow, did not find their way into the implementation of
Data Flow. This apparently intentional design decision is likely the result of
the fact that the majority of processing in this case is data driven, which
switches the focus of debugging away from events. Breakpoints are also ignored
when executing the Script Component code in the debugging mode, even though it
is possible to create them by following the procedure that we described when
discussing Script Task (since both of them utilize Microsoft Visual Studio for
Applications for creating and modifying Visual Basic .NET code). There are,
however, several alternatives, which compensate for lack of this functionality:

  • color-coded component status indicators – are the simplest one to
    review and analyze and, in addition, are consistent with the ones associated
    with the Data Flow component execution status, with white (indicating that the
    component is idle, waiting to be invoked), yellow (designating running phase),
    green (marking successful completion), or red (signaling that errors were
    encountered during execution).

  • pipeline path data record counter – provides a quick and
    convenient way to quantify the status of data processing by displaying the total
    number of data records transferred along pipeline paths.

  • Data Viewers – assigned to pipeline paths between two components,
    to allow you to view the content of records currently passing through them. (This
    way you can review data directly after it is extracted from its source, before
    and after it is being processed by a transformation component, or directly
    before it is stored in its final destination). Viewers are created from within the
    Data Flow Path Editor window (which you can activate by either selecting Data
    Viewers or the Edit item from the context sensitive menu of the data flow path
    that you intend to debug). Once the Editor is displayed, with Data Viewers
    section active, click on the Add… command button to display the tabbed window
    of the Configure Data Viewer. On the General tab, you can select one of four
    different types of viewing formats – Grid (with data displayed in a tabular
    format, including all rows and arbitrarily chosen columns), Histogram (with all
    values from a designated column displayed as a histogram – simplifying visual
    analysis of their distribution), Scatter Plot (with data values from two
    columns plotted on the X and Y axis, respectively – allowing comparing pairs of
    numerical values) or Column Chart (with counts of individual values from a
    selected data column appearing as separate columns in a chart). The label and
    content of the second tab of the Configure Data Viewer window changes according
    to the type of viewer currently selected on the General tab, giving you the
    ability to designate a column (or columns) to be included in a grid, histogram,
    plot, or chart. Obviously your choice will largely depend on the type of data
    being processed (using scatter plot or histogram might not make sense in many
    situations) – but note that you can assign multiple data viewers to the same
    pipeline path if you desire to do so (using any variation of data columns for
    each). After a Data Viewer is created, its visual indicator (a pair of glasses
    with a grid behind it) appears next to the data path to which it has been
    associated.

    Following the package launch (triggered using the
    Start Debugging option from the Debug menu or by pressing the F5 function key),
    once the data flow reaches the point where the Data Viewer component has been
    added, a window with the configured viewer type (or multiple windows, if
    several different viewer types have been configured) will be automatically
    displayed within the Designer interface, one buffer at the time (a number of
    rows per buffer depends on their size). From within the Data Viewer window, you
    can click on the Continue button to load another buffer of data records, use the
    Detach button to proceed without further interruptions, or copy data to the
    Clipboard with the Copy Data button. Based on the statistics provided in the
    status bar, you can determine the number of rows and buffers that passed the
    data flow path.

  • Progress pane of the Designer interface (which is automatically
    renamed to "Execution Results" as soon as the execution is
    completed), Output window (which you can activate from the Windows submenu of
    the Debug menu of the SSIS Designer interface), as well as other windows we
    described in our previous article (such as Locals, Watch, Command Window,
    Immediate Window, or Breakpoints) are also available and offer additional
    information about the Data Flow execution. Note that Data Flow Task is
    referenced in messages appearing in the Output and Progress windows as DTS.Pipeline.
    Interestingly, the Breakpoint window lists all of the data viewers you created.

  • Row Count Data Flow Transformation component – provides the ability
    to obtain the count of rows processed within Data Flow. While this is
    essentially the same information that is already available based on the
    pipeline path data record counter (as well as being included in Data Viewers),
    the Row Count component offers two additional advantages. First of all, it
    stores the row count in a variable (you define it using an already familiar
    method – using Variables window within the SQL Server 2005 Business
    Intelligence Development Studio – and assign it to the Row Count component from
    the Component Properties tab of its Advanced Editor interface) that can be
    subsequently used within Control Flow or other Data Flow tasks. (Note, however,
    that the variable that stores the value calculated by the Row Count Data Flow
    Transformation component gets assigned its value after its Data Flow task has
    completed – this general rule applies to all SSIS variables which value is
    affected by execution of Data Flow). Furthermore, you can use methods described
    in our previous article to record it or follow steps described in the
    Microsoft Knowledge Base article KB 906560
    to write it to the Windows
    Application Log. The other benefit of the Row Count component is its ability to
    serve as a temporary target for data during design and testing stages (despite
    the fact that it is not listed in the Destination components section of the
    Toolbox).

  • Error Output of Data Flow components – The majority of data flow
    components (with a few exceptions of sources such as the Data Reader, transformations
    such as the Import Column or the OLE DB Command, or destinations such as SQL
    Server Mobile) support the error output feature (represented graphically by the
    red arrow pointing away from them), which allows customizing the execution
    behavior when an error is encountered. This might happen in the process of
    extracting data from its source, transforming its content (including data
    conversion or truncation), or loading to its destination. Error Output takes
    into consideration two types of situations involving data related issues –
    truncation (resulting in partial data loss, which in some circumstances might
    be expected or even desired) and errors (indicating typically a failure caused
    by incorrect or missing data as well as mismatched conversion operation). From
    the Error Output section (or Configure Error Output dialog box) of the
    respective component’s Editor interface, you can specify for each data column
    what action should result, in case an error and truncation takes place. Your
    choices include Fail Component (the default option), Ignore Failure (delivering
    data to the regular output of transformation or source), or Redirect Row
    (redirecting problematic rows to error output – which can either lead to one of
    the Data Flow destinations or be further processed by one of the transformation
    components).

  • Data Flow Transformation Script Component – equivalent of the
    Script Task in the Control Flow, gives you the ability to write Visual Basic
    .NET code executing in the context of Data Flow, serving as a custom source, destination
    or transformation component. While, as we mentioned before, breakpoints
    inserted into it are ignored during execution, there are several workarounds to
    this limitation. In our next article, we will describe in more detail the
    characteristics and functionality of this component, as well as how its
    features can be leveraged to provide more insightful debugging information.

»


See All Articles by Columnist
Marcin Policht

Marcin Policht
Marcin Policht
Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Latest Articles