Learn about two primary components of SQL Server 2008 Integration Services packages and their graphical representation in the Business Intelligence Development Studio
In our most recent article dedicated to
the subject of Integration Services in SQL Server 2008 and 2008 R2, we
described the
process of executing a sample package created using Export and Import Wizard, which
copied content of a view in the AdventureWorksDW database to a flat file. As part of
this presentation, we imported it into a new project (based on the Integration Services Project template)
within Business
Intelligence Development Studio. Since our primary purpose for using this utility
is designing SQL Server 2008 Integration Services-based packages, we will now
focus our attention on this aspect of its functionality.
As we start exploring this subject, it
is important to clarify relatively common confusion regarding two of the most
fundamental concepts that deal with package design – namely control flow
and data flow. Control flow deals with orderly processing of tasks, which are
individual, isolated units of work that perform a specific action ending with a
finite outcome (such that can be evaluated as either Success, Failure, or Completion). While their
sequence can be customized by linking them into arbitrary arrangements with precedence constraints and grouping
them together or repeating their execution in a loop with the help of containers, a subsequent
task does not initiate unless its predecessor has completed. Data flow, on the
other hand, handles its processing responsibilities by employing the pipeline
paradigm, carrying data record by record (or rather, to be more accurate,
memory buffer by memory buffer) from its source to a destination and modifying it in transit by
applying transformations. (There are
exceptions to this rule, since some of them, such as Sort or Aggregate require the
ability to view the entire data set before handing it over to their downstream
counterparts). Note that this does not imply that tasks cannot be executed in
parallel, but rather that if they do, their actions are not coordinated (unlike
processing of data flow components that are part of the same data stream).
Another distinction between them is the absence of a mechanism that would allow
direct transfer of data between individual control flow tasks. On the other
hand, data flow lacks nesting capabilities provided by containers.
These two SQL Server Integration Services
features are implemented in the form of two tabs (bearing their respective
names) of the Designer interface in
the Business
Intelligence Development Studio. The control flow portion of a package is
constructed by populating the area exposed by the first of these tabs,
typically by dragging tasks and containers delivering desired functionality
from the Toolbox. The same
methodology is applied when adding data sources, destinations, and transformation to the area exposed by the Data Flow tab (with Toolbox adjusting its
content depending on the context).
If you open our sample package created
using Import
and Export Wizard,
which we imported into Business
Intelligence Development Studio and display its logical layout by selecting the View Designer entry in its
context sensitive menu, you will notice that the Control Flow tab contains only a single item
labeled Data
Flow Task 1.
Effectively, the package consists only of a single Data Flow task, whose content
you can view by double-clicking on it. This action will automatically activate
the Data Flow tab, exposing,
in this case, a single source (representing the vTargetMail view in AdventureWorks database) and a
target (designating the text file to which we load extracted data), connected
via OLE DB
Source Output Data Flow Path.
Let’s analyze characteristics of these
three components in more detail. Display the OLE DB Source Editor dialog box by
selecting the Edit… option from the
context-sensitive menu of our data flow source. Its first section, labeled Connection Manager, identifies
data origin and the method of accessing it, which in our case, relies on the [dbo].[vTargetMail] view and references
the SourceConnectionOLEDB entry appearing
on the Connection
Managers
tab (located at the bottom of the Designer pane). The second section, titled Columns, provides a
listing of Available
External Columns
(those defined in the [dbo].[vTargetMail] view), allowing
you to selectively exclude them (if desired), as well as alter default mappings
to their respective output columns (which you also have the ability to rename).
In order to fully realize the significance of this option, it is important to
realize that (as we pointed out earlier) data flow operates by streaming
records between two (or more) endpoints. In our example, they originate from a
view in the AdventureWorks database,
represented by External
Columns.
The responsibility of a source is to extract
this external data and pass it along via its Output Columns to Input Columns of a downstream
component (such as a transformation or a destination). This implies
that a source must have at least one output, a destination needs at least one
input, and a transformation requires at minimum one input and one output. You
can verify this by reviewing the Editor interface of our Flat File Destination component, whose
second section, labeled Mappings, represents the
relationship between Input
Columns
and Destination
Columns.
Another way to view this particular
functionality involves the use of Advanced Editor, accessible for all components by
selecting the Show
Advanced Editor…
entry from their context sensitive menu. The resulting dialog box is divided
into four tabs, labeled Connection
Managers
(listing all of its connection managers), Component Properties (which mirrors the content of
the Properties window), Column Mappings (equivalent to Columns section of the Editor dialog box we
described earlier), and Input
and Output Properties.
By referencing that last tab, you will be able to quickly evaluate the number
of inputs and outputs, their columns, and identify individual characteristics
of each.
Note that the majority of components
also contain an additional output intended for capturing errors, which in our
case is configurable via the Error Output section in the Editor dialog box of the OLE DB Source component (and
available for viewing via OLE DB Source Error Output entry on the Input and Output Properties tab of its Advanced Editor). Its interface
gives you the ability to designate one of three actions (Ignore failure, Redirect row, or Fail component) that will be
triggered in case of an Error or Truncation for each of the
input or output columns.
Our source and destination are linked by an OLE DB Source Output Data
Flow Path.
Its Editor interface is
divided into three tabs. The first of them, labeled General, gives you
access to all the path’s properties. The second, named Metadata, enumerates the columns
that data flowing between adjacent components comprises of, listing values of Name, Data type, Precision, Scale, Length, Code Page, Sort Key Position, Comparison Flags, and Source Component properties for
each (which provides valuable information in case you run into problems caused
by mismatches between source and destination columns). The last of them, called
Data Viewers, is also
extremely useful in troubleshooting or testing data flow by allowing you to
examine its content in transit. The viewer opens at runtime and displays
passing records one buffer at a time, halting execution temporarily until you
decide to proceed to the next one (it is also possible to copy its content to Clipboard or detach the
viewer, letting the transfer complete without further interruptions). You can
choose between four viewer types including:
- Grid (presenting columns you specify in
a tabular format) - Histogram (intended for viewing distribution
of numeric values only) - Scatter Plot (x,y) (using a
pair of numeric columns to express their relationship in x and y axis
layout) - Column Chart (calculating occurrences of
identical values within a selected column and displaying them side by
side).
To test this functionality, use the Add… command button
in the Data
Viewers
section of Data
Flow Path Editor.
In the resulting Configure
Data Viewer
window, select the desired viewer type and configure its properties (by picking
columns you want to monitor).
Our package can be launched either
within the programming environment provided by Business Intelligence Development Studio or outside of
it. The latter of these two choices requires use of the Start Without Debugging entry in the Debug menu (equivalent
to the Ctrl+F5 key
combination). The former can be carried out by pressing the F5 key, clicking
on the button with a green triangle in the Standard toolbar, selecting the Start Debugging entry in the Debug menu, or
activating the Execute
Package
option in the context-sensitive menu of the package entry in the Solution Explorer window. This
method will provide graphical clues representing current execution status, with
a yellow background of active tasks and components indicating operation in
progress, green signifying successful completion, and red designating a
failure. In addition, you should see a dynamically generated label appearing on
the upper edge of the destination component on
the Data Flow tab, which
represents the total number of processed rows. You also have an option to view a
detailed description of individual steps that are taking place throughout
package execution by switching to the Progress tab. At this point, you can return to the
design interface by selecting the Stop Debugging entry from the Debug menu, pressing
the Shift+F5 key
combination, or clicking on the button with a blue rectangle in the Debug toolbar (note
that this will automatically change the label of the last tab of the Designer interface from Progress to Execution Results).
Clearly, our sample package has a very
rudimentary structure, reflecting it wizard-based origin and rather simplistic
purpose. As we have demonstrated, while our current approach to package
authoring is quick and straightforward, it lacks flexibility and is rarely
sufficient to carry out more complex ETL tasks. In our next article, we will
focus on more elaborate scenarios (starting with importing and exporting
Excel-based data) and the unique challenges they introduce.
Additional Resources
MSDN Introducing
Business Intelligence Development Studio