Introduction to Control Flow and Data Flow of SSIS 2008
March 22, 2010
Learn about two primary components of SQL Server 2008 Integration Services packages and their graphical representation in the Business Intelligence Development Studio
In our most recent article dedicated to the subject of Integration Services in SQL Server 2008 and 2008 R2, we described the process of executing a sample package created using Export and Import Wizard, which copied content of a view in the AdventureWorksDW database to a flat file. As part of this presentation, we imported it into a new project (based on the Integration Services Project template) within Business Intelligence Development Studio. Since our primary purpose for using this utility is designing SQL Server 2008 Integration Services-based packages, we will now focus our attention on this aspect of its functionality.
As we start exploring this subject, it is important to clarify relatively common confusion regarding two of the most fundamental concepts that deal with package design - namely control flow and data flow. Control flow deals with orderly processing of tasks, which are individual, isolated units of work that perform a specific action ending with a finite outcome (such that can be evaluated as either Success, Failure, or Completion). While their sequence can be customized by linking them into arbitrary arrangements with precedence constraints and grouping them together or repeating their execution in a loop with the help of containers, a subsequent task does not initiate unless its predecessor has completed. Data flow, on the other hand, handles its processing responsibilities by employing the pipeline paradigm, carrying data record by record (or rather, to be more accurate, memory buffer by memory buffer) from its source to a destination and modifying it in transit by applying transformations. (There are exceptions to this rule, since some of them, such as Sort or Aggregate require the ability to view the entire data set before handing it over to their downstream counterparts). Note that this does not imply that tasks cannot be executed in parallel, but rather that if they do, their actions are not coordinated (unlike processing of data flow components that are part of the same data stream). Another distinction between them is the absence of a mechanism that would allow direct transfer of data between individual control flow tasks. On the other hand, data flow lacks nesting capabilities provided by containers.
These two SQL Server Integration Services features are implemented in the form of two tabs (bearing their respective names) of the Designer interface in the Business Intelligence Development Studio. The control flow portion of a package is constructed by populating the area exposed by the first of these tabs, typically by dragging tasks and containers delivering desired functionality from the Toolbox. The same methodology is applied when adding data sources, destinations, and transformation to the area exposed by the Data Flow tab (with Toolbox adjusting its content depending on the context).
If you open our sample package created using Import and Export Wizard, which we imported into Business Intelligence Development Studio and display its logical layout by selecting the View Designer entry in its context sensitive menu, you will notice that the Control Flow tab contains only a single item labeled Data Flow Task 1. Effectively, the package consists only of a single Data Flow task, whose content you can view by double-clicking on it. This action will automatically activate the Data Flow tab, exposing, in this case, a single source (representing the vTargetMail view in AdventureWorks database) and a target (designating the text file to which we load extracted data), connected via OLE DB Source Output Data Flow Path.
Let's analyze characteristics of these three components in more detail. Display the OLE DB Source Editor dialog box by selecting the Edit... option from the context-sensitive menu of our data flow source. Its first section, labeled Connection Manager, identifies data origin and the method of accessing it, which in our case, relies on the [dbo].[vTargetMail] view and references the SourceConnectionOLEDB entry appearing on the Connection Managers tab (located at the bottom of the Designer pane). The second section, titled Columns, provides a listing of Available External Columns (those defined in the [dbo].[vTargetMail] view), allowing you to selectively exclude them (if desired), as well as alter default mappings to their respective output columns (which you also have the ability to rename). In order to fully realize the significance of this option, it is important to realize that (as we pointed out earlier) data flow operates by streaming records between two (or more) endpoints. In our example, they originate from a view in the AdventureWorks database, represented by External Columns. The responsibility of a source is to extract this external data and pass it along via its Output Columns to Input Columns of a downstream component (such as a transformation or a destination). This implies that a source must have at least one output, a destination needs at least one input, and a transformation requires at minimum one input and one output. You can verify this by reviewing the Editor interface of our Flat File Destination component, whose second section, labeled Mappings, represents the relationship between Input Columns and Destination Columns.
Another way to view this particular functionality involves the use of Advanced Editor, accessible for all components by selecting the Show Advanced Editor... entry from their context sensitive menu. The resulting dialog box is divided into four tabs, labeled Connection Managers (listing all of its connection managers), Component Properties (which mirrors the content of the Properties window), Column Mappings (equivalent to Columns section of the Editor dialog box we described earlier), and Input and Output Properties. By referencing that last tab, you will be able to quickly evaluate the number of inputs and outputs, their columns, and identify individual characteristics of each.
Note that the majority of components also contain an additional output intended for capturing errors, which in our case is configurable via the Error Output section in the Editor dialog box of the OLE DB Source component (and available for viewing via OLE DB Source Error Output entry on the Input and Output Properties tab of its Advanced Editor). Its interface gives you the ability to designate one of three actions (Ignore failure, Redirect row, or Fail component) that will be triggered in case of an Error or Truncation for each of the input or output columns.
Our source and destination are linked by an OLE DB Source Output Data Flow Path. Its Editor interface is divided into three tabs. The first of them, labeled General, gives you access to all the path's properties. The second, named Metadata, enumerates the columns that data flowing between adjacent components comprises of, listing values of Name, Data type, Precision, Scale, Length, Code Page, Sort Key Position, Comparison Flags, and Source Component properties for each (which provides valuable information in case you run into problems caused by mismatches between source and destination columns). The last of them, called Data Viewers, is also extremely useful in troubleshooting or testing data flow by allowing you to examine its content in transit. The viewer opens at runtime and displays passing records one buffer at a time, halting execution temporarily until you decide to proceed to the next one (it is also possible to copy its content to Clipboard or detach the viewer, letting the transfer complete without further interruptions). You can choose between four viewer types including:
To test this functionality, use the Add... command button in the Data Viewers section of Data Flow Path Editor. In the resulting Configure Data Viewer window, select the desired viewer type and configure its properties (by picking columns you want to monitor).
Our package can be launched either within the programming environment provided by Business Intelligence Development Studio or outside of it. The latter of these two choices requires use of the Start Without Debugging entry in the Debug menu (equivalent to the Ctrl+F5 key combination). The former can be carried out by pressing the F5 key, clicking on the button with a green triangle in the Standard toolbar, selecting the Start Debugging entry in the Debug menu, or activating the Execute Package option in the context-sensitive menu of the package entry in the Solution Explorer window. This method will provide graphical clues representing current execution status, with a yellow background of active tasks and components indicating operation in progress, green signifying successful completion, and red designating a failure. In addition, you should see a dynamically generated label appearing on the upper edge of the destination component on the Data Flow tab, which represents the total number of processed rows. You also have an option to view a detailed description of individual steps that are taking place throughout package execution by switching to the Progress tab. At this point, you can return to the design interface by selecting the Stop Debugging entry from the Debug menu, pressing the Shift+F5 key combination, or clicking on the button with a blue rectangle in the Debug toolbar (note that this will automatically change the label of the last tab of the Designer interface from Progress to Execution Results).
Clearly, our sample package has a very rudimentary structure, reflecting it wizard-based origin and rather simplistic purpose. As we have demonstrated, while our current approach to package authoring is quick and straightforward, it lacks flexibility and is rarely sufficient to carry out more complex ETL tasks. In our next article, we will focus on more elaborate scenarios (starting with importing and exporting Excel-based data) and the unique challenges they introduce.
MSDN Introducing Business Intelligence Development Studio