Introducing SQL Server 2008 and 2008 R2 Integration Services

SQL Server platform comprises of several distinct, highly efficient
components, which include Database Engine (with its Service Broker and
Replication mechanisms), Analysis Services (incorporating Data Mining and
Multi-dimensional Data methodologies), and Reporting Services. All of them are
not only highly optimized for the purpose of accomplishing their intended
goals, but also capable of interacting with each other. While some of this interaction
is inherent to their individual design, its most powerful features are
incorporated into the fourth main component, known as SQL Server Integration
Services (SSIS). In addition to providing infrastructure for exchanging and
manipulating data internal to SQL Server, SSIS also excels in performing
equivalent tasks when dealing with wide variety of external data stores. The
goal of our article is to describe its general characteristics.

SQL Server Integration Services technology is frequently described using the
term Extraction, Transformation, and Loading (ETL), which conveniently
summarizes its primary features, although it hardly reflects their breadth and
versatility. In particular, the level of flexibility and customization that
became available in SQL Server 2008 has been enhanced by tighter integration
with Visual Studio, including full support for .NET programming languages (both
VB.NET and C#), as well as project-based code management and versioning. At the
same rate, however, the need for custom development has been minimized by
extending the selection of built-in features that can be employed to deliver
desired functionality.

The current version of the product constitutes a major departure from its
original implementation introduced in SQL Server 7.0, known as Data
Transformation Services (for more information about DTS, refer to our
earlier series of articles
). The first surge of significant architectural
and functional improvements (which you can read about in our comprehensive
coverage of this subject on the Database
Journal Web site
) has been applied to the SQL Server 2005 release,
justifying rebranding DTS to its current name. That moment also marked the
inception of Business Intelligence Development Studio as the primary tool for
managing ETL functionality. Its interface, consistent with the Integrated
Development Environment (IDE) of Visual Studio, was a clear indication of the
shift in the approach to database design and management, signifying the need
for increased programming knowledge among database administrators. While the
latest renditions (especially the R2 release) do not bring changes as
revolutionary as those characterizing its predecessor, there are numerous
innovations worthy of our attention that will be covered in our upcoming
articles (one of them is the introduction of Visual Studio Tools for
Applications, which further simplifies the Microsoft.NET Framework-based
development). More importantly, SSIS further strengthens its position as one of
the primary foundations of Business Intelligence, delivering a powerful
framework for solutions that combine data from disparate sources, facilitating
its analysis and reporting. Focus on the Self-Service aspect of the product
empowers end-users, not only simplifying creation of such solutions but also
lessening the burden carried commonly by Information Technology departments.

In order to gain an understanding of the principles governing the behavior
of SQL Server Integration Services, we first need to briefly describe its
elementary concepts. As we have already pointed out, the underlying purpose of
this technology is to deliver a combination of data extraction, transformation,
and loading activities. The corresponding work is encapsulated into a software
construct called a package.
Effectively, such package
becomes a collection of SSIS elements that can be (as a whole) independently
executed, saved, or retrieved (its implementation takes the form of an XML-formatted .DTSX file, which is stored either
internally within SQL Server or externally, as part of the operating
system-managed file system). In general, objects included within a package
belong to one of two categories.

The first of these categories, referred to as Control Flow, dictates the sequence of execution of
specialized processing units called tasks
(each of them is responsible for performing a designated step, such as running
a T-SQL command, launching an external process, handling a file system
operation, or invoking a .NET-based code). Tasks are interlinked using precedence constraints, which define
package workflow by determining a downstream task (such that a resulting action
depends on the outcome of an upstream task or on the value of an arbitrary,
dynamically evaluated expression). Tasks can be grouped using containers, which provide additional flexibility
in customizing the processing sequence (for example, by applying For or ForEach loop iterations). It is also possible to further
modify behavior of elements enclosed within a package by taking advantage of
variables, whose values are assigned either prior or during its execution,
(variables also facilitate package reusability and simplify its transfer across
different environments by providing a single location where configuration
changes that need to be applied during such transfers are stored).

While each task has a unique role in covering a full spectrum of extraction,
transformation, and loading activities, Data
Flow Task
stands out in regard to its significance and popularity.
(This dominance is reflected by its distinct visual representation in the
Business Intelligence Development Studio interface, which we will describe in
detail in our upcoming articles). This task represents the second category of
package objects, known as Data Flow.
Its primary purpose is to manipulate data being transferred between its source and destination by applying any number of transformations (such as aggregation,
duplication, grouping, lookups, splits, merges, or samplings, to mention just a
few). Definitions of source
and destination data stores
rely on Connection Managers,
available to both Control Flow
and Data Flow components,
and facilitate access to a wide range of data formats (including those defined
using ADO.NET and OLE DB providers). If desired, it is possible to hide
complexities of underlying storage (simplifying its analysis) by employing data source views. Flexible error and event handling, as well as transactional support increase resiliency and robustness
of package execution. Troubleshooting can be considerably simplified by
employing detailed and highly customizable debugging
and logging mechanisms.

The dual nature of SSIS packages is reflected by the similarly structured
architecture of underlying system components. More specifically, SSIS divides
its responsibilities between the runtime engine (which implements control flow
functionality with such features as transactional support, debugging, event
handling, or logging), and the data flow engine (which facilitates activities
defined within Data Flow Tasks,
covering each of their stages, from extraction through transformation to
loading). Operating system-level service (labeled SQL Server Integration Services 10.0) deals with storage
of packages and monitors their overall execution status.

While each edition of SQL Server 2008 and 2008 R2 is capable of executing
SSIS packages (thanks to the presence of Integration Services runtime), the
level of support for more advanced ETL-related functionality varies. In
particular, from the package authoring perspective, Workgroup, Web, and Express
Editions are limited to the interactive Import
and Export Wizard
with its basic choice of sources,
transformations, and destinations, but without access to Business Intelligence
Development Studio with its Designer interface. Both SQL Server Standard and
Enterprise Editions offer fully functional Integrated Development Environment
(exposed via Business Intelligence Development Studio and its Designer
interface) and matching performance capabilities without artificially imposed
throttling (performance boundaries are imposed, however, by the 4 processor
limit of Standard Edition and restrictions inherent to the operating system on
which the SQL Server instance is hosted). Either of them will let you
experience the most interesting innovations introduced in the new platform,
including the Data Profiling
task (facilitating capturing and analyzing data processed by SSIS packages) or Cache Transform transformation, which
delivers performance improvements during lookup operations. The primary
distinction between the two products is support for a number of more
sophisticated Data Flow Task
transformations (Data Mining Query,
Fuzzy Grouping, Fuzzy Lookup, Term Extraction, or Term Lookup) and destinations (Partition Processing, Dimension Processing, or Data Mining Model Training), as well as Data Mining Query Task, which are
available only in the Enterprise Edition.

In our next article, we will take a closer look at the initial setup and
configuration of Integration Services in both SQL Server 2008 and 2008 R2 (note
that the latter is based on the November 2009 Community Technology Preview).

»


See All Articles by Columnist
Marcin Policht

Marcin Policht
Marcin Policht
Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Latest Articles