Database Journal
Posted Jan 25, 2010

Introducing SQL Server 2008 and 2008 R2 Integration Services

By Marcin Policht

The SQL Server platform comprises several distinct, highly efficient components, including the Database Engine (with its Service Broker and Replication mechanisms), Analysis Services (incorporating Data Mining and multidimensional data methodologies), and Reporting Services. All of them are not only highly optimized for accomplishing their intended goals, but also capable of interacting with each other. While some of this interaction is inherent to their individual design, its most powerful features are incorporated into the fourth main component, known as SQL Server Integration Services (SSIS). In addition to providing infrastructure for exchanging and manipulating data internal to SQL Server, SSIS also excels at performing equivalent tasks against a wide variety of external data stores. The goal of our article is to describe its general characteristics.

SQL Server Integration Services technology is frequently described using the term Extraction, Transformation, and Loading (ETL), which conveniently summarizes its primary features, although it hardly reflects their breadth and versatility. In particular, the level of flexibility and customization available in SQL Server 2008 has been enhanced by tighter integration with Visual Studio, including full support for .NET programming languages (both VB.NET and C#), as well as project-based code management and versioning. At the same time, however, the need for custom development has been minimized by an extended selection of built-in features that can be employed to deliver the desired functionality.

The current version of the product constitutes a major departure from its original implementation, introduced in SQL Server 7.0 and known as Data Transformation Services (for more information about DTS, refer to our earlier series of articles). The first surge of significant architectural and functional improvements (which you can read about in our comprehensive coverage of this subject on the Database Journal Web site) was applied to the SQL Server 2005 release, justifying the rebranding of DTS to its current name. That moment also marked the inception of Business Intelligence Development Studio as the primary tool for managing ETL functionality. Its interface, consistent with the Integrated Development Environment (IDE) of Visual Studio, was a clear indication of a shift in the approach to database design and management, signifying the need for increased programming knowledge among database administrators. While the latest renditions (especially the R2 release) do not bring changes as revolutionary as those that characterized the SQL Server 2005 release, there are numerous innovations worthy of our attention that will be covered in our upcoming articles (one of them is the introduction of Visual Studio Tools for Applications, which further simplifies Microsoft .NET Framework-based development). More importantly, SSIS further strengthens its position as one of the primary foundations of Business Intelligence, delivering a powerful framework for solutions that combine data from disparate sources, facilitating their analysis and reporting. A focus on the self-service aspects of the product empowers end users, not only simplifying the creation of such solutions but also lessening the burden commonly carried by Information Technology departments.

In order to gain an understanding of the principles governing the behavior of SQL Server Integration Services, we first need to briefly describe its elementary concepts. As we have already pointed out, the underlying purpose of this technology is to deliver a combination of data extraction, transformation, and loading activities. The corresponding work is encapsulated in a software construct called a package. Effectively, such a package becomes a collection of SSIS elements that can be (as a whole) independently executed, saved, or retrieved (its implementation takes the form of an XML-formatted .DTSX file, which is stored either internally within SQL Server or externally, as part of the operating system-managed file system). In general, objects included within a package belong to one of two categories.
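To give a sense of what this persisted form looks like, below is a heavily abridged sketch of a .DTSX file. Real packages contain many more elements and properties, and the object names shown here are invented for illustration:

```xml
<?xml version="1.0"?>
<!-- Simplified sketch of a .DTSX package file; not a complete, loadable package -->
<DTS:Executable xmlns:DTS="www.microsoft.com/SqlServer/Dts">
  <DTS:Property DTS:Name="ObjectName">SamplePackage</DTS:Property>
  <!-- Each nested Executable is a Control Flow task -->
  <DTS:Executable>
    <DTS:Property DTS:Name="ObjectName">Truncate staging table</DTS:Property>
  </DTS:Executable>
</DTS:Executable>
```

Because the format is plain XML, packages can be inspected, diffed, and version-controlled like any other source file.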

The first of these categories, referred to as Control Flow, dictates the sequence of execution of specialized processing units called tasks (each of them is responsible for performing a designated step, such as running a T-SQL command, launching an external process, handling a file system operation, or invoking .NET-based code). Tasks are interlinked using precedence constraints, which define the package workflow by determining whether a downstream task runs, based on the outcome of an upstream task or on the value of an arbitrary, dynamically evaluated expression. Tasks can be grouped using containers, which provide additional flexibility in customizing the processing sequence (for example, by applying For or ForEach loop iterations). It is also possible to further modify the behavior of elements enclosed within a package by taking advantage of variables, whose values are assigned either prior to or during its execution (variables also facilitate package reusability and simplify its transfer across different environments by providing a single location for the configuration changes that need to be applied during such transfers).
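The interplay of tasks and precedence constraints can be sketched in plain Python (this is a conceptual model, not the SSIS API; the task names and the `run_package` helper are invented for illustration). A task runs only once every upstream constraint pointing at it is satisfied:

```python
# Conceptual sketch of Control Flow: tasks run in dependency order,
# gated by precedence constraints on upstream outcomes.

def run_package(tasks, constraints):
    """tasks: dict of name -> callable returning True (success) or False (failure).
    constraints: list of (upstream, downstream, required_outcome) tuples."""
    outcomes = {}
    done = set()
    progress = True
    while progress:
        progress = False
        for name, action in tasks.items():
            if name in done:
                continue
            # Gather constraints whose downstream side is this task
            preds = [(up, req) for up, down, req in constraints if down == name]
            # Run only when every upstream task finished with the required outcome
            if all(up in done and outcomes[up] == req for up, req in preds):
                outcomes[name] = action()
                done.add(name)
                progress = True
    return outcomes

results = run_package(
    {"truncate_staging": lambda: True,
     "load_files": lambda: True,
     "notify_on_success": lambda: True},
    [("truncate_staging", "load_files", True),
     ("load_files", "notify_on_success", True)],
)
print(results)
```

If `load_files` reported failure, `notify_on_success` would simply never run, which mirrors how a success-based precedence constraint suppresses its downstream task.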

While each task has a unique role in covering the full spectrum of extraction, transformation, and loading activities, the Data Flow Task stands out in terms of significance and popularity. (This dominance is reflected by its distinct visual representation in the Business Intelligence Development Studio interface, which we will describe in detail in our upcoming articles.) This task represents the second category of package objects, known as Data Flow. Its primary purpose is to manipulate data being transferred between a source and a destination by applying any number of transformations (such as aggregations, duplications, groupings, lookups, splits, merges, or samplings, to mention just a few). Definitions of source and destination data stores rely on Connection Managers, which are available to both Control Flow and Data Flow components and facilitate access to a wide range of data formats (including those defined using ADO.NET and OLE DB providers). If desired, it is possible to hide the complexities of the underlying storage (simplifying its analysis) by employing data source views. Flexible error and event handling, as well as transactional support, increase the resiliency and robustness of package execution. Troubleshooting can be considerably simplified by employing detailed and highly customizable debugging and logging mechanisms.
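The source-to-destination pipeline idea can likewise be sketched with Python generators (again a conceptual model, not SSIS itself; the column names and sample rows are invented). Rows flow from a source through a chain of transformations, here a derived-column step followed by an aggregation:

```python
# Conceptual sketch of a Data Flow: source -> transformations -> destination.

def source():
    # Stands in for a source component reading from a Connection Manager
    yield {"region": "east", "amount": 120}
    yield {"region": "west", "amount": 80}
    yield {"region": "east", "amount": 40}

def derived_column(rows):
    # Adds a computed column, like the Derived Column transformation
    for row in rows:
        row["amount_net"] = round(row["amount"] * 0.9, 2)
        yield row

def aggregate(rows):
    # Groups and sums, like the Aggregate transformation
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount_net"]
    for region, total in sorted(totals.items()):
        yield {"region": region, "total_net": total}

# The "destination" simply materializes the transformed stream
destination = list(aggregate(derived_column(source())))
print(destination)
```

Because each stage consumes and yields rows lazily, the chain processes data in a streaming fashion, which is loosely analogous to how the data flow engine pipelines buffers between components.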

The dual nature of SSIS packages is reflected by the similarly structured architecture of the underlying system components. More specifically, SSIS divides its responsibilities between the runtime engine (which implements control flow functionality with such features as transactional support, debugging, event handling, or logging) and the data flow engine (which facilitates activities defined within Data Flow Tasks, covering each of their stages, from extraction through transformation to loading). An operating system-level service (labeled SQL Server Integration Services 10.0) deals with the storage of packages and monitors their overall execution status.
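For packages stored internally within SQL Server, the service keeps its catalog in the msdb database. As a simple illustration (assuming you have access to msdb on a SQL Server 2008 instance), the stored packages can be enumerated with a query along these lines:

```sql
-- Illustrative: list Integration Services packages stored in msdb
SELECT name, description, createdate
FROM msdb.dbo.sysssispackages
ORDER BY name;
```

Packages stored as .DTSX files in the file system, by contrast, are simply referenced by path when executed.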

While each edition of SQL Server 2008 and 2008 R2 is capable of executing SSIS packages (thanks to the presence of the Integration Services runtime), the level of support for more advanced ETL-related functionality varies. In particular, from the package authoring perspective, the Workgroup, Web, and Express Editions are limited to the interactive Import and Export Wizard, with its basic choice of sources, transformations, and destinations, but without access to Business Intelligence Development Studio and its Designer interface. Both the SQL Server Standard and Enterprise Editions offer a fully functional Integrated Development Environment (exposed via Business Intelligence Development Studio and its Designer interface) and matching performance capabilities without artificially imposed throttling (performance boundaries are imposed, however, by the four-processor limit of Standard Edition and by restrictions inherent to the operating system on which the SQL Server instance is hosted). Either of them will let you experience the most interesting innovations introduced in the new platform, including the Data Profiling task (facilitating capturing and analyzing data processed by SSIS packages) or the Cache Transform transformation, which delivers performance improvements during lookup operations. The primary distinction between the two products is support for a number of more sophisticated Data Flow Task transformations (Data Mining Query, Fuzzy Grouping, Fuzzy Lookup, Term Extraction, or Term Lookup) and destinations (Partition Processing, Dimension Processing, or Data Mining Model Training), as well as the Data Mining Query Task, which are available only in the Enterprise Edition.

In our next article, we will take a closer look at the initial setup and configuration of Integration Services in both SQL Server 2008 and 2008 R2 (note that the latter is based on the November 2009 Community Technology Preview).
