SQL Server 2005 - SQL Server Integration Services - Part 1
May 13, 2005
One of the SQL Server 2005 Beta 2 features that has undergone considerable architectural, functional, and interface changes, compared with its earlier versions, deals with extraction, transformation, and loading (also known as simply ETL) of data. (In other words, allowing you to move data between a variety of stores and to modify it in the process, according to your arbitrary requirements). While in SQL Server 7.0 and 2000, such capabilities have been provided by Data Transformation Services (DTS in short), Microsoft decided that the degree of modifications introduced in the new version justified re-branding it, hence its rearchitectured and significantly improved implementation has been introduced under the new name of SQL Server Integration Services (SSIS). For the same reason, rather than pointing out changes that have been implemented since the release of DTS in SQL Server 2000, in the next few articles, we will provide a comprehensive analysis of its SQL Server 2005 equivalent. For review of the SQL Server 2000 version of Data Transformation Services refer to our earlier series covering this topic and published on the Database Journal Web site).
Let's start with an overview of the most essential terms and ideas that are critical to understanding how ETL operations are handled in SQL Server 2005. The concept of the DTS Package has been inherited from the earlier versions of SQL Server and still is considered as the centerpiece of the data transformation object model, representing a unit of work that can be independently saved, retrieved, or executed, and which serves as a container for all other DTS objects. Within a package, you can distinguish two main categories of objects - control flow or data flow. Objects in the first category determine processing sequence within the package and consist of containers, tasks, and precedence constraints. Objects that belong to the second category define the flow of data. They can be further divided into source adapters, transformations, and destination adapters. Control flow dictates architectural structure of a package and can contain any number of data flow units (defined independently of each other and consisting of any number of data flow objects). The following are short descriptions of each of the control and data flow object types:
On a more granular level, data and control flow elements include other types of package components, such as connection managers and variables. Connection managers are logical representations of connections to various types of data stores (such as ActiveX data objects and ADO.NET objects, file system folders and files, message queues, OLE DB- or ODBC-compatible databases or SMTP mail servers) along with their properties (for example a connection string, data source identifier, or description). Connections can be defined as part of tasks or transformations, and are used when setting up source and destination adapters. Variables provide temporary storage for parameters whose values can change from one package execution to another, accommodating package reusability (for example, allowing to execute the same T-SQL statement or a script against a different set of objects). They can be incorporated into containers, tasks, event handlers, and precedence constraints. In general, there are two types of variables - system (pre-defined) and custom (created on an as needed basis at the time of package development). We will demonstrate their use in future articles of this series.
Also among new DTS package components introduced in SQL Server 2005 are event handlers and log providers. With the former, you can write custom code that is invoked in response to a specific type of event triggered by executable components, such as ForEach Loop, For Loop, Sequence, and TaskHost containers (as well as packages themselves). For example, by creating a handler for the OnError event, you might be able to gracefully resolve a particular type of error condition or provide notification about its occurrence. Log providers allow the capturing of events taking place when tasks execute within ForEach Loop, For Loop, and Sequence containers and record them in a text or XML file, SQL Profiler log, SQL Server databases, or the Windows Application Event log.
Now, following our short introduction into ETL concepts, let's look next into their implementation, in particular focusing on management lifecycle of DTS packages. In our discussion about the utilities present in SQL Server 2005 Beta 2, we briefly mentioned Business Intelligence Development Studio. This is the primary environment for package development. As with the SQL Server Management Studio, its interface follows the design philosophy derived from Microsoft Visual Studio .NET, reflecting this way its main purpose (as far as Business Intelligence solutions are concerned). Within its scope, besides Analysis Services and Reporting Services-related features, there is also a set of utilities and wizards that provide the ability to create, modify (with proper version control through integration with Visual SourceSafe), debug, and deploy packages. Packages are developed based on the paradigm of a solution consisting of one or more DTS specific projects, which also can be combined with Analysis or Reporting Services projects. Subsequently, such solutions can be managed with SQL Server Management Studio, which includes the ability to store them in the msdb database (in addition to using .dtsx file format - option that is also available from Business Intelligence Development Studio), schedule, and execute them (but not modify, deploy, debug, or provide version control).
The main ETL development utility included with SQL Server Business Intelligence Development Studio is DTS Designer. DTS Designer is a sophisticated and highly versatile tool, giving access to an overwhelming majority of DTS functionality. In addition, SQL Server Management Studio offers DTS Import/Export Wizard (also accessible by invoking the DTSWizard executable, which resides in C:\Program Files\Microsoft SQLServer\90\Tools\Binn\VSShell\Common7\IDE folder, from the Command Prompt). It provides a limited ability to create packages outside of the primary development environment (but at the same rate, simplifying the development process by delivering an intuitive, step-by-step guide). It is also possible to fully manage packages with purely programmatic methods. Package execution is triggered with the Start entry from the Debug menu (or equivalent toolbar button) in the DTS Designer interface (corresponding to F5 function key shortcut). The same can be accomplished with DTExec or DTExecUI command line utilities (both located in the C:\Program Files\Microsoft SQL Server\90\DTS\Binn folder). Besides running packages interactively, you also have an option of scheduling them for unattended execution with SQL Server Management Studio. The DTUtil command line utility (also residing in C:\Program Files\Microsoft SQL Server\90\DTS\Binn folder) offers the ability to copy packages between files, msdb SQL Server database, or DTS Package Store, as well as signing them in order to protect their content. To deploy packages, use DTS Package Deployment Wizard, which installs packages either to the file system or msdb database. Package Configurations, new in SQL Server 2005, facilitate the changing of properties of package components at run time, by applying values stored outside of it (it is possible to use environment variables, registry entries, XML-formatted files, or variables contained in a parent package for this purpose).
While each edition of SQL Server 2005 includes the ability to execute packages, the level of the remaining ETL related functionality varies. In particular, no additional features are available in the Express and Workgroup versions. Starting with the Standard edition, you have access to Business Intelligence Development Studio, which allows you to develop packages with DTS Designer. More advanced transformations and tasks, dealing with data and text mining or fuzzy operations (such as Analysis Services Partition Processing Destination, Analysis Services Dimension Processing Destination, Data Mining Training Destination, Data Mining Query Component and Task, Term Extraction and Lookup, as well as Fuzzy Grouping and Lookup) are available only when you switch to the Enterprise product. Fuzzy operations, new in the SQL Server product line, provide the ability to clean up data, either at the time of its loading (Lookup) or already existing in a data store (Grouping), by comparing the content of "candidates" rows against the ones already established as valid, and finding the closest matches in case of inconsistencies. (This eliminates the potential for misspellings, spelling variations of the same name, which yield to duplications and other typical data input-related issues).
Another ETL-related change in SQL Server 2005 is the introduction of Integration
Services Service, running in the security context of NT AUTHORITY\NETWORK
account, which takes part in managing the operation of DTS packages. It
is controlled from the SQL Server Management Studio from where you have an
option of starting or stopping it (pausing and restarting states are not
available). Its main purpose is to monitor status of running
In our next article, we will present the process of creating a simple package, explaining at the same time all relevant ETL functionality in more detail.