Continuing our coverage of SQL Server 2005 Integration Services Data Flow
components, we will present Derived Column transformation, which, while serving
primarily an auxiliary role, should not be ignored, since it offers
considerable flexibility and versatility combined with extreme ease of use. Its
functionality is based on its ability to generate output data rows through
custom-built expressions (constructed with a wide range of built-in functions
and operators), which, besides referencing input columns might also leverage
values stored in package variables. Depending on your preferences, each
expression can be used either to replace data in an input column (as it is
being passed to the output) or to form an extra output column (which requires
that you designate a unique name, data type, and its related characteristics,
such as length, precision, scale, and, for strings, also correct code page). These
changes are typically made via its Editor interface, which offers easy access
to variables and columns (displayed in the upper left section of its window),
functions and operators (located in the listing on the right hand side and
consisting of such items as Mathematical, String, Date/Time, and NULL
functions, Type Casts, as well as Operators), and (in the area at the bottom)
table-formatted structure showing derived columns with each of their
properties.
In order to examine these configuration options more closely, create a new
project of Integration Services type in the Business Intelligence Development
Studio. Once you are presented with the Control Flow tab SSIS Designer, add a
new Data Flow task to it and double click on it (which will allow you to edit
it directly). Drop OLE DB Source onto it (after dragging it from the Toolbox)
and display its Editor by selecting the appropriate option from its context
sensitive menu. In the Editor window, create a new OLE DB connection manager,
pointing to the AdventureWorks database on the server hosting your SQL Server
2005 installation, and specify a valid authentication option. In the Data
access mode listbox, select the "Table or view" option and locate
[Sales].[vSalesPerson] next to the "Name of the table or the view"
entry below. Ensure that all of its columns are included (by checking the
content of the Columns section) and close the Editor window by clicking on the OK
button.
Next, add Derived Column transformation (by dragging its icon from the Toolbox
onto the Data Flow tab area) and connect its input with the output of the OLE
DB Source. Right click on it and select the Edit option to display its Editor
window. Review its layout, noting the listing of variables and columns, as well
as the available functions and operators. Each of the input columns can be
simply copied to the output (use the Input Columns tab in the Advanced Editor
window in order to designate them) or replaced with values obtained by applying
expressions specified in the Expression column within the Editor window. These
expressions can also be used to produce the content of new columns (with
arbitrary names assigned through either of the two Editor windows). To capture
outcome of the transformation, connect its output to a suitable Data Flow
destination.
Despite its straightforward interface, Derived Column delivers quite a diverse
and powerful set of features, which is best explored by testing its ability to
process different types of data. The rich selection of mathematical functions
and operators allow you to perform a variety of calculations and logical
manipulations (when using them, keep in mind their precedence and associativity
rules, which are described in detail in the Books Online).
Among the most useful ones is the conditional operator, which takes the form expression1 ? expression2 : expression3
(equivalent to the IF expression1 THEN
construct common to practically all
expression2 ELSE expression3
programming languages). The operator evaluates the Boolean value of the expression1
and based on its outcome
(True or False) returns a result of expression2
or expression3
(respectively). For example, let’s assume that we want to motivate salespeople,
based on their sales figures, by comparing content SalesYTD and SalesQuota
fields for each. This can be done by adding a new derived column (call it
Motivation) and entering the following formula in the Expression field:
SalesYTD > SalesQuota ? "Carrot" : "Stick"
and ensuring that the designated Data Type matches both expression2
and expression3
(in this case,
string[DT_STR] of length 6 and 1252 Code Page will suffice). If you try to
execute our sample package, you will notice that the results are not complete, due
to the fact that some of the SalesQuota entries are NULL. To resolve this
issue, we will modify expression for the derived column to account for these
exceptions (length of the data type will be automatically adjusted to accommodate
the change):
ISNULL(SalesQuota) ? "Electrode" : (SalesYTD > SalesQuota ? "Carrot" : "Stick")
Another common scenario where Derived Column transformation provides a convenient,
easy to implement solution, involves concatenating entries from several columns
(such as putting together a full address by combining its individual
components). As before, you can verify whether each field contains a value by
applying the ISNULL function to it and, depending on the outcome, substitute it
with an empty string or include its content in the final result. In our sample
package, with input columns named AddressLine1, AddressLine2, City, PostalCode,
StateProvinceName, and CountryRegionName, the relevant expression takes the
form (note that we are adding an extra space to each entry, except for the last
one, in order to provide separation between them):
(ISNULL(AddressLine1) ? "" : AddressLine1 + " ") + (ISNULL(AddressLine2) ? "" : AddressLine2 + " ") + (ISNULL(City) ? "" : City + " ") + (ISNULL(PostalCode) ? "" : PostalCode + " ") + (ISNULL(StateProvinceName) ? "" : StateProvinceName + " ") + (ISNULL(CountryRegionName) ? "" : CountryRegionName)
The ability to process null values gracefully (by replacing them with arbitrary
defaults) is helpful in implementations that involve Lookup Transformations
(for more information on this subject, refer to our
earlier article of this series). Entries without data that fail lookup can
be simply passed through to the output and replaced afterwards with appropriate
values by a Derived Column transformation, located downstream in the Data Flow.
The majority of less complex string operations can be performed by directly
applying functions built into the Derived Column feature set. For example, to
trim leading or trailing blank spaces, simply execute LTRIM
or RTRIM
(respectively). Changing case is easily done with UPPER
or LOWER
. RIGHT
allows you to retrieve the trailing portion of a character expression of a
specified length (to extract the leading string of desired length use the SUBSTRING
function, setting its first
parameter to 1 and the second one to the required number of characters). You
can also handle more involved string manipulations, such as parsing, leveraging
FINDSTRING
(which finds the
starting position of one string within another) or the just mentioned SUBSTRING
(extracting designated portion
of a string). With more complex cases, it might be more sensible to use the
Script Component, since editing lengthy Derived Column expressions can be
rather cumbersome.
Derived Column is also useful for the purpose of data conversion (mainly by
employing type casting operations). If you expect any records in your input to
contain values for which conversion will fail, you can address this issue by
directing Error Output into another Derived Column transformation, which will
replace the "illegal" entries with some valid default (choosing an
appropriate default would depend on the desired data type). The output of the
second transform could then be combined (using Union All Data Flow component)
with the output of the first to yield a complete data set.
Similarly, with the help of expressions available within Derived Column
transformation, it is fairly straightforward to convert between data formats.
For example, the DATEPART function (essentially identical to its T-SQL
counterpart) provides the ability to extract an integer representing a portion
of a specific date. Alternatively, this can be accomplished with DAY, MONTH,
and YEAR functions. This way, converting the current date from the default US
format of MMDDYYYY to YYYY/MM/DD is possible with either of these two
statements:
(DT_STR,4,1252)DATEPART("yyyy",GETDATE()) + "/" + (DT_STR,2,1252)DATEPART("mm",GETDATE()) + "/" + (DT_STR,2,1252)DATEPART("dd",GETDATE()) (DT_STR,4,1252)(YEAR(GETDATE())) + "/" + (DT_STR,2,1252)(MONTH(GETDATE())) + "/" + (DT_STR,2,1252)(DAY(GETDATE())) + "/"
As you can see, Derived Column transformations offer
significant flexibility in the areas of logical and math operations, string
processing, and data conversion. I hope that this article, intended as a brief
overview of some of its capabilities will help you leverage its usefulness in
other scenarios.