In the recent installments of our series dedicated to SQL Server 2005
Integration Services, we have been discussing individual Control Flow tasks,
focusing on their unique functionality and characteristics. One of the few
remaining items in this category is the Bulk Insert task, which will be the
topic of this article.
Bulk Insert functionality has been available since SQL Server 7.0 (in the
form of the BULK INSERT T-SQL statement), when it was introduced to provide a
fast and straightforward method of importing data into a database
(supplementing the slower, although more versatile, bcp command line utility).
Despite its limited flexibility (we will discuss its limitations in more detail
shortly), it is still commonly used due to its impressive efficiency. With
minor exceptions, which we described when covering SQL
Server 2005 Integration Services performance (where we pointed out that
when importing files containing large amounts of date, time, currency, or
numeric data in the most common formats, without locale-specific or special
currency characters or missing and implied date parts, it is typically possible
to minimize processing time by reading their content with the Flat File Source
Data Flow component with its Fast Parse property set to True), Bulk
Insert offers superior speed. There are, however, several caveats that you
need to be aware of in order to take full advantage of its potential.
Among the most relevant recommendations (when dealing with sizeable data
sources) are the following (a brief T-SQL sketch illustrating several of them
appears directly after this list):
- Configure your target database with the Bulk-logged recovery model, which
keeps track of extent and page allocations but does not register individual
record additions. This can be accomplished from the Options section of the
Database Properties dialog box in SQL Server Management Studio. (Keep in mind,
however, the implications regarding recoverability: perform a backup before and
after the insert operation, and switch back to the Full recovery model
afterwards.)
- Consider locking the target table (by executing the sp_tableoption stored
procedure to enable its "table lock on bulk load" option or by running BULK
INSERT with the TABLOCK hint) in order to minimize lock contention and avoid
row-level locking, which in turn improves efficiency.
- Avoid having indexes on the target table (you might want to drop them prior
to the data load, provided that the amount of data to be imported significantly
exceeds the number of records already present in the table) and ensure that the
table is not marked for replication.
- If possible, copy the source file to the same server where the Bulk Insert
task will be running (in order to avoid the delay associated with transferring
it over the network during package execution). This will also let you disregard
the additional requirements that result from the security context in which BULK
INSERT operates, which, in turn, depends on the way the package is launched
(for more information on this subject, refer to one of our earlier articles in
this series). Keep in mind that a package containing the Bulk Insert task can
be executed only by members of the sysadmin fixed server role.
- Divide the entire load into individual batches (with the BATCHSIZE parameter
of BULK INSERT), which are then processed as separate transactions (unless the
task joins a transaction already in progress, in which case its batches are
subject to the parent's commit or rollback). This makes larger data loads more
manageable and improves the potential for concurrency.
- When replacing existing records, truncate the target table (or recreate it)
rather than using the fully logged DELETE T-SQL command.
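The following minimal T-SQL sketch outlines these preparations, assuming the
AdventureWorks database and the HumanResources.myTeam table used later in this
article (the backup location is hypothetical, so adjust it to your environment):
-- Switch to the Bulk-logged recovery model for the duration of the load
ALTER DATABASE AdventureWorks SET RECOVERY BULK_LOGGED;
-- Have bulk loads take a table lock rather than row-level locks
EXEC sp_tableoption 'HumanResources.myTeam', 'table lock on bulk load', 'true';
-- When replacing existing records, truncate rather than DELETE
TRUNCATE TABLE HumanResources.myTeam;
-- ...perform the bulk load here...
-- Revert to the Full recovery model and back up the database afterwards
ALTER DATABASE AdventureWorks SET RECOVERY FULL;
BACKUP DATABASE AdventureWorks TO DISK = 'C:\Backups\AdventureWorks.bak';  -- hypothetical path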
To demonstrate the implementation of these recommendations in the context of
the SSIS Bulk Insert task, we will leverage an example
posted on the MSDN web site (which is based on the AdventureWorks sample
database, so make sure you have it installed – you can download the relevant
Windows Installer .MSI setup files from the Microsoft
Download Center). Initiate a new project of the Integration Services type in
SQL Server Business Intelligence Development Studio. Add an Execute SQL Task
(which will create the HumanResources.myTeam table, as in the MSDN example) to the Control
Flow of its Designer interface. Select the Edit… option from the task’s
context sensitive menu to display its Editor window. In the General section,
assign a new OLE DB Connection Manager pointing to the AdventureWorks database
on the local SQL Server 2005 instance (with appropriate authentication
settings). With SQLSourceType set to "Direct input", copy the
following T-SQL Statement as the SQLStatement property value:
USE AdventureWorks;
IF EXISTS (SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
           WHERE TABLE_SCHEMA = 'HumanResources' AND TABLE_NAME = 'myTeam')
    DROP TABLE HumanResources.myTeam
GO
CREATE TABLE HumanResources.myTeam
    (EmployeeID smallint NOT NULL,
     Name nvarchar(50) NOT NULL,
     Title nvarchar(50) NULL,
     Background nvarchar(50) NOT NULL DEFAULT '');
GO
The statement creates the HumanResources.myTeam table (after first checking
whether one already exists and, if so, dropping it). We will populate it with
sample data consisting of the following records (save them to an arbitrarily
named file on the computer hosting your SQL Server 2005 installation):
77,'Mia Doppleganger','Administrative Assistant','Microsoft Office'
49,'Hirum Mollicat','I.T. Specialist','Report Writing and Data Mining'
Once the file is saved, create a Flat File Connection
Manager that points to it. Accept its default values with the exception of the
Text qualifier value (in the General section of the Connection Manager Editor
window), which you might want to change to a single quote ('), to reflect the
way data strings are marked. Verify that the values are displayed properly
(i.e. without the single quotes) after you switch to the Columns section and
close the Editor window. Next, drag the Bulk Insert Task icon from the Toolbox
onto the Designer interface directly underneath the Execute SQL task and
connect the two with a precedence constraint (such that the Bulk Insert task
executes only upon successful completion of the Execute SQL task). Using the Edit option in
the context sensitive menu of the Bulk Insert Task, display its Editor window.
In its Connections section, select the earlier defined OLE DB connection,
pointing to the AdventureWorks database on the local SQL Server installation
and specify its [HumanResources].[myTeam] table as the DestinationTable (note
that you can also designate views as import targets). Properties grouped in the
Format subsection specify the way source data is delimited. This information
can be provided directly (by assigning values to RowDelimiter and ColumnDelimiter
properties, which by default are set to {CR}{LF} and Tab, respectively), or
through a format file (which we will cover in our next article). To accommodate
our sample data, change the value of ColumnDelimiter from Tab to Comma {,}.
Using the File property under the Source Connection subsection, choose the
earlier defined Flat File Connection Manager, pointing to the local source data
file.
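Incidentally, these two properties correspond directly to the FIELDTERMINATOR
and ROWTERMINATOR arguments of the underlying BULK INSERT statement. A roughly
equivalent stand-alone T-SQL invocation (with a hypothetical path to the source
file) would take the following form:
BULK INSERT AdventureWorks.HumanResources.myTeam
FROM 'C:\Data\myTeam.txt'      -- hypothetical location of the source file
WITH (FIELDTERMINATOR = ',',   -- matches the ColumnDelimiter of Comma {,}
      ROWTERMINATOR = '\r\n'); -- matches the default RowDelimiter of {CR}{LF}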
Switch to the Options section in the Bulk Insert Task Editor interface. The
choices available here might have a significant impact on the way the task is
processed, so let's review them in more detail:
- CodePage – specifies the code page of character data in the source file. The
default of RAW speeds up processing by skipping translation from one code page
to another (you need to determine whether this is applicable depending on the
content of your source data and the code page of the target database).
- DataFileType – can take on one of four possible values: char (the default,
indicating standard character format), native (offering superior performance
compared to the char option, but applicable only to native data types),
widechar (similar to char, but accommodating Unicode characters), and
widenative (performing better than the widechar option, allowing for Unicode
characters in char, varchar, and text columns, but expecting native data types
in all others).
- BatchSize – assigns the number of rows to be processed as a single
transaction (the default value of zero loads the entire data file as part of
the same transaction).
- LastRow – specifies the last row from the data file to be included.
- FirstRow – specifies the first row from the data file to be included.
- Options – affects the speed of loading by dictating whether to perform such
actions as checking for constraints on the target table and its columns (this
might have a significant impact if your target table contains a clustered
index), keeping nulls in columns for which data is not included in the source
file, enabling identity insert, applying a table lock (invoking the TABLOCK
hint), and firing triggers (in response to rows being inserted into the target
table).
- SortedData – allows you to indicate that the imported data is sorted, by
specifying the column to ORDER BY in the target table. For the sake of
efficiency, you might want to consider keeping it at its default FALSE value.
- MaxErrors – sets the maximum number of errors that can take place before the
task execution is aborted.
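These settings likewise map onto arguments of the BULK INSERT statement, so it
might help to see how a more fully configured load would be expressed in
stand-alone T-SQL. The following is a hedged sketch (the option values are
arbitrary illustrations and the file path is hypothetical), extending the
earlier example:
BULK INSERT AdventureWorks.HumanResources.myTeam
FROM 'C:\Data\myTeam.txt'     -- hypothetical location of the source file
WITH (CODEPAGE = 'RAW',       -- CodePage
      DATAFILETYPE = 'char',  -- DataFileType
      BATCHSIZE = 1000,       -- BatchSize (0 = entire file in one transaction)
      FIRSTROW = 1,           -- FirstRow
      MAXERRORS = 10,         -- MaxErrors
      TABLOCK,                -- Options: apply table lock
      CHECK_CONSTRAINTS,      -- Options: check constraints
      KEEPNULLS,              -- Options: keep nulls
      FIELDTERMINATOR = ',',
      ROWTERMINATOR = '\r\n');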
With our simplified source file, you can simply leave all of the defaults in
place (you would need to adjust them when dealing with a more realistic data
set, according to its size, file type, code page translation requirements, and
so on). Close the task Editor window and launch the package. Once the execution
completes, the target table should have the following content:
77 'Mia Doppleganger' 'Administrative Assistant' 'Microsoft Office'
49 'Hirum Mollicat' 'I.T. Specialist' 'Report Writing and Data Mining'
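You can confirm this outcome by running a simple query against the target
table, for example:
USE AdventureWorks;
SELECT EmployeeID, Name, Title, Background FROM HumanResources.myTeam;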
Note that the table contains the same records as our source file, including the
single quotes used to qualify text fields (despite the fact that we designated
the single quote as the Text qualifier in our Flat File Connection Manager). While you can
cleanse the data prior to or after the import, a more efficient option is to
address this issue by employing the earlier mentioned format files, which
provide additional processing capabilities. We will present this approach in
our next article.