While using Microsoft Office Excel to store a variety of semi-structured data tends to be more economical and convenient than SQL Server, it lacks data mining and data warehousing capabilities. This article explores importing spreadsheets onto SQL Server and taking advantage of Office Automation through managed code running inside a Script Task.
Despite the technological superiority of
the SQL Server product line, Microsoft Office Excel remains a
widely popular application used to store a variety of semi-structured data.
While this approach tends to be (at least in a short term) more economical and
convenient, it lacks mining and warehousing capabilities provided by database
management systems. These shortcomings can be remediated by importing
spreadsheets into a relational database. SQL Server 2008 Integration Services simplifies the
task of collating their content into the appropriate format, making them
suitable for transformation and loading. In this article, we will explore one
of several methods that provide this functionality, which takes advantage of Office Automation through managed
code running inside a Script
Task.
The features we are interested in are
exposed by Microsoft.Office.Interop.Excel namespace, incorporated
into the Microsoft
Office
installation and made available via the Office Primary Interop
Assemblies download. They provide the ability to interact with Excel (as well as
other Microsoft
Office
applications) from within Visual Studio Tools for Applications. Since this
technique is intended primarily for client-side programming, you need to be
aware of its limitations (especially in regard to supportability) as well as
additional caveats that need to be taken into account when employing it to
implement server-based COM code running in
a security context of a non-interactive user (for example as a scheduled task
or a SQL Server
Agent
job). As the Microsoft
Technet
article, Considerations for
server-side Automation of Office explains in detail, this
involves potentially unstable, insecure, and inconsistent behavior, related to
dependencies on such factors as user identity, interactivity with the desktop,
reentrancy, and scalability.
Keeping these warnings in mind and
heeding the consequences associated with them, let’s examine what type of
actions can be carried out by leveraging types and members of the Microsoft.Office.Interop.Excel
namespace. To illustrate these capabilities, we will present a generic
scenario, in which we enumerate Excel workbooks residing in an arbitrarily
chosen location, then identify their worksheets and named ranges for each of
them, and finally display the content of their individual cells. The code will
be implemented using Visual
Studio Tools for Applications accessible via SSIS Script Task.
To start, launch Business Intelligence
Development Studio
and create a new project based on the Integration Services template. Drag the Script Task icon from the Toolbox and drop it on
the Designer interface.
Display its Editor dialog box,
designate Visual
Basic .NET 2008
as the ScriptLanguage, and confirm
your choice by clicking on the OK
command button. Use the Variables window to
define a new variable (we will call it FolderName) of String type and set its value to the name of a
folder where the Excel spreadsheets
reside. Next, activate the Script Task Editor dialog box (accessible via the Edit… option in its
context sensitive menu), add the User::FolderName to the ReadOnlyVariables entry of the Script section, and
click on the Edit
Script…
button to display Visual
Studio Tools for Applications.
In order to facilitate interaction with
the Excel object model,
we need to add references to the relevant library (which you can identify by
referring to the Microsoft
Knowledge Base
article 219151). To
accomplish this, select the Properties
entry from the context-sensitive menu of the project displayed in the Solution Explorer window within
the Visual
Studio Tools for Applications Tools interface. Switch to the References section, click
on the Add… command button,
display content of COM tab, highlight
the appropriate object library entry (which, in our example, happens to be Microsoft Excel 12.0 Object
Library,
corresponding to Excel
2007),
and confirm your choice by clicking on the OK command button. Once you return to the ScriptMain.vb window, add Imports
Microsoft.Office.Interop.Excel and Imports System.IO (used for file system operations)
statements to its (General)(Declarations) section.
Finally, copy the following code as the content of Public Sub Main():
Public Sub Main() Dts.TaskResult = ScriptResults.Success Dim strFile As String Dim objExcelApp As New Microsoft.Office.Interop.Excel.Application Dim objExcelWbk As Microsoft.Office.Interop.Excel.Workbook Dim objExcelSheet As Microsoft.Office.Interop.Excel.Worksheet Dim objExcelCell As Microsoft.Office.Interop.Excel.Range Dim intSheet As Integer Dim intRow As Integer Dim intCol As Integer Try Dim strFolder = Dts.Variables("FolderName").Value.ToString Dim colFiles As String() = Directory.GetFiles(strFolder, "*.xlsx") For Each strFile In colFiles MessageBox.Show(strFile) objExcelWbk = objExcelApp.Workbooks.Open(strFile) MessageBox.Show(objExcelWbk.Worksheets.Count) For intSheet = 1 To objExcelWbk.Worksheets.Count objExcelSheet = objExcelWbk.Worksheets(intSheet) MessageBox.Show(objExcelSheet.Name.ToString) For intRow = 1 To objExcelSheet.UsedRange.Rows.Count For intCol = 1 To objExcelSheet.UsedRange.Columns.Count objExcelCell = objExcelSheet.UsedRange.Cells(intRow, intCol) If Not (objExcelCell.Value = Nothing) Then MessageBox.Show(objExcelCell.Value.ToString, "Content of cell " + _ objExcelWbk.Name.ToString + "." + objExcelSheet.Name.ToString + _ "." + intRow.ToString + ":" + intCol.ToString, MessageBoxButtons.OK) End If Next Next Next objExcelWbk.Close(False) Next objExcelSheet = Nothing objExcelWbk = Nothing Catch ex As Exception Dts.TaskResult = ScriptResults.Failure MessageBox.Show(ex.Message.ToString(), "Exception", MessageBoxButtons.OK) Finally objExcelApp.Workbooks.Close() objExcelApp.Quit() End Try End Sub
Our script enumerates all files with the
extension .xlsx residing in the
folder, whose name is obtained by referencing the FolderName SSIS variable. For
each of them, we drill down through Workbooks and Worksheets collections (accessible via Microsoft.Office.Interop.Excel.Application namespace)
using their Open method, down to
the individual cell level (taking advantage of the Microsoft.Office.Interop.Excel.Range object, its Rows and Columns properties, as
well as their respective collections).