Data Mining Query Transformation in SQL Server Integration Services

In one of our recent articles dedicated to SQL Server 2008 R2 Integration Services, we presented the characteristics and usage of Data Mining Query Task. As explained, its primary objective is to execute Data Mining Extension (DMX) queries against data mining models and capture their outcome into an arbitrary table.

In the case of prediction queries, which are of particular interest to us in the context of our presentation, this outcome is calculated based on a join between an external data set (identified by OPENQUERY clause) and a suitable mining model. A sample task, employed by us to demonstrate this concept, relied on an Analysis Services data source (pointing to the ProspectiveBuyer table of Adventure Works DW database) to produce a single data set. Now it is time to describe a different approach to executing DMX queries, which incorporates them into SSIS Data Flow.

While the underlying basis mechanism remains the same, there are some important differences resulting from the distinct nature of Data Flow components. Control Flow deals with orderly processing of tasks, which are individual, isolated units of work that perform a specific action ending with a finite outcome that can be evaluated as either Success, Failure, or Completion. While their sequence can be customized by linking them into arbitrary arrangements with precedence constraints and grouping them together or repeating their execution in a loop with the help of containers, a subsequent task does not initiate unless its predecessor has completed.

Data Flow, on the other hand, carries out its operations by employing the pipeline paradigm, processing data record by record (or rather, to be more accurate, memory buffer by memory buffer) originating from its source and ending at its a destination, modifying it in transit by applying transformations, although there are exceptions to this rule, since some of the components, such as Sort or Aggregate, depend on the ability to view the entire data set before handing it over to their downstream counterparts. Another distinction is the absence of a mechanism that would allow direct transfer of data between individual control flow tasks.

This particular distinction becomes apparent when we compare capabilities of Data Mining Query Transformation and its Control Flow equivalent. While both of them might allow us to obtain the same information – for example, identify all prospective customers according to predictions of an arbitrary mining model – the former simplifies its further processing, which can be easily facilitated by feeding its output to an input of another, downstream component. Even more importantly, such processing can be incorporated into the existing pipeline, which automatically provides its benefits, such as optimized performance due to ability to utilize memory-resident data sets without resorting to disk access or relying on temporary storage.

In order to demonstrate the functionality of Data Mining Query Transformation and contrast its characteristics with Data Mining Query Task, we will implement the same sample scenario discussed in our recent article. This scenario is based on the content of AdventureWorksDW2008R2 database, which contains customer data of a fictitious company that manufactures and sells bicycles to international commercial markets and is downloadable from the CodePlex web site, where you will also find detailed procedures illustrating its deployment. The resulting Analysis Services installation includes Targeted Mailing Mining Structure (similar in many aspects to Bike Buyer Mining Structure documented in the tutorial in SQL Server 2008 R2 Books Online), which, in turn, contains four mining models. Just as before, we will choose TM Decision Tree to compose a prediction query. However, rather than referencing the ProspectiveBuyer table via OPENQUERY clause, we will process its content row by row employing Data Flow pipeline, which is represented in this case by @InputRowset parameter. In addition, we use the FLATTENED option of the SELECT statement in order to convert any potentially nested results into a single denormalized table format, which is required in this case, since SSIS Data Flow does not support nested rowsets. Effectively, our query takes the following format (note that you might be able, at least to some extent, construct it by leveraging New Data Mining Query dialog box accessible via Build New Query command button):

SELECT FLATTENED

  [TM Decision Tree].[Bike Buyer],

  PredictProbability([Bike Buyer])

FROM

  [TM Decision Tree]

PREDICTION JOIN

  @inputRowset AS t

ON

  [TM Decision Tree].[Marital Status] = t.[MaritalStatus] AND

  [TM Decision Tree].[Gender] = t.[Gender] AND

  [TM Decision Tree].[Yearly Income] = t.[YearlyIncome] AND

  [TM Decision Tree].[Total Children] = t.[TotalChildren] AND

  [TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND

  [TM Decision Tree].[Education] = t.[Education] AND

  [TM Decision Tree].[Occupation] = t.[Occupation] AND

  [TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND

  [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]

WHERE [TM Decision Tree].[Bike Buyer] = 1

As in our earlier example, the intention is to identify records in the ProspectiveBuyer table that represent customers, who, according to the DM Decision Tree mining model, are likely to purchase a bike.

To implement our objective using Data Mining Query Transformation, launch Business Intelligence Development Studio and create a new project based on Integration Services template. Switch to the Data Flow tab and activate the hyperlink displayed in the center of the main window stating No Data Flow tasks have been added to this package. Click here to add a new Data Flow task. With the new Data Flow task in place and active, drag OLE DB Source from the Toolbox onto the Designer interface and select the Edit… option from its context sensitive menu. In the Connection Manager section of its Editor window, click on the New… command button. In the Connection Manager dialog box, type in the target Server name, specify the appropriate authentication method, and use Test Connection to validate your entries. In the Data access mode listbox, select Table or view and point to the [dbo].[ProspectiveBuyer] table. In the Columns section, deselect those which do not appear in the query statement listed above, click on the OK command button to confirm your choices, which will bring you back to the Designer interface.

Now it is time to define our transformation. Drag Data Mining Query from the Toolbox, position it directly underneath the newly created OLE DB Source, and extend the arrow representing the Data Flow Path to connect them together. Display the Data Mining Query Transformation Editor. From its Mining Model tab, define a new connection manager to the target instance of Analysis Services, select Targeted Mailing mining structure, and verify that it includes TM Decision Tree mining model. Switch to the Query tab and paste the entire content of our sample predictive query as it appears above.

Finally, add the OLE DB Destination to the Data Flow tab and connect the Data Flow Path to its input. For the sake of simplicity, create a new OLE DB connection manager pointing to the tempdb database on the target SQL Server instance, designate Table or view – fast load data access mode, and define a new table that will contain results of our transformation. By default, the table includes, in addition to columns referenced in the ProspectiveBuyer table, [Bike Buyer] (which, in our case, is equal to 1) and Expression (that represents probability with which a given customer is expected to make a purchase). This step concludes our setup, yielding a functional package, which execution populates the results table. Note that, as we pointed out earlier, it would be fairly straightforward to make its structure more elaborate simply by incorporating other components into the existing pipeline (for example, by adding the Conditional Split transformation to divide buyers into several groups, according to their varying level of expected motivation, as predicted by the mining model).

See all articles by Marcin Policht

 

Marcin Policht
Marcin Policht
Being a long time reader, I'm proud to join the crowd of technology gurus gathered here. I have a fair share of Microsoft exams behind me, a couple of acronyms after my name - MCSE, MCSD, and MCT, decent familiarity with majority of MS BackOffice products (SMS, SQL, Exchange, IIS), programming and scripting languages (VB, C++, VBScript with wsh) and several years of practical experience with Windows environment administration and engineering. My focus these days is on SQL Server and Windows 2000, and I'll attempt to share the most interesting experiences with these products.

Get the Free Newsletter!

Subscribe to Cloud Insider for top news, trends & analysis

Latest Articles