One of the primary goals of business intelligence is to transform raw data into meaningful information, by combining its sources, discovering its dependencies and patterns, and using them to predict future trends. While the majority of methods that deliver these objectives fall within the scope of Analysis Services and Data Mining, SQL Server 2008 R2 Integration Services offers a wide range of features that help streamline and automate their implementation. In this article, we will present Data Mining Query Task that illustrates this paradigm.
Besides its obvious purpose of executing Data Mining Extension (DMX) queries, the Data Mining Query task provides a means for capturing their outcome into an arbitrary table. Query execution leverages data mining models, which apply specifically crafted algorithms to data exposed via a mining structure.
To better understand this process, let’s review briefly some of its key concepts. Mining structure constitutes a data source view (commonly combining data from multiple sources) along with associated metadata (such as content type and its distribution), filtered in a manner that serves as a basis for subsequent analysis. Such analysis is conducted by employing mining models that might involve additional filtering (by using distinct sets of input columns or selecting rows that contain specific values only) and aliasing (referencing multiple, individually named copies of the mining structure column set) to facilitate execution of their respective mining algorithms (which, in turn, are utilized to execute predictive queries). Mining algorithms are responsible for identifying patterns and interdependencies in arbitrarily selected data subsets (divided into two partitions – training and testing), yielding forecasts that are used, with certain degree of probability, to extrapolate future events.
Selecting the most appropriate algorithm to match a specific business scenario requires fairly in-depth knowledge of their characteristics as well as good understanding of the underlying data and is beyond the scope of this article (for more information about the available choices incorporated into the product, as well as examples illustrating their use, refer to the Data Mining Algorithms article in SQL Server 2008 R2 Books Online).
In order to demonstrate the functionality of Data Mining Query Task, we will take advantage of the AdventureWorksDW2008R2 sample database (containing customer data of a fictitious company that manufactures and sells bicycles to international commercial markets) downloadable from the CodePlex web site, where you will also find detailed procedures illustrating its deployment. The resulting Analysis Services installation includes Targeted Mailing Mining Structure (similar in many aspects to Bike Buyer Mining Structure documented in the tutorial in SQL Server 2008 R2 Books Online), which, in turn, contains four mining models. While any of them can serve as the basis for our presentation, for the sake of clarity, we will choose TM Decision Tree, which is practically identical to the Decision Tree model referenced by the tutorial and use it to compose a batch prediction query. Its general format can be summarized as:
SELECT TOP <select list>
FROM [<mining model name>]
AS [<input alias>]
ON <on clause, mapping,>
WHERE <where clause, boolean expression,>
ORDER BY <expression>
As you can see, DMX queries borrow heavily from T-SQL syntax, with minor differences resulting from idiosyncrasy of Data Mining technology. In particular, <select list> allows you to combine columns from ProspectiveBuyer table (included in the Analysis Services data source (referenced by the OPENQUERY clause) and mining model (in our case, Adventure Works DW and TM Decision Tree respectively), which are joined together via PREDICTION JOIN (via a number of common columns). The WHERE clause indicates that we are interested in prospective customers that are likely to buy a bike and the value of PredictProbability function calculated against the Bike Buyer column provides the probability of such events. Effectively, the resulting query takes the following form:
[TM Decision Tree].[Bike Buyer],
[TM Decision Tree]
OPENQUERY([Adventure Works DW],
‘) AS t
[TM Decision Tree].[Marital Status] = t.[MaritalStatus] AND
[TM Decision Tree].[Gender] = t.[Gender] AND
[TM Decision Tree].[Yearly Income] = t.[YearlyIncome] AND
[TM Decision Tree].[Total Children] = t.[TotalChildren] AND
[TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[TM Decision Tree].[Education] = t.[Education] AND
[TM Decision Tree].[Occupation] = t.[Occupation] AND
[TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND
[TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]
WHERE[TM Decision Tree].[Bike Buyer] = 1
ORDER BY PredictProbability([Bike Buyer]) DESC
While you can execute this query from the SQL Server Management Studio (by connecting to Analysis Services instance, invoking New Query->DMX option from its context sensitive menu, selecting TM Decision Tree entry in the Mining model listbox, and pasting the statement listed above in the DMXQuery window), we are interested in its execution via SSIS Data Mining Query Task. To accomplish this, launch Business Intelligence Development Studio and create a new project based on the Integration Services template. Next, drag Data Mining Task from the Toolbox onto the Designer window and launch its Editor (by choosing Edit … entry from its context sensitive menu). Create a new connection utilizing OLE DB Provider for Analysis Services pointing to the target instance of Analysis Services (or a solution that includes target Analysis Services project where the relevant mining structures and models are defined). Choose Targeted Mailing mining structure in the drop-down list directly below and verify that its mining models are displayed underneath. Switch to the Query tab and enter the predictive query listed above into its textbox.
Note that you could accomplish the equivalent outcome (with the exception of the ORDER BY clause) by leveraging the New Data Mining Query dialog box (accessible by clicking on Build New Query command button). Parameter Mapping and Result Set tabs are not relevant in this case so effectively you can leave them blank. Switch to the output tab and define the second connection manager, which, as mentioned earlier, designates a database and its table where the results of the task’s execution are written.