Database Journal
MS SQL Oracle DB2 Access MySQL PostgreSQL Sybase PHP SQL Etc SQL Scripts & Samples Links Database Forum

» Database Journal Home
» Database Articles
» Database Tutorials
MS SQL
Oracle
DB2
MS Access
MySQL
» RESOURCES
Database Tools
SQL Scripts & Samples
Links
» Database Forum
» Sitemap
Free Newsletters:
DatabaseDaily  
News Via RSS Feed


follow us on Twitter
Database Journal |DBA Support |SQLCourse |SQLCourse2
 

Featured Database Articles

MS SQL

Posted September 16, 2011

Data Mining Query Task in SQL Server Integration Services

By Marcin Policht

One of the primary goals of business intelligence is to transform raw data into meaningful information, by combining its sources, discovering its dependencies and patterns, and using them to predict future trends. While the majority of methods that deliver these objectives fall within the scope of Analysis Services and Data Mining, SQL Server 2008 R2 Integration Services offers a wide range of features that help streamline and automate their implementation. In this article, we will present Data Mining Query Task that illustrates this paradigm.

Besides its obvious purpose of executing Data Mining Extension (DMX) queries, the Data Mining Query task provides a means for capturing their outcome into an arbitrary table. Query execution leverages data mining models, which apply specifically crafted algorithms to data exposed via a mining structure.

To better understand this process, let's review briefly some of its key concepts. Mining structure constitutes a data source view (commonly combining data from multiple sources) along with associated metadata (such as content type and its distribution), filtered in a manner that serves as a basis for subsequent analysis. Such analysis is conducted by employing mining models that might involve additional filtering (by using distinct sets of input columns or selecting rows that contain specific values only) and aliasing (referencing multiple, individually named copies of the mining structure column set) to facilitate execution of their respective mining algorithms (which, in turn, are utilized to execute predictive queries). Mining algorithms are responsible for identifying patterns and interdependencies in arbitrarily selected data subsets (divided into two partitions – training and testing), yielding forecasts that are used, with certain degree of probability, to extrapolate future events.

Selecting the most appropriate algorithm to match a specific business scenario requires fairly in-depth knowledge of their characteristics as well as good understanding of the underlying data and is beyond the scope of this article (for more information about the available choices incorporated into the product, as well as examples illustrating their use, refer to the Data Mining Algorithms article in SQL Server 2008 R2 Books Online).

In order to demonstrate the functionality of Data Mining Query Task, we will take advantage of the AdventureWorksDW2008R2 sample database (containing customer data of a fictitious company that manufactures and sells bicycles to international commercial markets) downloadable from the CodePlex web site, where you will also find detailed procedures illustrating its deployment. The resulting Analysis Services installation includes Targeted Mailing Mining Structure (similar in many aspects to Bike Buyer Mining Structure documented in the tutorial in SQL Server 2008 R2 Books Online), which, in turn, contains four mining models. While any of them can serve as the basis for our presentation, for the sake of clarity, we will choose TM Decision Tree, which is practically identical to the Decision Tree model referenced by the tutorial and use it to compose a batch prediction query. Its general format can be summarized as:

SELECT TOP  <select list>
FROM [<mining model name>]
PREDICTION JOIN
OPENQUERY([<datasource>],'<SELECT statement>')
  AS [<input alias>]
ON <on clause, mapping,>
WHERE <where clause, boolean expression,>
ORDER BY <expression>

As you can see, DMX queries borrow heavily from T-SQL syntax, with minor differences resulting from idiosyncrasy of Data Mining technology. In particular, <select list> allows you to combine columns from ProspectiveBuyer table (included in the Analysis Services data source (referenced by the OPENQUERY clause) and mining model (in our case, Adventure Works DW and TM Decision Tree respectively), which are joined together via PREDICTION JOIN (via a number of common columns). The WHERE clause indicates that we are interested in prospective customers that are likely to buy a bike and the value of PredictProbability function calculated against the Bike Buyer column provides the probability of such events. Effectively, the resulting query takes the following form:

SELECT
  TOP 100
  t.[LastName],
  t.[FirstName],
  [TM Decision Tree].[Bike Buyer],
  PredictProbability([Bike Buyer])
FROM
  [TM Decision Tree]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT
      [LastName],
      [FirstName],
      [MaritalStatus],
      [Gender],
      [YearlyIncome],
      [TotalChildren],
      [NumberChildrenAtHome],
      [Education],
      [Occupation],
      [HouseOwnerFlag],
      [NumberCarsOwned]
    FROM
      [dbo].[ProspectiveBuyer]
    ') AS t
ON
  [TM Decision Tree].[Marital Status] = t.[MaritalStatus] AND
  [TM Decision Tree].[Gender] = t.[Gender] AND
  [TM Decision Tree].[Yearly Income] = t.[YearlyIncome] AND
  [TM Decision Tree].[Total Children] = t.[TotalChildren] AND
  [TM Decision Tree].[Number Children At Home] = t.[NumberChildrenAtHome] AND
  [TM Decision Tree].[Education] = t.[Education] AND
  [TM Decision Tree].[Occupation] = t.[Occupation] AND
  [TM Decision Tree].[House Owner Flag] = t.[HouseOwnerFlag] AND
  [TM Decision Tree].[Number Cars Owned] = t.[NumberCarsOwned]
WHERE[TM Decision Tree].[Bike Buyer] = 1
ORDER BY PredictProbability([Bike Buyer]) DESC

While you can execute this query from the SQL Server Management Studio (by connecting to Analysis Services instance, invoking New Query->DMX option from its context sensitive menu, selecting TM Decision Tree entry in the Mining model listbox, and pasting the statement listed above in the DMXQuery window), we are interested in its execution via SSIS Data Mining Query Task. To accomplish this, launch Business Intelligence Development Studio and create a new project based on the Integration Services template. Next, drag Data Mining Task from the Toolbox onto the Designer window and launch its Editor (by choosing Edit ... entry from its context sensitive menu). Create a new connection utilizing OLE DB Provider for Analysis Services pointing to the target instance of Analysis Services (or a solution that includes target Analysis Services project where the relevant mining structures and models are defined). Choose Targeted Mailing mining structure in the drop-down list directly below and verify that its mining models are displayed underneath. Switch to the Query tab and enter the predictive query listed above into its textbox.

Note that you could accomplish the equivalent outcome (with the exception of the ORDER BY clause) by leveraging the New Data Mining Query dialog box (accessible by clicking on Build New Query command button). Parameter Mapping and Result Set tabs are not relevant in this case so effectively you can leave them blank. Switch to the output tab and define the second connection manager, which, as mentioned earlier, designates a database and its table where the results of the task's execution are written.

See all articles by Marcin Policht



MS SQL Archives

Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 




Latest Forum Threads
MS SQL Forum
Topic By Replies Updated
SQL 2005: SSIS: Error using SQL Server credentials poverty 3 August 17th, 07:43 AM
Need help changing table contents nkawtg 1 August 17th, 03:02 AM
SQL Server Memory confifuration bhosalenarayan 2 August 14th, 05:33 AM
SQL Server Primary Key and a Unique Key katty.jonh 2 July 25th, 10:36 AM


















Thanks for your registration, follow us on our social networks to keep up-to-date