The primary purpose of data mining is to discover patterns in raw data, which in turn offers a means of estimating future trends. This is accomplished by analyzing the data in the context of a few deliberately chosen mining models and their respective mining algorithms, which identify interrelated columns in the underlying mining structure and dictate how their content can be used to express these relationships in mathematical terms.
The process of pattern discovery, typically based on relatively small data sets, is referred to as training; it involves populating a designated mining model with that data, which provides the basis for subsequent predictions performed against substantially larger volumes of data. In this article, we will present the Data Mining Model Training destination, which allows you to incorporate this functionality directly into the Data Flow of SQL Server 2008 R2 Integration Services-based packages. Such an implementation approach is particularly handy in situations where models need to be retrained on a regular basis due to changes in the data structure or business parameters they reference.
Training of a mining model relies on a subset of data representing the business events that are the subject of the corresponding mining process. For the sake of simplicity, our presentation will take advantage of the existing AdventureWorksDW2008R2 database (and the Adventure Works DW 2008 R2 Analysis Services project that depends on it), downloadable from the CodePlex web site, where you will also find a detailed procedure illustrating its deployment.
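If you want to confirm that the sample database restored correctly before building the package, a quick T-SQL sanity check along the following lines (run against the Database Engine instance hosting the sample) should suffice; it references the view we will use as training input later in this article:

```sql
-- Sanity check: confirm the sample database is present and that the
-- view used later as training input is accessible.
USE AdventureWorksDW2008R2;
GO

SELECT COUNT(*) AS CustomerCount
FROM dbo.vTargetMail;
```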
The resulting Analysis Services installation includes the Targeted Mailing mining structure (similar in many respects to the Bike Buyer mining structure documented in the tutorial in SQL Server 2008 R2 Books Online), which, in turn, contains four mining models (TM Clustering, TM Decision Tree, TM Naive Bayes, and TM Neural Net).
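Once the project has been deployed, you can verify that the structure and its models are in place by querying the Analysis Services schema rowsets. A query along these lines, run in an MDX or DMX query window in SQL Server Management Studio, should list the four models together with their underlying algorithms (treat the exact output as indicative):

```sql
-- Enumerate deployed mining models and the algorithms behind them
-- (run against the Analysis Services instance hosting the sample project).
SELECT MODEL_NAME, SERVICE_NAME
FROM $SYSTEM.DMSCHEMA_MINING_MODELS;
```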
Since training typically takes into account only a portion of the entire data set, we will employ the Percentage Sampling Data Flow transformation to extract an arbitrarily chosen percentage of rows. Alternatively, you could accomplish the same goal by utilizing the Row Sampling transformation, which allows you to designate a specific number of rows instead (note that you could also apply either of these mechanisms to specify the records used to carry out mining model validation, although you should ensure that the two yield distinct, non-overlapping subsets, since utilizing the same one for both is bound to lead to skewed results; the sketch following this paragraph illustrates the idea). Depending on circumstances, it might also be prudent to implement data cleaning (removing invalid or superfluous records) and data modification (changing the data's characteristics to make it suitable for mining calculations), which we will discuss in one of our future articles.
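For context, the random split performed by these transformations is conceptually similar to the following T-SQL, which partitions vTargetMail into two non-overlapping subsets. This is purely an illustrative sketch (the 70/30 ratio and the NEWID-based randomization are arbitrary choices of ours), not what the transformations execute internally:

```sql
-- Illustrative only: a 70/30 training/validation split comparable in
-- spirit to Percentage Sampling. NEWID() yields a different split on
-- every run; hashing a key column instead would make it repeatable,
-- much like assigning a fixed randomization seed in the transformation.
SELECT CustomerKey,
       CASE WHEN ABS(CHECKSUM(NEWID())) % 100 < 70
            THEN 'Training' ELSE 'Validation' END AS SampleGroup
FROM dbo.vTargetMail;
```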
Let’s step through an example illustrating an implementation that leverages the Data Mining Model Training destination in a SQL Server 2008 R2 Integration Services-based package, relying on the AdventureWorksDW2008R2 sample database for training data and on the Adventure Works DW 2008 R2 Analysis Services project for the corresponding mining structure. Start by launching Business Intelligence Development Studio and creating a new project based on the Integration Services template. Switch to the Data Flow tab and activate the hyperlink displayed in the center of the main window stating "No Data Flow tasks have been added to this package. Click here to add a new Data Flow task."
With the new Data Flow task in place and active, drag an OLE DB Source from the Toolbox onto the Designer interface and select the Edit… option from its context-sensitive menu. In the Connection Manager section of its Editor window, click the New… command button. In the Connection Manager dialog box, type in the target Server name (which hosts the AdventureWorksDW2008R2 database), specify the appropriate authentication method, and use Test Connection to validate your entries. In the Data access mode listbox, select Table or view and point to the [dbo].[vTargetMail] item. Leave the Columns section unchanged (we will deal with the columns that are not included in our sample Targeted Mailing mining structure later).
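Incidentally, if you would rather trim the result set at the source instead, you could set Data access mode to SQL command and supply a query similar to the following sketch; the column list reflects what the Targeted Mailing structure consumes in the sample project, so verify it against your own structure definition:

```sql
-- Alternative to "Table or view": retrieve only the columns the
-- Targeted Mailing structure actually consumes (list is indicative;
-- confirm it against your structure definition).
SELECT CustomerKey, Age, BikeBuyer, CommuteDistance,
       EnglishEducation, EnglishOccupation, Gender, HouseOwnerFlag,
       MaritalStatus, NumberCarsOwned, NumberChildrenAtHome,
       Region, TotalChildren, YearlyIncome
FROM dbo.vTargetMail;
```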
Next, drag Percentage Sampling from the Toolbox onto the Designer area, position it below the OLE DB Source, and connect the data flow path originating from the latter to the top of the rectangle representing the former. Launch its Editor window by selecting the Edit… entry from its context-sensitive menu. The transformation operates by dividing its input into two outputs based on the ratio you assign via the Percentage of rows listbox, named by default Sampling Selected Output and Sampling Unselected Output (you can alter these names in an arbitrary manner using the Sample output name and Unselected output name textboxes). By default, rows are selected at random, based on the Windows tick count, which yields a different sample on each execution. If your intention is to ensure identical row selection during consecutive runs, you have the option to assign an integer value that will serve as the randomization seed and will remain the same whenever the package is invoked.
Now it is time for our protagonist. Scroll down to the bottom section of the Toolbox and drag the Data Mining Model Training destination onto the Designer interface, placing it below the Percentage Sampling transformation. Extend the data flow path carrying the Sampling Selected Output to its top edge and launch its Editor window by selecting the Edit… entry from its context-sensitive menu. On the Connection tab, define a new connection manager pointing to the target instance of Analysis Services (alternatively, it is also possible to create a connection to an Analysis Services project in the same solution), select the Targeted Mailing mining structure (note that you also have the option to create a new one by invoking the Data Mining Wizard directly from this interface via the New command button), and verify that it includes all of our mining models (i.e., TM Clustering, TM Decision Tree, TM Naive Bayes, and TM Neural Net). Switch to the Columns tab and ensure that all of the Mining Structure columns are mapped to available input columns; in our example, this will involve assigning EnglishEducation to Education and EnglishOccupation to Occupation.
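To clarify what the destination actually does, its effect is roughly equivalent to executing a DMX INSERT INTO statement against the mining structure, which populates the structure and, by extension, all of its models. The following sketch assumes a data source named Adventure Works DW defined in the Analysis Services database and abbreviates the column list; it is meant only to illustrate what "training" amounts to, not to replace the destination:

```sql
-- Rough DMX equivalent of the Data Mining Model Training destination:
-- populating the mining structure (and thereby its models) with cases.
-- The data source name and the abbreviated column list are assumptions
-- based on the sample project.
INSERT INTO MINING STRUCTURE [Targeted Mailing]
    ([Customer Key], [Age], [Bike Buyer], [Education], [Occupation])
OPENQUERY([Adventure Works DW],
    'SELECT CustomerKey, Age, BikeBuyer,
            EnglishEducation, EnglishOccupation
     FROM dbo.vTargetMail');
```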
It is important to note that in some scenarios it might be necessary to sort incoming rows, which can be implemented relatively easily by adding the Sort transformation to the existing Data Flow. This requirement comes into play if the target structure (as well as its models) contains a case-level column with the KEY TIME or KEY SEQUENCE content type, which serves as the basis for sorting. Since our presentation is targeted at Integration Services rather than Data Mining, a brief explanation of these terms is warranted here for the sake of clarity.
In data mining nomenclature, a case represents an entity that is the subject of the pattern analysis, such as a customer buying a bike. Its individual characteristics relevant to the mining process (which, in our sample model, include such properties as a customer’s age, number of cars owned, or occupation) correspond to attributes (if we were to attempt a somewhat simplified analogy, these terms would be equivalent, respectively, to the definitions of a row and columns in a relational database table). The identity of a case is determined by its case key, which, in our analogy, would be the primary key of the case table; however, quite frequently, business data being mined has a more complex structure, with nested tables containing nested attributes and their own keys.
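To make the nested table concept more tangible, consider the following hypothetical DMX definition, in which each customer case carries a nested table of purchases with its own key (the structure name and columns are invented purely for illustration):

```sql
-- Hypothetical structure illustrating a case table with a nested table.
-- [Customer Key] identifies the case; [Product] is the key of the
-- nested [Purchases] rows belonging to that case.
CREATE MINING STRUCTURE [Customer Purchases]
(
    [Customer Key] LONG KEY,
    [Age] LONG CONTINUOUS,
    [Purchases] TABLE
    (
        [Product] TEXT KEY
    )
);
```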
Attributes of mining structures can be described using two different mechanisms. The first of them identifies their data type and is, at least conceptually, identical to the way this is handled in traditional relational databases. The second one defines their content type by assigning additional rules that entries in these columns must comply with. Such rules might include, for example, limitations on the range of allowed values or their sequential nature. The latter is the one that applies here, with KEY TIME indicating that a column's values represent a time scale (which implies time series models) and KEY SEQUENCE representing a sequence of events (required when working with sequence clustering models). You can easily determine whether either of these content types is used by reviewing the definitions of mining structures and mining models in SQL Server Management Studio. If you are looking for more details regarding these topics, refer to the Content Types (Data Mining) and Data Types (Data Mining) articles in SQL Server 2008 R2 Books Online.
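Incidentally, the same information is exposed through the Analysis Services schema rowsets; assuming the limited WHERE clause support of these rowset queries permits the predicate (they accept only simple conditions), a query along these lines would list any model columns whose content type mandates sorted input:

```sql
-- List mining model columns whose content type requires sorted input
-- (run in a DMX or MDX query window against the Analysis Services
-- instance; column names follow the DMSCHEMA_MINING_COLUMNS rowset).
SELECT MODEL_NAME, COLUMN_NAME, CONTENT_TYPE
FROM $SYSTEM.DMSCHEMA_MINING_COLUMNS
WHERE CONTENT_TYPE = 'KEY TIME' OR CONTENT_TYPE = 'KEY SEQUENCE';
```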