Data Mining Algorithms: Microsoft SQL Server 2000 vs. "Yukon" SQL Server
February 3, 2004
This article describes a well-known concept, (Data Mining algorithms, built into Microsoft SQL Server 2000 Analysis Services) and what I would like to see in the final "Yukon" SQL Server release (i.e. my expectations in a field of new / improved data mining algorithms).
What do we know already? According to SQL Server 2000 Books On-Line: "Central to the data mining process, data mining algorithms determine how the cases for a data mining model are analyzed. Data mining model algorithms provide the decision - making capabilities needed to classify, segment, associate and analyze data for the processing of data mining columns that provide predictive, variance, or probability information about the case set...
Many data mining algorithms are goal-oriented; given a case set, a data-mining algorithm will predict something about the case, usually an attribute of the case itself. Most algorithms require a training set of cases where the attributes to be predicted are already known, at which point the algorithm constructs a data mining model capable of predicting these attributes for cases in which the attributes are unknown".
Two data mining algorithms are built-in into Microsoft SQL Server 2000 Analysis Services: Microsoft Decision Trees and Microsoft Clustering.
Just a theory . . .
A set of similar cases.
The development of a model that labels a new instance as a member of a group of similar records (a cluster). See clustering algorithms. For example, clustering could be used by a company to group customers according to income, age, prior purchase behavior. Cluster detection rarely provides actionable information, but rather feeds information to other data mining tasks. (Reference: Barry, M. and Linoff, G. Data Mining Techniques. 1997. "Chapter 10 - Automatic Cluster Detection).
Given a data set, these algorithms induce a model that classifies a new instance into a group of similar instances. Commonly the algorithms require that the number of (c) clusters to be identified is pre-specified. E.g. find the c=10 best clusters. Given a distance metric, these algorithms will try to find groups of records that have low distances within the cluster but large distances with the records of other clusters. Reference: Hair, J. F. et al, (1998) "Multivariate Data Analysis", 5th edition, Chapter 9, pages 469-517).
A model made up of a root, branches and leaves. Decision trees are similar to organization charts, with statistical information presented at each node.
Decision Tree Algorithm
An algorithm that generates classification or estimation models from the fields of Machine Learning and Statistics. The basic approach of the algorithm is to use a splitting criterion to determine the most predictive factor and place it as the first decision point in the tree (the root), and continually perform this search for predictive factors to build the branches of the tree until there is no more data to continue with. Tree pruning raises accuracy on noisy data and can be performed as the tree is being constructed (pre-pruning), or after the construction (post-pruning). The algorithm is commonly used for classification problems that require the model represented in a human-readable model . . .
How does SQL Server Books On-Line describe both of these algorithms? Let's take a look . . .
Microsoft Decision Trees
The Microsoft Decision Trees algorithm uses classification techniques to analyze data. It then constructs one or more decision trees that can be used to predict attributes or values for new data. For example, you can use this algorithm to analyze credit history data and predict the credit risk of new applicants . . .
The Microsoft Clustering algorithm uses the nearest neighbor method to group records into clusters that share similar characteristics. Often, these characteristics may be hidden or not intuitive . . ."
That's all, (only 2 algorithms) for the current SQL Server release. What about "Yukon" SQL Server? For now, it is an unknown, but I would like to see the following in "Yukon":
Finally, for those who are interested in Business Intelligence / Data Mining topics I'd like to provide a few excerpts (on my opinion they contains interesting links and info) from a past Microsoft TechNet Chat (check the full Technet chat transcript at: http://www.microsoft.com/technet/treeview/default.asp?url=/technet/itcommunity/chats/trans/sql/sql0123.asp): .
Q: Hi! Can you say anything about the new Data Mining algorithms to be included in Yukon?
A: We will have some new DM algorithms in Yukon, however, At this stage, we are not yet ready to give the list of new features in Yukon as we are in the middle of development cycle..
Q: Which third party companies' tools work best with MS Data Mining Tools?
A: You can try Angoss and DBMiner's products. They both have algorithm providers. Angoss also has some UI controls.
A: Here is the link to the Data Mining Performance paper: http://www.microsoft.com/SQL/evaluation/compare/AnalysisDMWP.asp
Q: From my own experience as a SQL Server instructor, I have seen that most SQL Server users do not think about the advantages they could have by using data mining techniques. Perhaps if they had some clear case studies about it, you could
A: Actually MSDN just posted an excellent example using Microsoft Data Mining for cross-sell at an online bookstore - check out http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnduwon/html/D52design.asp
good resources are the newsgroup microsoft.public.sqlserver.datamining and
the community site http://communities.msn.com/AnalysisServicesDataMining.
These are monitored frequently by the DM dev team . . ."
(check the full Technet chat transcript at:
Just before writing this article for DatabaseJournal.com, I found an article, describing what to expect from the "Yukon" SQL Server (in a field of new / improved Data Mining algorithms). Check the Technet online article at http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/next/DWSQLSY.asp, authored by Joy Mundy. "Overview of Business Intelligence and Data Warehousing in SQL Server Yukon" contains an overview of the "Yukon" SQL Server's new features from a Data Warehousing professional's point of view.