This article describes a well-known
concept, (Data Mining algorithms, built into Microsoft SQL Server 2000 Analysis
Services) and what I would like to see in the final "Yukon" SQL Server
release (i.e. my expectations in a field of new / improved data mining
algorithms).
What do we know
already? According to SQL Server 2000 Books On-Line: "Central to
the data mining process, data mining algorithms determine how the cases for a
data mining model are analyzed. Data mining model algorithms provide the
decision – making capabilities needed to classify, segment, associate and
analyze data for the processing of data mining columns that provide predictive,
variance, or probability information about the case set…
Many data mining algorithms
are goal-oriented; given a case set, a data-mining algorithm will predict
something about the case, usually an attribute of the case itself. Most
algorithms require a training set of cases where the attributes to be predicted
are already known, at which point the algorithm constructs a data mining model
capable of predicting these attributes for cases in which the attributes are
unknown".
Two data mining
algorithms are built-in into Microsoft SQL Server 2000 Analysis Services:
Microsoft Decision Trees and Microsoft Clustering.
Just a theory .
. .
Cluster
A set of similar cases.
Clustering
The development of a model that labels a new
instance as a member of a group of similar records (a cluster). See clustering
algorithms. For example, clustering could be used by a company to group
customers according to income, age, prior purchase behavior. Cluster detection
rarely provides actionable information, but rather feeds information to other
data mining tasks. (Reference: Barry, M. and Linoff, G. Data Mining
Techniques. 1997. "Chapter 10 – Automatic Cluster Detection).
Clustering Algorithms
Given a data set, these algorithms induce a
model that classifies a new instance into a group of similar instances. Commonly
the algorithms require that the number of (c) clusters to be identified is pre-specified.
E.g. find the c=10 best clusters. Given a distance metric, these algorithms
will try to find groups of records that have low distances within the cluster
but large distances with the records of other clusters. Reference: Hair, J.
F. et al, (1998) "Multivariate Data Analysis", 5th edition, Chapter
9, pages 469-517).
Decision Tree
A model made up of a root, branches and leaves.
Decision trees are similar to organization charts, with statistical information
presented at each node.
Decision Tree Algorithm
An algorithm that
generates classification or estimation models from the fields of Machine
Learning and Statistics. The basic approach of the algorithm is to use a
splitting criterion to determine the most predictive factor and place it as the
first decision point in the tree (the root), and continually perform this
search for predictive factors to build the branches of the tree until there is
no more data to continue with. Tree pruning raises accuracy on noisy data and
can be performed as the tree is being constructed (pre-pruning), or after the
construction (post-pruning). The algorithm is commonly used for classification
problems that require the model represented in a human-readable model . . .
How does SQL Server
Books On-Line describe both of these algorithms? Let’s take a look . . .
Microsoft Decision Trees
The Microsoft
Decision Trees algorithm uses classification techniques to analyze data. It
then constructs one or more decision trees that can be used to predict
attributes or values for new data. For example, you can use this algorithm to
analyze credit history data and predict the credit risk of new applicants . . .
Microsoft Clustering
The Microsoft
Clustering algorithm uses the nearest neighbor method to group records into
clusters that share similar characteristics. Often, these characteristics may
be hidden or not intuitive . . ."
That’s all,
(only 2 algorithms) for the current SQL Server release. What about "Yukon"
SQL Server? For now, it is an unknown, but I would like to see the following
in "Yukon":
- The ability to use data minig algorithms from third party providers as well as the ability to integrate them into the SQL Server environment;
- A set of a NEW (!) data mining algorithms, to build mining models more quickly than now;
- Data mining algorithms, combining both sequence analysis and clustering analysis;
- Data mining algorithms based on a modern “artificial intelligence” term.
Finally,
for those who are interested in Business Intelligence / Data
Mining topics I’d like to provide a few excerpts (on my opinion they contains
interesting links and info) from a past Microsoft TechNet Chat (check the full Technet
chat transcript at: http://www.microsoft.com/technet/treeview/default.asp?url=/technet/itcommunity/chats/trans/sql/sql0123.asp):
.
Q: Hi! Can you say
anything about the new Data Mining algorithms to be included in Yukon?
A: We will have
some new DM algorithms in Yukon, however, At this stage, we are
not yet ready to give the list of new features in Yukon as we are
in the middle of development cycle..
Q: Which third
party companies’ tools work best with MS Data Mining Tools?
A: You can try Angoss
and DBMiner’s products. They both have algorithm providers. Angoss also has
some UI controls.
A: Here is the
link to the Data Mining Performance paper: http://www.microsoft.com/SQL/evaluation/compare/AnalysisDMWP.asp
Q: From my own
experience as a SQL Server instructor, I have seen that most SQL Server users
do not think about the advantages they could have by using data mining
techniques. Perhaps if they had some clear case studies about it, you could
A: Actually MSDN
just posted an excellent example using Microsoft Data Mining for cross-sell at
an online bookstore – check out http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnduwon/html/D52design.asp
Other
good resources are the newsgroup microsoft.public.sqlserver.datamining and
the community site http://communities.msn.com/AnalysisServicesDataMining.
These are monitored frequently by the DM dev team . . ."
(check the full Technet chat transcript at:
http://www.microsoft.com/technet/treeview/default.asp?url=/technet/itcommunity/chats/trans/sql/sql0123.asp)
Just
before writing this article for DatabaseJournal.com, I found an article, describing
what to expect from the "Yukon" SQL Server (in a field of new /
improved Data Mining algorithms). Check the Technet online article at
http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/next/DWSQLSY.asp,
authored by Joy Mundy. "Overview of Business Intelligence and Data
Warehousing in SQL Server Yukon" contains an overview of the "Yukon"
SQL Server’s new features from a Data Warehousing professional’s point of view.