This article describes a well-known
concept, (Data Mining algorithms, built into Microsoft SQL Server 2000 Analysis
Services) and what I would like to see in the final "Yukon" SQL Server
release (i.e. my expectations in a field of new / improved data mining
algorithms).
What do we know
already? According to SQL Server 2000 Books On-Line: "Central to
the data mining process, data mining algorithms determine how the cases for a
data mining model are analyzed. Data mining model algorithms provide the
decision - making capabilities needed to classify, segment, associate and
analyze data for the processing of data mining columns that provide predictive,
variance, or probability information about the case set...
Many data mining algorithms
are goal-oriented; given a case set, a data-mining algorithm will predict
something about the case, usually an attribute of the case itself. Most
algorithms require a training set of cases where the attributes to be predicted
are already known, at which point the algorithm constructs a data mining model
capable of predicting these attributes for cases in which the attributes are
unknown".
Two data mining
algorithms are built-in into Microsoft SQL Server 2000 Analysis Services:
Microsoft Decision Trees and Microsoft Clustering.
Just a theory .
. .
Cluster
A set of similar cases.
Clustering
The development of a model that labels a new
instance as a member of a group of similar records (a cluster). See clustering
algorithms. For example, clustering could be used by a company to group
customers according to income, age, prior purchase behavior. Cluster detection
rarely provides actionable information, but rather feeds information to other
data mining tasks. (Reference: Barry, M. and Linoff, G. Data Mining
Techniques. 1997. "Chapter 10 - Automatic Cluster Detection).
Clustering Algorithms
Given a data set, these algorithms induce a
model that classifies a new instance into a group of similar instances. Commonly
the algorithms require that the number of (c) clusters to be identified is pre-specified.
E.g. find the c=10 best clusters. Given a distance metric, these algorithms
will try to find groups of records that have low distances within the cluster
but large distances with the records of other clusters. Reference: Hair, J.
F. et al, (1998) "Multivariate Data Analysis", 5th edition, Chapter
9, pages 469-517).
Decision Tree
A model made up of a root, branches and leaves.
Decision trees are similar to organization charts, with statistical information
presented at each node.
Decision Tree Algorithm
An algorithm that
generates classification or estimation models from the fields of Machine
Learning and Statistics. The basic approach of the algorithm is to use a
splitting criterion to determine the most predictive factor and place it as the
first decision point in the tree (the root), and continually perform this
search for predictive factors to build the branches of the tree until there is
no more data to continue with. Tree pruning raises accuracy on noisy data and
can be performed as the tree is being constructed (pre-pruning), or after the
construction (post-pruning). The algorithm is commonly used for classification
problems that require the model represented in a human-readable model . . .
How does SQL Server
Books On-Line describe both of these algorithms? Let's take a look . . .
Microsoft Decision Trees
The Microsoft
Decision Trees algorithm uses classification techniques to analyze data. It
then constructs one or more decision trees that can be used to predict
attributes or values for new data. For example, you can use this algorithm to
analyze credit history data and predict the credit risk of new applicants . . .
Microsoft Clustering
The Microsoft
Clustering algorithm uses the nearest neighbor method to group records into
clusters that share similar characteristics. Often, these characteristics may
be hidden or not intuitive . . ."
That's all,
(only 2 algorithms) for the current SQL Server release. What about "Yukon"
SQL Server? For now, it is an unknown, but I would like to see the following
in "Yukon":
- The ability to use data minig algorithms from third party providers as well as the ability to integrate them into the SQL Server environment;
- A set of a NEW (!) data mining algorithms, to build mining models more quickly than now;
- Data mining algorithms, combining both sequence analysis and clustering analysis;
- Data mining algorithms based on a modern "artificial intelligence" term.
Finally,
for those who are interested in Business Intelligence / Data
Mining topics I'd like to provide a few excerpts (on my opinion they contains
interesting links and info) from a past Microsoft TechNet Chat (check the full Technet
chat transcript at: http://www.microsoft.com/technet/treeview/default.asp?url=/technet/itcommunity/chats/trans/sql/sql0123.asp):
.
Just
before writing this article for DatabaseJournal.com, I found an article, describing
what to expect from the "Yukon" SQL Server (in a field of new /
improved Data Mining algorithms). Check the Technet online article at
http://www.microsoft.com/technet/treeview/default.asp?url=/technet/prodtechnol/sql/next/DWSQLSY.asp,
authored by Joy Mundy. "Overview of Business Intelligence and Data
Warehousing in SQL Server Yukon" contains an overview of the "Yukon"
SQL Server's new features from a Data Warehousing professional's point of view.
»
See All Articles by Columnist Alexzander Nepomnjashiy