Using the Clusters Attribute Discretization Method
As we learned in Introduction
to Attribute Discretization, whenever we work with attributes, we
can expect to encounter two general types of values, discrete and contiguous. Discrete
values stand apart distinctly, and have clearly defined logical boundaries
between themselves. Citing the Gender attribute, within the Customer dimension
of the Adventure Works sample UDM, wherein the attribute is considered to have
only one of two discrete values,
female or male, we noted that possible values are naturally discrete for the
lions share of attributes occurring in the business world.
In contrast to discrete values, we noted that contiguous
values do not stand apart distinctly, but flow along, as if in a continuous
line. Moreover, we discussed the fact that contiguous values, especially within
large populations, can have very large numbers of possible values, and that information
consumers can find it difficult to work effectively and efficiently within such
wide ranges of values. As an example, we cited the Sick Leave Hours attribute, within
the Employee dimension of the Adventure Works sample UDM, an attribute which
could have a wide range of possible values, depending upon how many employees
are involved, whether there are limits on how many sick leave days they can
accumulate, and considerations of this nature. (The member values are based
directly upon the values contained within the SickLeaveHours column of the
DimEmployee table - with many of the values shared among multiple employees.)
The sheer number of values might make working with them cumbersome for
information consumers, if they are simply made available in their existing
state.
As we discussed in Introduction
to Attribute Discretization, discretization can help us to make it
easier for information consumers to work with large numbers of possible attribute
member values. As discretization creates a manageable number of groups of attribute
values that are clearly separated by boundaries, we can thereby group contiguous
values into sets of discrete values, via a system-generated collection
of consecutive dimension members known as member groups.
We also discussed, in Introduction
to Attribute Discretization, that once the discretization process groups the attribute
members into the member groups, the member groups are then housed within a
level within the dimensional hierarchy. (A given level within a dimensional
hierarchy can contain either members or member groups, but not both.) When information
consumers browse a level that contains member groups, they see the names and
cell values of the member groups. The members generated by Analysis Services to
support member groups are called grouping members, and they look like ordinary
members.
Analysis Services affords us flexibility in methods of attribute
discretization, based upon algorithms of varying complexity. The different
methods of discretization all have the same function to group contiguous
values into sets of discrete values. (Analysis Services can also effectively
discretize large numbers of discrete values, such as Social Security numbers.)
The methods simply manage grouping via different approaches.
Beyond the Dimension Designer, Analysis Services also
supports user-defined discretization, via data definition language (DDL),
should the out of the box approaches not meet the business needs of our
local environments. Moreover, we can alternatively implement custom discretization
via the underlying data warehouse, using views at the relational level; named
calculations in the data source view; calculated members in Analysis Services;
or via other approaches.
Analysis Services supports four DiscretizationMethod
property settings, which include three pre-defined discretization methods:
-
None (default):
Analysis Services performs no grouping, and simply displays the attribute
members.
-
Automatic:
Analysis Services selects the method that best represents the data: either the EqualAreas
method or the Clusters method.
-
EqualAreas: Analysis
Services attempts to divide the members in the attribute into groups that
contain an equal number of members.
-
Clusters: Analysis
Services groups members by performing single-dimensional clustering on the
input values by using the K-Means algorithm. It uses Gaussian distributions. This
can only be used for numeric columns.
The DiscretizationMethod property determines whether Analysis
Services is to create groupings, and then determines the type of grouping that
is performed. As we learned in Introduction to Attribute Discretization,
Analysis Services does not perform any groupings by default (the default
setting for the DiscretizationMethod property is None). When we enable Clusters
grouping, as we shall see in the practice session that follows, we direct Analysis
Services (via the DiscretizationMethod property) to group members by performing
single-dimensional clustering on the input values.
We then process the affected dimension / cube and, according to the SQL
Server 2005 Books Online, the associated Analysis Services algorithm divides
the data into groups by sampling a random, 1000-row subset of the data (the
training data), initializing to a number of random points, and then running
several iterations of the Microsoft Clustering algorithm using the Expectation
Maximization (EM) clustering method. The Clusters method is particularly useful
because it works on any distribution curve. A disadvantage, however, lies in
the fact that this discretization method requires more processing time than the
other methods. Another limitation is that the Clusters method, as we have
noted, can only be applied to numeric data columns.
NOTE: If sampling by the algorithm is
undesirable, we can use the Equal Areas discretization method. See Attribute Discretization: Using the Equal Areas Method
for details.
As we shall see, and as we have seen with the other two
discretization methods in the articles I published to introduce each, once we specify
a grouping method, we next specify the number of groups, by using the DiscretizationBucketCount
property (its default value is zero).
As we
have noted throughout my MSSQL Server Analysis
Services column, as well as
throughout my other Database Journal series, one of the most important
objectives in building a high performance Analysis Services solution is an
efficient and effective dimension design. The identification of opportunities
where we can effectively use attribute discretization, and the effective design
and placement of the member groups that we generate thereby, can mean the
provision of a much more consumer-friendly interface for our clients and
employers. We will
gain hands - on exposure to the Clusters method of attribute discretization in the practice
session that follows. Before
we get started working within a sample cube clone, we will need to prepare the
local environment for the practice session. We will take steps to accomplish
this within the section that follows.
Preparation: Locate and Open the Sample Basic UDM Created Earlier
In Dimensional Model
Components: Dimensions Part I, we created a sample basic Analysis Services database within which to
perform the steps of the practice sessions we set out to undertake in the
various articles of this subseries. Once we had ascertained that the new practice
database was in place, and once we had renamed it to ANSYS065_Basic AS DB, we
began our examination of dimension properties. We continued with our
examination of attributes within the same practice environment, which we will
now access (as we did within the earlier articles of this subseries)) by taking
the following steps within the SQL Server Business Intelligence Development
Studio.
NOTE: Please access the Analysis
Services database which we prepared in Dimensional Model Components: Dimensions Part I (and have used in subsequent
articles) before proceeding with this article. If you have not completed the
preparation to which I refer, or if you cannot locate / access the Analysis
Services database with which we worked in the referenced previous articles,
please consider taking the preparation steps provided in Dimensional Model Components: Dimensions Part I before continuing, and
prospectively saving the objects with which you work, so as to avoid the need
to repeat the preparation process we have already undertaken for subsequent
related articles within this subseries.
1.
Click Start.
2.
Navigate to,
and click, the SQL
Server Business Intelligence Development Studio, as appropriate.
We
briefly see a splash page that lists the components installed on the PC, and
then Visual Studio .NET 2005 opens at the Start page.
3.
Close the Start
page, if desired.
4.
Select File -à Open from the main menu.
5.
Click Analysis
Services Database ... from the cascading menu, as shown in Illustration 1.
Illustration 1: Opening the Analysis Services Database ...
The Connect
to Database dialog appears.
6.
Ensuring that
the Connect to existing database radio button atop the dialog is selected, type
the Analysis Server name into the Server input box (also near the top of the
dialog).
7.
Using the
selector just beneath, labeled Database, select ANSYS065_Basic AS DB, as
depicted in Illustration
2.
Illustration 2: Selecting the Basic Analysis Services Database ...
8.
Leaving other
settings on the dialog at default, click OK.
SQL
Server Business Intelligence Development Studio briefly reads the database from
the Analysis Server, and then we see the Solution Explorer populated with the
database objects. Having overviewed attribute discretization in this and the
previous article, we will now get some hands-on exposure to the use of Automatic
attribute discretization for the members of a representative dimension attribute
within our practice UDM.