Using the Equal Areas Attribute Discretization Method
As we learned in Introduction
to Attribute Discretization, whenever we work with attributes, we
can expect to encounter two general types of values, discrete and contiguous. Discrete
values stand apart distinctly, and have clearly defined logical boundaries
between themselves. Citing the Gender attribute, within the Customer dimension
of the Adventure Works sample UDM, wherein the attribute is considered to have
only one of two discrete values,
female or male, we noted that possible values are naturally discrete for the
lions share of attributes occurring in the business world.
In contrast to discrete values, we noted that contiguous
values do not stand apart distinctly, but flow along, as if in a continuous
line. Moreover, we discussed the fact that contiguous values, especially within
large populations, can have very large numbers of possible values, and that information
consumers can find it difficult to work effectively and efficiently within such
wide ranges of values. As an example, we cited the Vacation Hours attribute, within
the Employee dimension of the Adventure Works sample UDM, an attribute which
could have a wide range of possible values, depending upon how many employees
are involved, whether there are limits on how many vacation days they can
accumulate, and considerations of this nature. (The member values are based
directly upon the values contained within the VacationHours column of the
DimEmployee table - with many of the values shared among multiple employees.)
The sheer number of values might make working with them cumbersome for
information consumers, if they are simply made available in their existing
state.
As we discussed in Introduction
to Attribute Discretization, discretization can help us to make it
easier for information consumers to work with large numbers of possible attribute
member values. As discretization creates a manageable number of groups of attribute
values that are clearly separated by boundaries, we can thereby group contiguous
values into sets of discrete values, via a system-generated collection
of consecutive dimension members known as member groups.
We also discussed, in Introduction to Attribute Discretization, that once
the discretization
process groups the attribute members into the member groups, the member groups
are then housed within a level within the dimensional hierarchy. (A given level
within a dimensional hierarchy can contain either members or member groups, but
not both.) When information consumers browse a level that contains member
groups, they see the names and cell values of the member groups. The members
generated by Analysis Services to support member groups are called grouping
members, and they look like ordinary members.
Analysis Services affords us flexibility in methods of attribute
discretization, based upon algorithms of varying complexity. The different
methods of discretization all have the same function to group contiguous
values into sets of discrete values. (Analysis Services can also effectively
discretize large numbers of discrete values, such as Social Security numbers.)
They simply manage grouping via different approaches.
Beyond the Dimension Designer, Analysis Services also supports user-defined discretization,
via data definition language (DDL), should the out of the box
approaches not meet the business needs of our local environments. Moreover, we
can alternatively implement custom discretization via the underlying data
warehouse, using views at the relational level; named calculations in the data
source view; calculated members in Analysis Services; or via other approaches.
Analysis Services supports four DiscretizationMethod
property settings, which include three pre-defined discretization methods:
-
None (default):
Analysis Services performs no grouping, and simply displays the attribute
members.
-
Automatic:
Analysis Services selects the method that best represents the data: either the EqualAreas
method or the Clusters method.
-
EqualAreas: Analysis
Services attempts to divide the members in the attribute into groups that
contain an equal number of members.
-
Clusters: Analysis
Services groups members by performing single-dimensional clustering on the
input values by using the K-Means algorithm. It uses Gaussian distributions.
This can only be used for numeric columns.
The DiscretizationMethod property determines whether Analysis
Services is to create groupings, and then determines the type of grouping that
is performed. As we learned in Introduction to
Attribute Discretization, Analysis Services does not perform any
groupings by default (the default setting for the DiscretizationMethod property
is None). When we enable Equal Areas
grouping, as we shall see in the practice session that follows, we direct Analysis
Services (via the DiscretizationMethod property) to create group ranges so that
the total population of dimension members is distributed equally across the
groups. We then process the affected dimension / cube and Analysis Services
creates group ranges, and then distributes the total population of attribute
members appropriately across those groups. As we shall see, once we specify a grouping
method, we next specify the number of groups, by using the DiscretizationBucketCount
property (its default value is zero).
As we
have noted throughout my MSSQL Server Analysis
Services column, as well as throughout my other Database Journal
series, one of the most important objectives in building a high performance Analysis
Services solution is an efficient and effective dimension design. The
identification of opportunities where we can effectively use attribute
discretization, and the effective design and placement of the member groups
that we generate thereby, can mean the provision of a much more
consumer-friendly interface for our clients and employers. We will gain hands - on exposure
to the Equal Areas
method of attribute discretization in the practice session that follows. Before we get started working
within a sample cube clone, we will need to prepare the local environment for
the practice session. We will take steps to accomplish this within the section
that follows.
Preparation: Locate and Open the Sample Basic UDM Created Earlier
In Dimensional Model
Components: Dimensions Part I, we created a sample basic Analysis Services database within which to
perform the steps of the practice sessions we set out to undertake in the
various articles of this subseries. Once we had ascertained that the new
practice database appeared to be in place, and once we had renamed it to ANSYS065_Basic
AS DB, we began our examination of dimension properties. We next undertook an examination
of attributes within the same practice environment, which we will now access (as
we did within the earlier articles of this subseries)) by taking the following
steps within the SQL Server Business Intelligence Development Studio.
NOTE: Please access the Analysis
Services database which we prepared in Dimensional Model Components: Dimensions Part I (and have used in subsequent articles)
before proceeding with this article. If you have not completed the preparation
to which I refer, or if you cannot locate / access the Analysis Services
database with which we worked in the referenced previous articles, please
consider taking the preparation steps provided in Dimensional Model Components: Dimensions Part I before continuing, and
prospectively saving the objects with which you work, so as to avoid the need
to repeat the preparation process we have already undertaken for subsequent
related articles within this subseries.
1.
Click Start.
2.
Navigate to,
and click, the SQL
Server Business Intelligence Development Studio, as appropriate.
We
briefly see a splash page that lists the components installed on the PC, and
then Visual Studio .NET 2005 opens at the Start page.
3.
Close the Start
page, if desired.
4.
Select File -> Open from the main menu.
5.
Click Analysis
Services Database ... from the cascading menu, as shown in Illustration 1, as
necessary (you may have left the Analysis Services database within the project
from a prior article).
Illustration 1: Opening the Analysis Services Database ...
The Connect
to Database dialog appears.
6.
Ensuring that
the Connect to existing database radio button atop the dialog is selected, type
the Analysis Server name into the Server input box (also near the top of the
dialog).
7.
Using the
selector just beneath, labeled Database, select ANSYS065_Basic AS DB, as depicted
in Illustration 2.
Illustration 2: Selecting the Basic Analysis Services Database ...
8.
Leaving other
settings on the dialog at default, click OK.
SQL
Server Business Intelligence Development Studio briefly reads the database from
the Analysis Server, and then we see the Solution Explorer populated with the
database objects. Having overviewed attribute discretization in this and the
previous article, we will now get some hands-on exposure to the use of Equal
Areas attribute discretization for the members of a representative dimension attribute
within our practice UDM.