Attribute Discretization: Using the "Clusters" Method - Page 2

June 23, 2009

Using the Clusters Attribute Discretization Method

As we learned in Introduction to Attribute Discretization, whenever we work with attributes, we can expect to encounter two general types of values, discrete and contiguous. Discrete values stand apart distinctly, and have clearly defined logical “boundaries” between themselves. Citing the Gender attribute, within the Customer dimension of the Adventure Works sample UDM, wherein the attribute is considered to have only one of two discrete values, female or male, we noted that possible values are naturally discrete for the lion’s share of attributes occurring in the business world.

In contrast to discrete values, we noted that contiguous values do not stand apart distinctly, but flow along, as if in a continuous line. Moreover, we discussed the fact that contiguous values, especially within large populations, can have very large numbers of possible values, and that information consumers can find it difficult to work effectively and efficiently within such wide ranges of values. As an example, we cited the Sick Leave Hours attribute, within the Employee dimension of the Adventure Works sample UDM, an attribute which could have a wide range of possible values, depending upon how many employees are involved, whether there are limits on how many sick leave days they can accumulate, and considerations of this nature. (The member values are based directly upon the values contained within the SickLeaveHours column of the DimEmployee table - with many of the values shared among multiple employees.) The sheer number of values might make working with them cumbersome for information consumers, if they are simply made available in their existing state.

As we discussed in Introduction to Attribute Discretization, discretization can help us to make it easier for information consumers to work with large numbers of possible attribute member values. As discretization creates a manageable number of groups of attribute values that are clearly separated by boundaries, we can thereby group contiguous values into sets of discrete values, via a system-generated collection of consecutive dimension members known as member groups.

We also discussed, in Introduction to Attribute Discretization, that once the discretization process groups the attribute members into the member groups, the member groups are then housed within a level within the dimensional hierarchy. (A given level within a dimensional hierarchy can contain either members or member groups, but not both.) When information consumers browse a level that contains member groups, they see the names and cell values of the member groups. The members generated by Analysis Services to support member groups are called grouping members, and they look like ordinary members.

Analysis Services affords us flexibility in methods of attribute discretization, based upon algorithms of varying complexity. The different methods of discretization all have the same function – to group contiguous values into sets of discrete values. (Analysis Services can also effectively discretize large numbers of discrete values, such as Social Security numbers.) The methods simply manage grouping via different approaches.

Beyond the Dimension Designer, Analysis Services also supports user-defined discretization, via data definition language (DDL), should the “out – of – the – box” approaches not meet the business needs of our local environments. Moreover, we can alternatively implement custom discretization via the underlying data warehouse, using views at the relational level; named calculations in the data source view; calculated members in Analysis Services; or via other approaches.

Analysis Services supports four DiscretizationMethod property settings, which include three pre-defined discretization methods:

• None (default): Analysis Services performs no grouping, and simply displays the attribute members.
• Automatic: Analysis Services selects the method that best represents the data: either the EqualAreas method or the Clusters method.
• EqualAreas: Analysis Services attempts to divide the members in the attribute into groups that contain an equal number of members.
• Clusters: Analysis Services groups members by performing single-dimensional clustering on the input values by using the K-Means algorithm. It uses Gaussian distributions. This can only be used for numeric columns.

The DiscretizationMethod property determines whether Analysis Services is to create groupings, and then determines the type of grouping that is performed. As we learned in Introduction to Attribute Discretization, Analysis Services does not perform any groupings by default (the default setting for the DiscretizationMethod property is “None”). When we enable Clusters grouping, as we shall see in the practice session that follows, we direct Analysis Services (via the DiscretizationMethod property) to group members by performing single-dimensional clustering on the input values. We then process the affected dimension / cube and, according to the SQL Server 2005 Books Online, the associated Analysis Services algorithm divides the data into groups by sampling a random, 1000-row subset of the data (the “training data”), initializing to a number of random points, and then running several iterations of the Microsoft Clustering algorithm using the Expectation Maximization (EM) clustering method. The Clusters method is particularly useful because it works on any distribution curve. A disadvantage, however, lies in the fact that this discretization method requires more processing time than the other methods. Another limitation is that the Clusters method, as we have noted, can only be applied to numeric data columns.

NOTE: If sampling by the algorithm is undesirable, we can use the Equal Areas discretization method. See Attribute Discretization: Using the Equal Areas Method for details.

As we shall see, and as we have seen with the other two discretization methods in the articles I published to introduce each, once we specify a grouping method, we next specify the number of groups, by using the DiscretizationBucketCount property (its default value is zero).

As we have noted throughout my MSSQL Server Analysis Services column, as well as throughout my other Database Journal series’, one of the most important objectives in building a high performance Analysis Services solution is an efficient and effective dimension design. The identification of opportunities where we can effectively use attribute discretization, and the effective design and placement of the member groups that we generate thereby, can mean the provision of a much more consumer-friendly interface for our clients and employers. We will gain hands - on exposure to the Clusters method of attribute discretization in the practice session that follows. Before we get started working within a sample cube clone, we will need to prepare the local environment for the practice session. We will take steps to accomplish this within the section that follows.

Preparation: Locate and Open the Sample Basic UDM Created Earlier

In Dimensional Model Components: Dimensions Part I, we created a sample basic Analysis Services database within which to perform the steps of the practice sessions we set out to undertake in the various articles of this subseries. Once we had ascertained that the new practice database was in place, and once we had renamed it to ANSYS065_Basic AS DB, we began our examination of dimension properties. We continued with our examination of attributes within the same practice environment, which we will now access (as we did within the earlier articles of this subseries)) by taking the following steps within the SQL Server Business Intelligence Development Studio.

NOTE: Please access the Analysis Services database which we prepared in Dimensional Model Components: Dimensions Part I (and have used in subsequent articles) before proceeding with this article. If you have not completed the preparation to which I refer, or if you cannot locate / access the Analysis Services database with which we worked in the referenced previous articles, please consider taking the preparation steps provided in Dimensional Model Components: Dimensions Part I before continuing, and prospectively saving the objects with which you work, so as to avoid the need to repeat the preparation process we have already undertaken for subsequent related articles within this subseries.

1.  Click Start.

2.  Navigate to, and click, the SQL Server Business Intelligence Development Studio, as appropriate.

We briefly see a splash page that lists the components installed on the PC, and then Visual Studio .NET 2005 opens at the Start page.

3.  Close the Start page, if desired.

4.  Select File -à Open from the main menu.

5.  Click Analysis Services Database ... from the cascading menu, as shown in Illustration 1.

Illustration 1: Opening the Analysis Services Database ...

The Connect to Database dialog appears.

6.  Ensuring that the Connect to existing database radio button atop the dialog is selected, type the Analysis Server name into the Server input box (also near the top of the dialog).

7.  Using the selector just beneath, labeled Database, select ANSYS065_Basic AS DB, as depicted in Illustration 2.

Illustration 2: Selecting the Basic Analysis Services Database ...

8.  Leaving other settings on the dialog at default, click OK.

SQL Server Business Intelligence Development Studio briefly reads the database from the Analysis Server, and then we see the Solution Explorer populated with the database objects. Having overviewed attribute discretization in this and the previous article, we will now get some hands-on exposure to the use of Automatic attribute discretization for the members of a representative dimension attribute within our practice UDM.