Manage Distinct Count with a Virtual Cube
Overview and Discussion
We mentioned in our
introductory article, Distinct
Count Basics: Two Perspectives, that it is common in the
business environment to encounter
the need to quantify precisely the members of various sets of data. A
simple example, and one upon which we will expand in our hypothetical business
requirement, involves the number of customers who are purchasing a product, or
group of products, sold by an organization. We learned in the previous article
that we can exploit settings within MSAS' Analysis Manager, as well as take
more advanced approaches, to extend our analysis even further, and leverage
MSAS to reach our specific business objectives.
We discussed why distinct
counts differ from simple counts, noting that a distinct count
might comprise, as an example, a count of the different products that
were purchased, or of the individual customers who purchased our
products. To review our discussion, COUNT(), in providing a total
number of, say, customers, would also be providing multiple counts of the
same customers, because customers will have, in most cases, purchased
multiple products, multiple times. To reach our objective of counting different
customers, then, we would need to count each different customer, only once.
As we noted in our previous session, using COUNT() when DISTINCTCOUNT()
is required not only misstates the number of different customers, but it
also likely renders averages, and other metrics similarly based upon the count
value, misleading or totally useless in our analysis efforts.
In this article, we will discuss
and practice a solution for meeting an illustrative need, which expands upon
the customer example to which we have alluded. Our practice example will also
highlight the performance challenges that can arise in simply addressing such
requirements in an intuitive manner. We will then take steps to reshape our
solution to take advantage of another approach that meets the need, while
bettering the performance of the overall solution.
Considerations and Comments
For purposes of this
exercise, we will be working with the Warehouse cube, within the FoodMart
2000 MSAS database; these working samples accompany a typical installation
of MSAS. If the samples are not installed in, or have been removed from, your
environment, they can be obtained from the installation CD, as well as from the
Analysis Services section of the Microsoft website. If you prefer not to
alter the structure of your sample cubes as they currently exist, make copies
of the cube we reference in the article before beginning the practice
exercises. For instructions on copying cubes, see the Preparation
section of Introduction
to MSSQL Server 2000 Analysis Services: Semi-Additive Measures and Periodic
Balances.
Hands-On Procedure
We will begin with a scenario
that illustrates a requirement for a distinct count, using a
hypothetical business need to add practical value. Let's say that a group of
information consumers within the FoodMart organization have approached
us with an information request they wish to meet using the Sales cube.
The consumers want to be able to analyze the performance of products, by
category, both in terms of dollar sales, and number of different customers
contributing to those sales, for the third quarter (Q3) of 1997.
In addition, they wish to see an "average sales per (distinct) customer"
within the same dataset.
We will initially attempt
to meet the needs of the consumers with relatively simple MDX, having
introduced both MSAS and MDX approaches in Distinct Count Basics: Two Perspectives (see the steps provided there, if
you have joined the series with this article, and find the initial query we
present to be less than intuitive.)
Initial
Approach via MDX
Let's initialize the MDX Sample Application, the platform from which we
perform many practice exercises within the articles of our series. (We choose
it because any organization that has installed MSAS has access to the Sample
Application). We will create our initial query by taking the following
steps:
1.
Start the MDX
Sample Application.
We are
initially greeted by the Connect dialog, shown in Illustration 1.
Illustration 1:
The Connect Dialog for the MDX Sample Application
The
illustration above depicts the name of my server, MOTHER1, and properly
indicates that we will be connecting via the MSOLAP provider (the
default).
The MDX
Sample Application window appears.
A
blank Query pane appears.
4.
Ensure that FoodMart
2000 is selected as the database name in the DB box of the toolbar.
5.
Select the Sales
cube in the Cube drop-down list box.
The MDX
Sample Application window should resemble that depicted in Illustration
2, complete with the information from the Sales cube displaying in
the Metadata tree (left section of the Metadata pane).
Illustration 2: The MDX Sample Application Window
(Compressed View)
We
will begin creating our query with a focus on returning results to meet the
expressed business need of the information consumers. We will construct two calculated
members / measures, one to contain the distinct count of the Customers,
and one to calculate the average sale for each Product Category, per
individual Customer. We will then SELECT the two calculated
measures, presenting them to the immediate right of the Unit Sales
measure for each respective Product Category.
We
will retrieve a dataset with the measure / calculated measures forming the column
axis, and the Product Category forming the row axis.
1.
Create the
following new query:
-- ANSYS32-1 Initial Attempt at Distinct Customer Dataset
WITH
MEMBER
[Measures].[Distinct Customers]
AS
'COUNT(CrossJoin({[Unit Sales]},
Descendants ([Customers].CurrentMember,
[Customers].[Name])), ExcludeEmpty)'
MEMBER
[Measures].[Avg Sales per Customer]
AS
'[Measures].[Unit Sales]/[Measures].[Distinct Customers]'
SELECT
{ [Measures].[Unit Sales], [Measures].[Distinct Customers],
[Measures].[Avg Sales per Customer]} on Columns,
{[Product Category].Members} ON ROWS
FROM
[SALES]
WHERE
([Time].[1997].[Q3])
The above represents an
attempt to meet the information consumers' objectives with what appears to be
the straightforward use of the DISTINCTCOUNT() function within a
calculated member, to contain the count of the distinct Customers. We
then create a second calculated member based upon the first, which we divide
into the Unit Sales measure to derive the Average Sales per
(individual) customer, as requested by the intended audience. We SELECT
all three into the desired matrix to render the desired presentation.
The calculated member Distinct
Customers embodies the "heavy lifting" in the query. We used the
following definition (within the AS clause string for calculated member Distinct
Customers):
'COUNT(CrossJoin({[Unit Sales]},
Descendants ([Customers].CurrentMember,
[Customers].[Name])), ExcludeEmpty)'
to count the non-null Sales
/ Customer member tuples that it found, thereby deriving the number of
customers. Because we wish to avoid counting all customer names (the
lowest level of the Customer hierarchy), regardless of our level
position in the hierarchy, we inserted the Descendants() function
shown; this forces a limitation upon the count to solely the customers
under the current member of the Customers dimension.
2.
Execute the
query using the Run Query button.
The
results dataset appears as partially shown in Illustration 3.

Illustration 3:
The Results Dataset (Partial View)
The
first thing that we notice, after clicking Run Query, is that this query
takes a little longer to run than many of the "sample" queries we
have created in past articles. As a matter of fact, this is exactly the
observation that I am hoping even those new to MDX in general will make. The
query provides the data to meet the requirements of the information consumers,
but performance could become a problem.
The
overhead generated in the query is due to the requirement for MSAS to perform a
runtime assessment of each customer member, and there are many members. While
this overhead may not be unduly troublesome from the perspective of our sample
data, performance will be degraded far more within the context of the sizes of
member populations that exist in many production environments. The performance
degradation we have witnessed in our tiny sample cube is extrapolated to those
larger populations, to an extent that becomes very real to analysts and other
information consumers that rely upon the system to provide data in a reasonable
response time.
3.
Save the query
as ANSYS32-1 in a convenient location.
4.
Close the Sample
Application.
Our
next step will be to examine an option for mitigating the performance hit
suffered within the straightforward application of DISTINCTCOUNT()
within our query.