The CROSSJOIN() Function: Breaking Bottlenecks
Introduction
As many of us know, the CROSSJOIN()
function is highly useful anytime we wish to generate a cross-product of
members in two different sets. The ability to specify "all possible
combinations" is convenient - indeed, the most straightforward way to
perform such a combination of two sets. Unfortunately, the indiscriminate use
of the CROSSJOIN() function, like many other MDX functions, can slow
reporting and analysis dramatically. This is often due to a failure to
understand how the function performs set combinations, and how its action can
lead to huge results datasets when applied to large cubes.
NOTE: For a detailed introduction to
the CROSSJOIN() function, see my Database Journal article Basic
Set Functions: The CrossJoin() Function.
In combining two sets, CROSSJOIN() combines every
member of the first set (all from a single dimension) with every member of the
second, creating a "Cartesian" effect as a result. Combining two
sets, for example, with the following query will illustrate a scenario, on the
scale of a decidedly small sample cube (the Warehouse sample that
installs with MSAS), where we experience a taste of the consequences when the
number of members in set 1, times the number of members in set 2, times the
member population in set 3, results in many combinations.
SELECT
{[Measures].[Warehouse Profit]} ON COLUMNS,
{CROSSJOIN([Warehouse].[Warehouse Name].Members,
CROSSJOIN([Store].[Store Name].Members,
[Product].[Product Name].Members))} ON ROWS
FROM
[WAREHOUSE]
This query would generate 898,560 combinations (24 individual
Warehouses times 24 Stores times 1,560 distinct Products).
A similar scenario, with the cube sizes we see these days, could be crippling
to performance. This is aggravated by the fact that sparsity is common enough in
large cubes, and CROSSJOIN() results on those cubes would likely have a much
higher sparsity factor. (The results dataset produced by the above query yields
only a tiny fraction of combinations (tuples) with non-empty measures.)
The process of generating all possible combinations, empty
or not, lies behind the performance drag, and becomes even more pronounced when
we go a step further and attempt to perform query operations, as we shall see
in the practice example, that must wait for the combinations to be assembled,
and then be applied to the resulting dataset. The time consumed in assembling
a large number of empty combinations has been wasted when they are "tossed"
in a subsequent step in the march toward the ultimate results.
For purposes of our practice procedure, we will assume that
we have been asked by management of a hypothetical client to investigate degradation
in performance of a query. The query was originally constructed at the request
of a group of information consumers in the Corporate Planning department,
shortly after the implementation of MSAS at the FoodMart organization. The
creator of the query, who initially wrote the MDX in a way that seemed
intuitive, intended to optimize it later. Unfortunately, he was laid off in the
wake of sudden moves by management to move all developer functions offshore.
Attempts to communicate with the offshore support team were
abandoned when it was learned that the building housing the group had been
destroyed in a natural disaster that had swept the region. (The inability
to obtain assistance with this issue paled, we are told, with the loss of
documentation, code, and other collateral for a host of enterprise projects in
process at the time of the disaster. Corporate financial information has also
"gone missing" - a fact that has been seized upon by the media in
short order.)
We listen closely to
the requirement. We then develop a plan to examine the query under
consideration, before offering options for improving its performance. As is
typically the case, we decide to work with a copy of the affected cube (in this
case, the Warehouse cube) to allow the original to remain isolated.