Introduction to Cube Storage

This article introduces storage in Analysis Services. Here, we will discuss various cube design concepts and considerations, including partitions and aggregations, focusing upon their impact on data storage and, to some extent, cube processing.

Note: For more information about my MSSQL Server Analysis Services column in general, see the section entitled “About the MSSQL Server Analysis Services Series” that follows the conclusion of this article.

Introduction

In Dimensional Model Components: Dimensions, Parts I and II, we undertook a general introduction to the dimensional model, noting its wide acceptance as the preferred structure for presenting quantitative and other organizational data to information consumers. As a part of our extended examination of dimensions, we discussed the primary objectives of business intelligence, including its capacity to support:

  • the presentation of relevant and accurate information representing business operations and events;
  • the rapid and accurate return of query results;
  • “slice and dice” query creation and modification;
  • an environment wherein information consumers can pose questions quickly and easily, and obtain results datasets rapidly.

The second objective above, the capacity of business intelligence to support “the rapid and accurate return of query results,” translates, of course, to minimal querying time. Storage design plays a key role in enhancing query performance across our cubes, as we shall see in the sections that follow.

In this article, we will introduce cube storage in general, and kick off an extended examination of the topic in a subsequent, multiple-article subseries. Within these articles, we will learn about Analysis Services storage from several perspectives, including:

  • An introduction to Storage Mode concepts;
  • An introduction to Partitions;
  • Partition Planning;
  • Hands-on sessions focused upon:
    • Creating Local Partitions within the Business Intelligence Development Studio;
    • Creating Remote Partitions within the Business Intelligence Development Studio;
    • Creating Partitions within SQL Server Management Studio;
    • Filtering Partitions;
    • Merging Partitions.
  • An introduction to Aggregations;
  • A practical examination of Aggregation Design and Application.

Introducing Cube Storage in Analysis Services

A classic quandary facing any OLAP designer lies between two often-conflicting needs: the need to minimize cube processing time, and the need to minimize the time required to process queries against the cube. At the heart of reaching an optimal overall state is the minimization of storage space, which contributes both to acceptable cube processing times and to the support of rapid queries. As we will learn in this article, as well as in the other articles of this subseries, Analysis Services supports our creation of an optimal storage design by providing storage mode options from which we can choose to meet the needs of our environments. Analysis Services also provides us with a means of creating aggregations within our cubes to support the rapid return of data at the levels and juxtapositions that our businesses require. Finally, Analysis Services allows us, as administrators, to modify the storage design of our cubes after the cube has been deployed and is in production.

Storage Modes in Analysis Services

In Analysis Services, the term “storage” may refer narrowly to cube metadata only, or it may include all of the source data from the fact table, as well as the aggregations defined by dimensions related to the measure group. The amount of data stored depends, therefore, upon the storage mode we select and the number of aggregations we define. Because the amount of data stored directly affects query performance, Analysis Services employs several techniques for minimizing the space required for storage of cube data and aggregations, including:

  • Storage options support our selection of the storage modes and locations that are most appropriate for our cube data.
  • A sophisticated algorithm supports the design of efficient summary aggregations to minimize storage without sacrificing speed.
  • Storage is not allocated for empty cells.

As we shall see in more detail later in this subseries, storage is defined on a partition-by-partition basis, and at least one partition exists for each measure group in a cube.
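To make this concrete, the following is a minimal XMLA sketch of the sort of script that adds a partition to a measure group. The database, cube, measure group, data source, and table names here are purely illustrative, not objects from any sample we have built in this series:

```xml
<!-- Hypothetical XMLA: add a partition to a measure group.
     All object IDs below are illustrative only. -->
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>MyOLAPDatabase</DatabaseID>
    <CubeID>Sales</CubeID>
    <MeasureGroupID>FactSales</MeasureGroupID>
  </ParentObject>
  <ObjectDefinition>
    <Partition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <ID>FactSales2004</ID>
      <Name>FactSales2004</Name>
      <Source xsi:type="QueryBinding">
        <DataSourceID>MyDataSource</DataSourceID>
        <!-- The query restricts this partition to one year of fact rows -->
        <QueryDefinition>
          SELECT * FROM FactSales WHERE OrderYear = 2004
        </QueryDefinition>
      </Source>
      <!-- Valid values: Molap, Rolap, Holap -->
      <StorageMode>Molap</StorageMode>
    </Partition>
  </ObjectDefinition>
</Create>
```

A script of this general shape can be executed from an XMLA query window within SQL Server Management Studio; in the hands-on articles of this subseries, we will create partitions through the graphical interfaces instead.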

Analysis Services provides a flexible and reliable architecture within which storage is an important consideration. As developers and administrators, we are afforded choices in the selection of storage mode – choices that allow us to make the most of the trade-off between minimal cube storage space (and, thus, minimal cube processing time, as we mentioned earlier) and minimal processing time for the queries we execute against the cube (improved, in many cases, when we increase the number of cube aggregations, and therefore the size of the cube). Since even before the advent of Analysis Services 2005 (the version in which metadata began to be stored as XML), we have been afforded choices – which, in the general sense, remain the same – in how we store data and its aggregations. Simply stated, we can choose between Analysis Services (where the most efficient data retrieval and calculation capability is a given), the relational database, or even “somewhere in between” as the location of data and data aggregation storage. Our choice of storage mode affects the query and processing performance, storage requirements, and storage locations of each partition, as well as its parent measure group and cube. The choice of storage mode also affects processing choices, as we shall see.

Depending upon these choices, our storage options for measure groups and dimensions are Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP), or Hybrid OLAP (HOLAP). (Analysis Services also supports proactive caching, which enables us to combine the characteristics of ROLAP and MOLAP storage for both immediacy of data and query performance. We discuss proactive caching in other articles of this series.) As we shall see, when we are choosing the storage mode for a given cube that we are designing or maintaining, we are, in essence, dictating the placement of the granular data and the data aggregations involved with that cube.

Let’s take a look at each of the general storage modes
in detail.

MOLAP

In the MOLAP storage mode, both granular data (to which we refer, at times, as the “details” or “detail data”) and aggregated data (also known as “balances” or “summaries”) are stored within the Analysis Services instance, in a proprietary format that is designed for rapid data retrieval and the efficient performance of sometimes complex calculations. The MOLAP storage mode is the default in Analysis Services, and offers the best query performance of the three general storage types. This enhanced performance is due primarily to the highly efficient architectural design of the indexed multidimensional data store (which is located on the Analysis Server) that underlies it. Additionally, query response times can be decreased substantially by using aggregations within a MOLAP data store in Analysis Services, as we have mentioned. Moreover, because the formatted data is housed within the Analysis Server, it does not need, in a query retrieval action, to be moved to the server from another physical location, as would be the case when the data has to be retrieved, for example, across a network from a relational data store.

Some disadvantages accompany the choice of the MOLAP storage mode. These include an increase in the demand for overall system storage, as the detail data, wherever it is housed, is duplicated within the cube. Moreover, the additional detail data increases cube processing time (although it means decreased query processing time, as access to the relational data source is not required once cube processing has been accomplished). Finally, because the data within a cube is isolated from the underlying relational data, changes in the relational data are not reflected within the MOLAP data store until the cube is reprocessed, meaning that the two data stores can become out of sync. The time between cube processing cycles creates a “latency” period during which data in OLAP objects may not match the source data, so the data in the cube is only as current as its most recent processing event.
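Resynchronizing a MOLAP partition with its relational source is accomplished through a processing command. As an illustration only, an XMLA request to fully reprocess a single partition might resemble the following (the object IDs are, again, hypothetical):

```xml
<!-- Hypothetical XMLA: fully reprocess one partition.
     Object IDs are illustrative only. -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>MyOLAPDatabase</DatabaseID>
    <CubeID>Sales</CubeID>
    <MeasureGroupID>FactSales</MeasureGroupID>
    <PartitionID>FactSales2004</PartitionID>
  </Object>
  <!-- ProcessFull rebuilds data and aggregations; incremental
       alternatives such as ProcessAdd append new fact rows only -->
  <Type>ProcessFull</Type>
</Process>
```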

The latency consideration is mitigated, somewhat, by the ease with which Analysis Services allows us to incrementally or fully update objects in MOLAP storage without taking the involved partition or cube offline. Although there are situations that may require us to take a cube offline to process certain structural changes to OLAP objects, we can minimize the downtime required to update MOLAP storage by implementing a “promotion” process, whereby we update and process cubes on a staging server, and then use database synchronization to copy the processed objects to the production server. We can also use proactive caching to minimize latency and maximize availability while retaining much of the performance advantage of MOLAP storage. (For more information about proactive caching, see other articles within my Introduction to MSSQL Server Analysis Services series.)
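As a sketch only, proactive caching is itself configured at the partition level; the fragment below suggests the general shape of such settings, with interval values (expressed as XSD durations) that are illustrative rather than recommendations:

```xml
<!-- Hypothetical fragment within a <Partition> definition;
     all interval values are illustrative only -->
<ProactiveCaching>
  <!-- Wait for a quiet period after source changes before rebuilding -->
  <SilenceInterval>PT10S</SilenceInterval>
  <!-- Rebuild anyway if changes keep arriving for this long -->
  <SilenceOverrideInterval>PT10M</SilenceOverrideInterval>
  <!-- Maximum staleness tolerated before the outdated MOLAP cache is dropped -->
  <Latency>PT30M</Latency>
</ProactiveCaching>
```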

ROLAP

In the ROLAP storage mode, the key word is “relational”: the detail data, along with aggregated / summary data (stored within indexed views designed for that purpose), is housed within a relational database that is specified within Analysis Services. Unlike the MOLAP storage mode, ROLAP does not cause a duplicate of the source data to be stored in the Analysis Services data folders. Instead, when results cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries.

Queries against Analysis Services are translated into one or more queries that are then redirected against the underlying relational database, to return the specified data that resides there. This is why the ROLAP storage mode offers the worst performance with regard to query processing. Cube processing is also typically slower with ROLAP.

Among the few post-Analysis Services 2000 scenarios where the ROLAP storage option might make sense are situations where we have one or more very large dimensions whose millions of members experience perpetual changes. One of the few advantages offered by ROLAP – an advantage that is abated to a large extent in this era of inexpensive storage – is that it allows us to handle cubes whose size is limited only by the underlying relational database. Moreover, ROLAP enables users to view data in real time, and can save storage space in instances where we are working with large datasets that are infrequently queried, such as purely historical data. Because of its decreasing utility, it would seem that the ROLAP option is destined for relative obscurity in coming years.

HOLAP

In the HOLAP storage mode, attributes of the MOLAP and ROLAP modes are combined. Like MOLAP, HOLAP causes data aggregations to be stored in a multidimensional structure within an Analysis Services instance, but HOLAP does not cause a copy of the source data to be stored. For this reason, when queries access only summary data in aggregations, HOLAP is the equivalent of MOLAP.

In the HOLAP storage mode, detail data remains in the relational data source. Queries that access detail data – for example, when we drill down to an atomic detail underlying an aggregation in a cube – must retrieve data from the relational database, and query processing will not be as fast as it would be if the detail data were stored in the MOLAP structure. With the HOLAP storage mode in place, information consumers will typically experience substantial differences in query retrieval times, depending upon whether the query can be resolved from cache or aggregations, versus from the relational source data itself.

Partitions stored as HOLAP are smaller than the equivalent MOLAP partitions because they do not contain source data, and they respond faster than ROLAP partitions for queries requesting summary data. The HOLAP storage mode is generally suited for partitions in cubes that require rapid query response for summaries based on a large amount of source data. However, where users generate queries that must touch leaf-level data, such as for calculating median values, MOLAP is generally a better choice.

We will examine the properties,
and the associated settings, that we use in designing and maintaining storage
in Analysis Services in subsequent articles of this column, where we
will gain hands-on exposure to these in a working environment.

Conclusion

In this
article, we introduced cube storage in general, and kicked off a multiple-article
subseries wherein we will perform an extended examination of the topic. We
discussed the classic quandary that faces any OLAP designer: the need to
minimize cube processing time, and the need to minimize the time
required to process queries against the cube. We then touched upon the
consideration at the heart of reaching an optimal overall state, the
minimization of storage space, which results in acceptable cube processing
times and the support of rapid queries.

We noted that Analysis Services supports our creation of an optimal storage design by providing storage mode options from which we can choose to meet the needs of our environments, and that Analysis Services allows us, as administrators, to modify the storage design of our cubes after the cube has been deployed and is in production. We then introduced the Analysis Services storage modes in general – Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP) and Hybrid OLAP (HOLAP) – and explored each individually, citing general characteristics, as well as advantages and disadvantages associated with its use. Finally, we looked forward to subsequent storage-related articles, where we will learn about Analysis Services storage from several perspectives, including:

  • An introduction to Partitions;
  • Partition Planning;
  • Hands-on sessions focused upon:
    • Creating Local Partitions within the Business Intelligence Development Studio;
    • Creating Remote Partitions within the Business Intelligence Development Studio;
    • Creating Partitions within SQL Server Management Studio;
    • Filtering Partitions;
    • Merging Partitions.
  • An introduction to Aggregations;
  • A practical examination of Aggregation Design and Application.

About the Series …

This article is a member of the series Introduction to MSSQL Server Analysis Services. The monthly column is designed to provide hands-on application of the fundamentals of MS SQL Server Analysis Services (“Analysis Services”), with each installment progressively presenting features and techniques designed to meet specific real-world needs. For more information on the series, please see my initial article, Creating Our First Cube.

William Pearson
Bill has been working with computers since before becoming a "big eight" CPA, after which he carried his growing information systems knowledge into management accounting, internal auditing, and various capacities of controllership. Bill entered the world of databases and financial systems when he became a consultant for CODA-Financials, a U.K.-based software company that hired only CPAs as application consultants to implement and maintain its integrated financial database - one of the most conceptually powerful, even in his current assessment, to have emerged. At CODA Bill deployed financial databases and business intelligence systems for many global clients. Working with SQL Server, Oracle, Sybase and Informix, and focusing on MSSQL Server, Bill created Island Technologies Inc. in 1997, and has developed a large and diverse customer base over the years since. Bill's background as a CPA, Internal Auditor and Management Accountant enables him to provide value to clients as a liaison between Accounting / Finance and Information Services. Moreover, as a Certified Information Technology Professional (CITP) - a Certified Public Accountant recognized for his or her unique ability to provide business insight by leveraging knowledge of information relationships and supporting technologies - Bill offers his clients the CPA's perspective and ability to understand the complicated business implications and risks associated with technology. From this perspective, he helps them to effectively manage information while ensuring the data's reliability, security, accessibility and relevance. Bill has implemented enterprise business intelligence systems over the years for many Fortune 500 companies, focusing his practice (since the advent of MSSQL Server 2000) upon the integrated Microsoft business intelligence solution.
He leverages his years of experience with other enterprise OLAP and reporting applications (Cognos, Business Objects, Crystal, and others) in regular conversions of these once-dominant applications to the Microsoft BI stack. Bill believes it is easier to teach technical skills to people with non-technical training than vice-versa, and he constantly seeks ways to graft new technology into the Accounting and Finance arenas. Bill was awarded Microsoft SQL Server MVP in 2009. Hobbies include advanced literature studies and occasional lectures, with recent concentration upon the works of William Faulkner, Henry James, Marcel Proust, James Joyce, Honoré de Balzac, and Charles Dickens. Other long-time interests have included the exploration of generative music sourced from database architecture.
