Cube Storage: Introduction to Partitions

This article introduces partitions in Analysis Services. We will discuss their
characteristics and considerations surrounding their use, as well as their
impact on data storage and cube processing.

Note: For more information about my MSSQL Server Analysis Services column in
general, see the section entitled “About the Series …” that follows the
conclusion of this article.

Introduction

In Dimensional Model Components: Dimensions Parts I and
II, we undertook a general
introduction to the dimensional model, noting its wide acceptance as the
preferred structure for presenting quantitative and other organizational data
to information consumers. As a part of our extended examination of dimensions,
we discussed the primary objectives of business intelligence, including its
capacity to support:

  • the presentation
    of relevant and accurate information representing business operations and
    events;
  • the rapid and
    accurate return of query results;
  • “slice and
    dice” query creation and modification;
  • an environment
    wherein information consumers can pose questions quickly and easily, and
    obtain results datasets rapidly.

We noted
in Cube
Storage: Introduction
that the
second objective above, the capacity of business intelligence to support “the
rapid and accurate return of query results”, translates to minimal querying
time. We discussed that storage design plays a key role in enhancing query performance
across our cubes, and, as we will learn in this article, partitions play a
significant role in the way that Analysis Services manages and stores data and
aggregations for a measure group in a cube.

In this
article, we will continue the general exploration of cube storage that we began
in Cube Storage: Introduction, this
time focusing upon partitions. Our introduction to partitions here will lead
to more detailed exploration of various concepts surrounding partitions in
subsequent articles that examine partition planning, as well as hands-on
sessions focused upon various tasks surrounding partitions, including:

  • the creation
    of local partitions within the Business Intelligence Development Studio;
  • the creation
    of multiple partitions for a single measure group, based upon views in the
    underlying relational database;
  • the creation
    of multiple partitions for a single measure group, based upon named queries in Analysis
    Services;
  • the creation
    of remote partitions within the Business Intelligence Development Studio;
  • the creation
    of partitions within SQL Server Management Studio;
  • filtering partitions;
  • merging partitions.

Introducing Partitions in Analysis Services

Because
the data sources underlying our cubes, and even the cubes themselves, can
become very large in physical size, storage becomes a significant consideration
within cube design strategy. A partition is a physical container on hard disk
(stored as a separate set of files) that holds a subset of the data included in
an Analysis Services database. Analysis
Services uses partitions to manage and store data and aggregations for a measure
group in a cube.

Partitions
make it possible for us to spread data over multiple hard disks, should data
growth or other factors dictate the need and/or convenience of doing so.
This spreading of data can involve local partitions (which are stored on hard
disk on the same server as the Analysis Services instance), remote partitions
(which are stored on other servers), or a combination of the two types.
Partitions rely on storage settings to define the storage mode and processing
options for their data, and they use writeback settings to enable “what-if” analysis.

Local and Remote Partitions

For each measure
group we create within a given cube, a single partition, containing all data
and metadata within and about the measure group, is created to support that measure
group. Although we begin cube creation with a single partition for each measure
group, we begin to experience performance deterioration as a partition
grows larger. We can typically reduce the time required to process and query
the cube by dividing a single large partition into smaller partitions. We do
this by adding explicitly created partitions alongside the existing partition
– partitions across which the existing data can be spread. (When we create a
new partition for a measure group, the new partition is added to the set of partitions
that already exist for the measure group.)

A
given measure group reflects the combined data that is contained in all
its partitions. When establishing multiple partitions, therefore, we must
ensure that the data for any partition in a measure group excludes the data for
any other partition in the measure group, to ensure that data is not “double
counted” in the measure group. The original partition, created when we create a
given measure group, is based on a single fact table in the data source view of
the cube. When multiple partitions support a measure group, each partition can
reference a different table in either the data source view or in the underlying
relational data source for the cube. While more than one partition in a measure
group can reference the same table, each partition must be restricted, through filtering
or another method, to different rows in the table to prevent the double
counting we have mentioned.

NOTE: We explore filtering, and other means of restricting the data that is
stored in a partition, in other articles of this subseries.
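The mutual-exclusivity requirement above can be made concrete with a small Python sketch (purely illustrative; the table, column, and partition names are hypothetical). It checks that a set of partition filters over a shared fact table assigns every fact row to exactly one partition, so nothing is double counted or dropped:

```python
# Illustrative check: partition filters over one fact table must be mutually
# exclusive (no double counting) and collectively exhaustive (no lost rows).
fact_rows = [
    {"order_id": 1, "month": "2007-01", "amount": 100.0},
    {"order_id": 2, "month": "2007-02", "amount": 250.0},
    {"order_id": 3, "month": "2007-02", "amount": 75.0},
]

# Each partition restricts the shared fact table to a distinct slice of rows.
partition_filters = {
    "FactSales_200701": lambda row: row["month"] == "2007-01",
    "FactSales_200702": lambda row: row["month"] == "2007-02",
}

def check_partitioning(rows, filters):
    """Return True only if every row matches exactly one partition filter."""
    for row in rows:
        matches = [name for name, pred in filters.items() if pred(row)]
        if len(matches) != 1:
            return False
    return True

print(check_partitioning(fact_rows, partition_filters))  # prints True
```

An overlapping or gapped set of filters would fail this check, signaling exactly the double-counting (or data-loss) condition the article warns against.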

When we
spread the data across multiple drives on the same server, we refer to the
resulting partitions as local partitions. When we spread the data over multiple
machines, we are establishing remote partitions.

The Benefits of Partitioning Measure
Groups

The processing
time required for large measure groups can be reduced when we partition those
groups, because processing can then be undertaken in parallel across the partitions.
(Parallel processing means faster execution, primarily because the processing
of one partition does not have to finish before the processing of another can
start; more than one processing job can run at the same time, typically utilizing
processor capacity more efficiently.) And when we distribute the data over
multiple machines with remote partitions, we not only provide more physical
room for large volumes of data, but we make it possible for multiple computers
to process the data in parallel.
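A minimal Python sketch of this idea (process_partition() is a hypothetical stand-in for a real Analysis Services processing job, not an actual API) shows several partition jobs running concurrently rather than one after another:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(name):
    # Stand-in for a real processing job; in Analysis Services each partition
    # can be processed independently of the others.
    return f"{name}: processed"

partitions = ["FY2005", "FY2006", "2007-01", "2007-02"]

# Serial processing would run the jobs one after another; a pool lets more
# than one job run at the same time, as described above.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_partition, partitions))

print(results)
```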

It is
easy to see how partitions afford us a powerful and flexible means of managing large
cubes. For example, a cube that contains financial information can contain a partition
for the data of each past year, together with partitions for each month of the
current year. In general, only the current monthly partition would require
processing when current information is added to the cube. Because we would be
processing a significantly smaller amount of data, processing time would be
decreased, perhaps dramatically, and performance enhanced accordingly. At the end of the year
the twelve monthly partitions could be merged into a single partition for the
year to which they belong, and a new partition could be created for the first month
of the new year. (We gain hands-on exposure to merging partitions in an
independent article of this subseries.) Moreover, this new partition creation
process could be automated as part of our data warehouse loading and cube
processing procedures.
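The year-end rollover in this example can be sketched as simple bookkeeping over partition names (a hypothetical illustration only; in practice the merge and creation would be performed through Analysis Services tooling or automated scripts):

```python
def roll_over_year(partitions, year):
    """Merge the twelve monthly partitions for `year` into one annual
    partition, and open a partition for the first month of the new year."""
    monthly = [p for p in partitions if p.startswith(f"{year}-")]
    kept = [p for p in partitions if p not in monthly]
    return kept + [f"FY{year}", f"{year + 1}-01"]

# Two annual partitions for past years, plus twelve monthly partitions.
parts = ["FY2005", "FY2006"] + [f"2007-{m:02d}" for m in range(1, 13)]
print(roll_over_year(parts, 2007))  # ['FY2005', 'FY2006', 'FY2007', '2008-01']
```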

Although partitions
are not visible to business users of a cube, administrators can easily configure,
add, or drop partitions. Each partition is physically stored in a separate set
of files. The aggregate data of each partition can be stored on the instance of
Analysis Services where the partition is defined, on another instance of Analysis
Services, or in the data source that is used to supply the partition’s source
data. As we have noted, partitions allow the source data and aggregate data of
a cube to be distributed across multiple hard drives and among multiple server
computers. For a cube of moderate to large size, partitions can greatly improve
query performance, load performance, and ease of cube maintenance.

The
storage mode of each partition can be configured independently of other partitions
in the measure group. Partitions can be stored by using various combinations of
options for source data location, storage mode, proactive caching, and
aggregation design. Options for real-time OLAP and proactive caching allow us to
balance query speed against latency when we design a partition. Storage options
can also be applied to related dimensions and to facts in a measure group. This
flexibility enables us to design cube storage strategies appropriate to the
needs of our environments.

NOTE: For more information on storage modes, see Cube Storage: Introduction,
the initial article of this subseries of my monthly Introduction to MSSQL
Server Analysis Services series here at Database Journal.

The Content and Structure of Partitions

Partitions
are physical “containers” housing a subset of the data of a measure group. Partitions
are not visible to MDX queries, and are not apparent in cube browsers or
reporting applications. Regardless of the number of partitions that are
defined for a given measure group, these tools reflect the whole content of the
measure group.

A
simple partition is composed of:

  • Basic
    Information – including the partition name, its storage mode, the processing
    mode, and other information.
  • Slicing
    Definition – an MDX expression specifying a tuple or a set, subject to the
    same restrictions as the StrToSet() MDX function with the CONSTRAINED
    flag – that is, the slicing definition can employ dimension,
    hierarchy, level and member names, keys, unique names, or other named objects
    in the cube, but cannot use MDX functions.
  • Aggregation Design
    – a collection of aggregation definitions that can be shared across multiple partitions.
    (The default is taken from the parent cube’s aggregation design).

The
structure of a given partition must match the structure of the measure group
that it supports, which means that the measures that define the measure group
must also be defined in the partition, along with all related dimensions. It is
for this reason that, when a partition is created, it automatically inherits
the same set of measures and related dimensions that are defined for the measure
group (whose creation triggers the partition’s creation).

Each
partition in a measure group can have a different fact table, and these fact
tables can exist within different data sources. When different partitions in a measure
group have different fact tables, the tables must be sufficiently similar to
maintain the structure of the measure group (which means that the processing
query must return the same columns and data types for all fact tables for all partitions).
When fact tables for different partitions are from different data sources, the
source tables for any related dimensions, and also any intermediate fact
tables, must also be present in all data sources and must have the same
structure in all the databases. Also, all dimension table columns that are used
to define attributes for cube dimensions related to the measure group must be
present in all of the data sources. There is no need to define all the joins
between the source table of a partition and a related dimension table if the partition
source table has a structure identical to that of the source table for the measure
group.

Columns
that are not used to define measures in the measure group can be present in
some fact tables but absent in others. Similarly, columns that are not used to
define attributes in related dimension tables can be present in some databases
but absent in others. Tables that are not used for either fact tables or
related dimension tables can be present in some databases but absent in others.

Data Sources and Partition Storage

A
partition is based either on a table or view in a data source, or on a table or
named query in a data source view. (We examine the setup of each within
independent articles of this subseries.) The location where partition data is
stored is defined by the data source binding. Typically, we can partition a measure
group horizontally or vertically:

  • In a horizontally
    partitioned measure group, each partition in a measure group is based on a
    separate table. This approach to partitioning is appropriate when data is
    separated into multiple tables. As an illustration, some relational databases
    have a separate table for each month’s data.
  • In a vertically
    partitioned measure group, a measure group is based on a single table, and each
    partition is based on a source system query that filters the data for the partition.
    For example, if a single table contains several months’ data, the measure group
    could still be partitioned by month by applying a Transact-SQL WHERE clause
    that returns a separate month’s data for each partition.
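The vertical approach can be sketched by generating a per-month source query for each partition (hypothetical fact table and column names; the WHERE clauses mirror the Transact-SQL filtering described above):

```python
def monthly_partition_queries(fact_table, date_column, year):
    """Build one filtered source query per monthly partition of `year`."""
    queries = {}
    for month in range(1, 13):
        name = f"{fact_table}_{year}{month:02d}"
        queries[name] = (
            f"SELECT * FROM {fact_table} "
            f"WHERE YEAR({date_column}) = {year} "
            f"AND MONTH({date_column}) = {month}"
        )
    return queries

qs = monthly_partition_queries("FactInternetSales", "OrderDate", 2007)
print(qs["FactInternetSales_200703"])
```

Because the twelve WHERE clauses are mutually exclusive, no fact row can appear in more than one partition.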

As
we mentioned earlier, each partition has storage settings that determine
whether the data and aggregations for the partition are stored in the local
instance of Analysis Services or in a remote partition using another instance
of Analysis Services. The storage settings can also specify the storage mode
and whether proactive caching is used to control latency for a partition.

Precautions with Partitions

Anytime
we create and manage multiple-partition measure groups, we have to take precautions
to guarantee that cube data is accurate. Although these precautions do not
usually apply to single-partition measure groups, they do apply when we incrementally
update partitions. It is important to understand that, when we incrementally
update a partition, a new temporary partition is created that has a structure
identical to that of the source partition – this partition contains the incremental,
or “delta,” data. The temporary partition is processed and then merged with
the source partition. Therefore, we must ensure that the processing query that
populates the temporary partition does not duplicate any data already present
in an existing partition. (We explore the concepts surrounding, and get some
hands-on exposure to performing, both filtering and partition merging in other
articles of this subseries.)
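The precaution can be illustrated conceptually (hypothetical Python and data; real incremental processing is configured through Analysis Services): the query populating the temporary delta partition must exclude rows the source partition already holds before the two are merged:

```python
def incremental_update(partition_rows, delta_rows):
    """Merge a temporary 'delta' partition into its source partition,
    excluding rows already present so nothing is double counted."""
    existing = {row["order_id"] for row in partition_rows}
    safe_delta = [row for row in delta_rows if row["order_id"] not in existing]
    return partition_rows + safe_delta

current = [{"order_id": 1, "amount": 100.0}]
delta = [{"order_id": 1, "amount": 100.0},   # already present in the source
         {"order_id": 2, "amount": 40.0}]    # genuinely new
merged = incremental_update(current, delta)
# merged now holds order_ids 1 and 2 exactly once each
```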

We will examine many of the
properties, and the associated settings, that we use in creating and
maintaining partitions in Analysis Services in subsequent articles of this monthly
column, where we will gain hands-on exposure to these in a working environment.

Conclusion

In this
article, we continued the general exploration of cube storage that we began in Cube Storage: Introduction, this time focusing
upon partitions. Our introduction to partitions is intended to serve as a
lead-in to more detailed exploration of various concepts surrounding partitions
in subsequent, independent articles that examine partition planning, as well as
hands-on sessions focused upon various tasks surrounding partitions.

We
explored the concepts of local and remote partitions, and then discussed the
benefits we can expect to accrue when we partition the measure groups of our
cubes. We next focused upon the content and structure of partitions, and then
examined considerations surrounding data sources and partition storage.
Finally, we touched upon precautions that we need to keep in mind when we create
and manage multiple-partition measure groups, especially when we perform incremental
updates.

Throughout
our introduction to partitions in Analysis Services, we looked forward to subsequent
partition–related articles, where we will gain hands-on exposure to various
tasks involved in the creation and maintenance of partitions, including:

  • the creation
    of local partitions within the Business Intelligence Development Studio;
  • the creation
    of multiple partitions for a single measure group, based upon views in the
    underlying relational database;
  • the creation
    of multiple partitions for a single measure group, based upon named queries in Analysis
    Services;
  • the creation
    of remote partitions within the Business Intelligence Development Studio;
  • the creation
    of partitions within SQL Server Management Studio;
  • filtering partitions;
    and
  • merging partitions.

About the Series …

This
article is a member of the series Introduction
to MSSQL Server Analysis Services
. The monthly column is designed to provide hands-on
application of the fundamentals of MS SQL Server Analysis Services (“Analysis
Services”), with each installment progressively presenting features and
techniques designed to meet specific real-world needs. For more information on
the series, please see my initial article, Creating Our First Cube.

William E. Pearson, III

William Pearson
Bill has been working with computers since before becoming a "big eight" CPA, after which he carried his growing information systems knowledge into management accounting, internal auditing, and various capacities of controllership. Bill entered the world of databases and financial systems when he became a consultant for CODA-Financials, a U.K.-based software company that hired only CPAs as application consultants to implement and maintain its integrated financial database - one of the most conceptually powerful, even in his current assessment, to have emerged. At CODA Bill deployed financial databases and business intelligence systems for many global clients. Working with SQL Server, Oracle, Sybase and Informix, and focusing on MSSQL Server, Bill created Island Technologies Inc. in 1997, and has developed a large and diverse customer base over the years since. Bill's background as a CPA, Internal Auditor and Management Accountant enables him to provide value to clients as a liaison between Accounting/Finance and Information Services. Moreover, as a Certified Information Technology Professional (CITP) - a Certified Public Accountant recognized for his or her unique ability to provide business insight by leveraging knowledge of information relationships and supporting technologies - Bill offers his clients the CPA's perspective and ability to understand the complicated business implications and risks associated with technology. From this perspective, he helps them to effectively manage information while ensuring the data's reliability, security, accessibility and relevance. Bill has implemented enterprise business intelligence systems over the years for many Fortune 500 companies, focusing his practice (since the advent of MSSQL Server 2000) upon the integrated Microsoft business intelligence solution.
He leverages his years of experience with other enterprise OLAP and reporting applications (Cognos, Business Objects, Crystal, and others) in regular conversions of these once-dominant applications to the Microsoft BI stack. Bill believes it is easier to teach technical skills to people with non-technical training than vice-versa, and he constantly seeks ways to graft new technology into the Accounting and Finance arenas. Bill was awarded Microsoft SQL Server MVP in 2009. Hobbies include advanced literature studies and occasional lectures, with recent concentration upon the works of William Faulkner, Henry James, Marcel Proust, James Joyce, Honoré de Balzac, and Charles Dickens. Other long-time interests have included the exploration of generative music sourced from database architecture.
