Business Intelligence : Your Data Storage

POS Applications, HR Applications, Customer Survey results; these
are just some of the myriad sources of data that we are, as database
administrators or developers responsible for – how do we store that data so
that a Business Intelligence system can interrogate and provide meaningful
results to the user.

Business Intelligence – Data Storage

POS Applications, HR
Applications, Customer Survey results; these are just some of the myriad
sources of data that we are, as database administrators or developers
responsible for and that a business intelligence system can consume. Within
Business Intelligence systems, these sources of data are usually encompassed in
to two main types of storage systems to provide historical, current, and
predictive views of business operations.

The Data Warehouse

In a repository of an
organization’s electronically stored data, data warehouses are designed to
facilitate reporting and analysis. The need for a data warehouse is driven by
an organization’s need for reliable, consolidated, unique and integrated
reporting and analysis of its data, at different levels of aggregation. The
practical reality of most organizations is that their data infrastructure is made
up by a collection of heterogeneous systems. For example, an organization might
have one system that handles service quality, one that handles employees, and
others that handle sales data or production data. In practice, these systems
are often poorly or not at all integrated and simple questions like "How
many customers are complaining about branch A and have a credit account and
which employees are being targeted" can be very hard to answer, even
though the information is available "somewhere" in the different data
systems.

It is partly the purpose of data
warehousing to bridge such problems but also to make data appear consistent,
integrated and consolidated despite the problems in the underlying source
systems. The data warehouse achieves this by employing techniques, creating new
data repositories (i.e. the data warehouse) whose data model(s) support the
needed reporting and analysis.

Data Mart

A data mart should be viewed as
a subset of an organizational data store, usually oriented to a specific purpose
or major data subject that may be distributed to support business needs. Data
marts are analytical data stores designed to focus on specific business
functions for a specific community within an organization. Data marts are often
derived from subsets of data in a data warehouse, though in the bottom-up data
warehouse design methodology the data warehouse is created from the union of
organizational data marts.

Reasons for creating a data mart

  • Access
    to frequently needed data
  • Creates
    collective view by a group of users
  • End-user
    response time
  • Ease of
    creation
  • Lower
    cost than implementing a full Data warehouse
  • Users
    are more clearly defined than in a full Data warehouse

The Real World

In practice, the terms data
mart and data warehouse each tend to imply the presence of the other in some
form. However, most writers using the term seem to agree that the design of a
data mart tends to start from an analysis of user needs and that a data
warehouse tends to start from an analysis of what data already exists and how
it can be collected in such a way that the data can later be used. A data
warehouse is a central aggregation of data (which can be distributed
physically); a data mart is a data repository that may or may not derive from a
data warehouse and that emphasizes ease of access and usability for a
particular designed purpose. In general, a data warehouse tends to be a
strategic but somewhat unfinished concept; a data mart tends to be tactical and
aimed at meeting an immediate need.

One writer, Marc Demarest,
suggests combining the ideas into a Universal Data Architecture (UDA). In
practice, many products and companies offering data warehouse services also
tend to offer data mart capabilities or services.

Other emerging technologies are
being combined into new tools and software, which enables dynamic querying of
real time data utilizing SOA to develop composable and adaptive middleware –
there is a drive to name this type of BI as Business Intelligence 2.0.
This renaming is has been attributed by Neil Raden as an attempt to imply a
move away from standard storage systems (Data Warehouse and Data Mart) that
current systems employ.

Design schemas

The Star

The star
schema
(sometimes referenced as star join schema) is the simplest style
of data warehouse schema. The star schema consists of a few fact tables
(possibly only one, justifying the name) referencing any number of dimension
tables. The star schema is considered an important special case of the
snowflake schema.

Model

Dimension tables have a simple
primary key, while fact tables have a set of foreign keys, which make up a
compound primary key consisting of a combination of relevant dimension keys. It
is common for dimension tables to consolidate redundant data in the most
granular column, and is rendered in second normal form. Fact tables are usually
in third normal form because all data depends on either one dimension or all of
them, not on combinations of a few dimensions. The star schema is a way to
implement multi-dimensional database (MDDB) functionality using a mainstream
relational database: given the typical commitment to relational databases of
most organizations, a specialized multidimensional DBMS is likely to be both
expensive and inconvenient. The facts that the data warehouse helps analyze are
classified along different dimensions: the fact tables hold the main data,
while the usually smaller dimension tables describe each value of a dimension
and can be joined to fact tables as needed. Another reason for using a star
schema is its simplicity from the users’ point of view: queries are never
complex because the only joins and conditions involve a fact table and a single
level of dimension tables, without the indirect dependencies to other tables
that are possible in a better normalized snowflake schema.

Business Intelligence, the
primary consumer of Star Schema data, is also best expressed in Business
English, not programmer’s dialect. The aggregate navigators (OLAP) tools that
are common in the industry do not need to subsequently rename the elements,
because they are already in proper business English. Most SQL database engines
allow schemata descriptors, and also permit decoration suffixes on surrogate
keys columns. Using square brackets, which are physically easier to type on the
keyboard (no shift key needed) are not intrusive and make the code easier to
read.

For example, the following
query extracts how many SUVs have been sold, for each brand and country, in
1997.

SELECT Brand, Country, SUM ([Units Sold])
FROM Fact.Sales (NOLOCK)
 
 JOIN Dim.Date (NOLOCK)
 ON Date_FK = Date_PK
 
 JOIN Dim.Store (NOLOCK)
 ON Store_FK = Store_PK
 
 JOIN Dim.Product (NOLOCK)
 ON Product_FK = Product_PK
 
WHERE [Year] = 1997
AND [Product Category] = 'SUV'
GROUP BY Brand, Country

The Snowflake

A snowflake
schema
is a logical arrangement of tables in a multidimensional database
such that the entity relationship diagram resembles a snowflake in shape.
Closely related to the star schema, the snowflake schema is represented by
centralized fact tables, which are connected to multiple dimensions. In the
snowflake schema, however, dimensions are normalized into multiple related
tables whereas the star schema’s dimensions are denormalized with each
dimension being represented by a single table. When the dimensions of a snowflake
schema are elaborate, have multiple levels of relationships, and where child
tables have multiple parent tables ("forks in the road"), a complex
snowflake shape starts to emerge. The "snowflaking" effect only
affects the dimension tables and not the fact tables.

Data normalization and storage

Normalization splits up data to
avoid redundancy (duplication) by moving commonly repeating groups of data into
a new table. Normalization therefore tends to increase the number of tables
that need to be joined in order to perform a given query, but reduces the space
required to hold the data and the number of places where it needs to be updated
if the data changes. From a space storage point of view, the dimensional tables
are typically small compared to the fact tables. This often removes the storage
space benefit of snowflaking the dimension tables, as compared with a star
schema. Some database developers compromise by creating an underlying snowflake
schema with views built on top of it that perform many of the necessary joins
to simulate a star schema. This provides the storage benefits achieved through
the normalization of dimensions with the ease of querying that the star schema
provides. The tradeoff is that requiring the server to perform the underlying joins
automatically can result in a performance hit when querying as well as extra
joins to tables that may not be necessary to fulfill certain queries.

Benefits of "snowflaking"

  • Some
    OLAP multidimensional database modeling tools that use dimensional data marts
    as a data source are optimized for snowflake schemas.
  • If a
    dimension is very sparse, (i.e. most of the possible values for the dimension
    have no data) and/or a dimension has a very long list of attributes, which may
    be used in a query, the dimension table may occupy a significant proportion of
    the database and snowflaking may be appropriate.
  • A
    multidimensional view is sometimes added to an existing transactional database
    to aid reporting. In this case, the tables, which describe the dimensions, will
    already exist and will typically be normalized. A snowflake schema will
    therefore be easier to implement.
  • A
    snowflake schema can sometimes reflect the way in which users think about data.
    Users may prefer to generate queries using a star schema in some cases, although
    this may or may not be reflected in the underlying organization of the
    database.
  • Some
    users may wish to submit queries to the database which, when using conventional
    multidimensional reporting tools, cannot be expressed within a simple star
    schema. This is particularly common in data mining of customer databases, where
    a common requirement is to locate common factors between customers who bought
    products meeting complex criteria. Some snowflaking would typically be required
    to permit simple query tools to form such a query, especially if provision for
    these forms of query weren’t anticipated when the data warehouse was first
    designed.

Which schema to use

Your decision whether to employ
a star schema or a snowflake schema should consider the relative strengths of
the database platform in question and the query tool to be employed. The Star
schema should be favored with query tools that largely expose users to the
underlying table structures and in environments where most queries are simpler
in nature. The Snowflake schema is often better with more sophisticated query
tools that isolate users from the raw table structures and for environments
having numerous queries with complex criteria.

Additional Resources

Data Storage Index

Microsoft Data Storage and Management

MySQL MySQL in Data Warehousing & Business Intelligence

»


See All Articles by Columnist

Peter Evans

Peter Evans
Peter Evans
Peter Evans, a Business Intelligence and Data Warehousing Expert, Targit Certified Professional and industry recognized independent consultant specializing in delivery of applications utilizing primarily but not exclusively Microsoft technologies and in delivery of solutions to non standard cases. He enjoys explaining the methods e has employed in over sixteen years industry experience including work for major corporation and government clients.

Latest Articles