A data mart provides the primary access to the data stored in the data warehouse or operational data store. It is a subset of data sourced from the data warehouse or operational data store specifically focused on a business function or set of related business functions. Read on to learn the answers to fundamental questions about data marts.
We did it! We built the data warehouse, we created the ETL framework, we identified the data sources, we even created the metadata repository and the ODS; so now we have received a request to build a data mart. The business users want to be able to access data in this great data warehouse. They spent a significant amount of time, resources and patience to build a data warehouse environment that would stand the test of time and they want to be able to get their money’s worth.
So how do we start? Better yet, where do we start? I think a good place to start is to understand what a data mart is. So what is a data mart? Well, through the business intelligence framework, a data mart provides the primary access to the data stored in the data warehouse or operational data store. It is a subset of data sourced from the data warehouse or operational data store specifically focused on a business function or set of related business functions. The rest of this article will provide answers to fundamental questions about data marts.
What is the purpose of a data mart?
The fact is that the data warehouse is a repository of centralized, cleansed, standardized, transformed and integrated historical data. It is the organizational view of its business from a data perspective. However, it is an extremely difficult task to access or report on data within a data warehouse. The data in a data warehouse is organized by subject area; it’s historical and represents the organization as a whole. In the same way the operational data store (ODS) has the characteristics of the data warehouse with the most important exception that the data in an ODS is NON-HISTORICAL. Think of what that set of SQL statements would look like to accomplish this task! Lots of complex joins! Performance degradation related to the volume of data that must be read to satisfy the criteria in the SQL statements! Phew.
That’s where the data mart serves a purpose. It provides the business unit with easy access to data in the warehouse or the operational data store. Why? Because the data in the data mart is specifically tailored to the business unit. It is organized based on specific business requirements. The primary purpose for its existence is to support data dependent business functions related to reporting, trending, analysis and presentation.
Are there types of data marts?
Data marts are created to support the needs and requirements of the business unit. These requirements are the driving force behind the type of data mart built. So let’s take a look at some of the common and fundamental business requirements and the type of data mart that works well with the solution.
The most common of data mart types is the star schema data mart. A star schema is a fact table and a set of dimension tables and the relationships between the two sets. What is a fact table? A fact table is a set of metrics or measures generated that relates to a specific business process or event. An example of a fact table is “Claims Paid Large Amount for Marine Craft”. This fact table is exactly what it says; it contains claims paid in certain amounts for marine policies on marine craft. A dimension table is a set of elements that describe the event of metrics in the fact table. A dimension table contains the attributes that places the metrics in the fact table into a particular context. An example of a dimension table is Accident State. Star schema data marts are simple and very cost effective to deploy and provide multidimensional functionality within a relational database. Its design lends itself best to trend and historical reporting and is supported by most DBMSs including SQL Server, UDB and Oracle.
A snowflake schema data mart is a star schema with dimension tables that have been normalized. This type of data mart works best in situations where dimension tables are too large (most of the columns in the dimension table are populated with data values) or too sparse (few data values in columns of the dimension tables. A snowflake schema will have the fact table; however, the dimension tables may have parent tables that they are related to. This design enables the efficient use of storage, as there is a significant reduction in data redundancy. However, maintenance is more complicated as there are more tables to populate and synchronize. A snowflake schema works well where the dimensions are not clearly defined.
A data or OLAP cube is a very interesting solution. It is a multidimensional database that is specifically architected to support all flavors of reporting including analytics, trending and historical reporting. It is organized into measures (facts) and dimensions. However, this type of database is optimized to report on data from different views and perspectives. It provides the ability to slice and dice, drill through to atomic detail and discover patterns in data. A data or OLAP cube is best suited for solutions where efficient reporting with complex business requirements is mandatory. It works best with sophisticated business intelligence tools that possess features and functionality that enables reporting from cubes. Software solutions such as Business Objects, Cognos, Hyperion and SQL Server Analysis Services (SSAS) are perfect examples here. However, the design, maintenance and processing of the cube should not be under estimated. The cube build itself is quite resource intensive, therefore requires robust hardware and software designed for this type of workload. Trust me on this one. I have worked on reporting applications where the cube build process would fail at least once a week! Why? For a couple of reasons, the first being the design of the some of the measures and the capacity of the server, including limited memory and processing power.
Are there Design Rules for Data Marts?
In terms of design rules for an OLAP cube, a set of fundamentals hold true. Here are a few that I would like to share.
Based on the business requirements, the team, through careful analysis, must identify the sources that will be used to build the cube. Ideally the subject area from a data warehouse or ODS should be sufficient as it contains the entire organization’s data assets.
Design the dimensions carefully and appropriately. They are an extremely important part of the data mart solution as they provide the context for the metrics in the fact table. Use conformed dimensions wherever possible. A conformed dimension is an entity where its definition is shared among business units. An example of this is geographic region dimension. There is only one set of states in the United States of America; this should be a common or shared definition of geography.
Lastly, remember, “Just say no to nulls”. A data mart is a subset of data from a data warehouse. This data is a recording of events that occurred from sets of business transactions. These transactions all have valid data values that are something other than nulls, zeroes or spaces. So therefore, the data mart should be a reflection of that.
One Last Thing
A data mart is the mechanism through which the business community can access and view the data that has been transformed and enriched from a number of transactional systems across the enterprise into a holistic view of the organization’s data assets. Care must be taken in designing an appropriate solution that aligns to business requirements and the strategic direction of the organization at large.