The staging area tends to be one of the more overlooked components of a data warehouse architecture, and yet it is an integral part of the ETL component design. Learn why it is best to design the staging layer right the first time, enabling support of various ETL processes and related methodology, recoverability and scalability.
In any data warehousing
initiative, there are several common components to the architecture. There are
the data sources and targets, ETL framework, infrastructure, application layer
and the data staging area.
The staging area, in my
experience has to be one of the more overlooked and underestimated components
of a data warehouse architecture. I think mostly this is due to a lack of
understanding as to what exactly it is.
If a quick search is made
through a number of websites, many definitions will include the fact the data
staging area is simply a temporary workspace used to transform and enrich data
before it flows into the operational data store (ODS) and the data warehouse.
This is a good fundamental
definition of the data staging area. However, it is so much more. How much more
do you ask? Well in reality, the data staging area is an information hub that
facilitates the enriching stages that data goes through in order to populate an
ODS and/or data warehouse. It is the essential ingredient in the development of
an approach and/or methodology for creating a comprehensive data-centric
solution for any data warehousing project.
If we really think about
this, the data staging area is an integral part of the ETL component design and
is the foundation for the ETL architecture.
The Design of the Information Hub
The data staging area has
been labeled appropriately and with good reason. With any data warehousing
effort, we all know that data will be transformed and consolidated from any
number of disparate and heterogeneous sources.
However, the design of a
robust and scalable information hub is framed and scoped out by functional and
non-functional requirements. Examples of some of these requirements include
items such as the following:
-
The amount of raw source data to retain after it has been
processed through the ETL data lifecycle
-
The type of server(s) to house the staging area will be dedicated
or shared with other applications and environments (dedicated servers are a
proven way to go)
-
The acceptable levels of data quality, related baselines and
metrics as stated by the Data Governance Board
-
Decisions on the data sources that will be federated in and the
ones that will be a copy of the sources
-
The management of metadata as data sources are brought into the
landing zone of the staging area
-
The level of security and roles defined for each of the areas with
the staging environment
-
The masking/scrambling of sensitive data within staging areas
-
The identification of recoverable artifacts in the event of
disasters, etc.
With these types of
requirements, rules and decisions, a scalable and secured framework is firmly
in place to facilitate the defined ETL methodology. These data sources go
through a number of evolutionary stages in order to build a robust and
comprehensive data warehouse and/or ODS. Moreover, as great data architects
that we are, we know that these stages must include the following.
Data Acquisition
This process includes
landing the data physically or logically in order to initiate the ETL
processing lifecycle. The staging area here could include a series of
sequential files, relational or federated data objects. However, the design of
intake area or landing zone must enable the subsequent ETL processes, as well
as provide direct links and/or integrating points to the metadata repository so
that appropriate entries can be made for all data sources landing in the intake
area.
Data Profiling
Data profiling is the
surveying of the source data landscape to gain an understanding of the
condition of the data sources. In most profiling efforts, this means generating
various reports with any number of metrics, statistics, and counts that reflect
the quality of the source data coming in.
Data Cleansing
Data cleansing is an
iterative set of processes that starts and ends with the business rules and
standards around acceptable data quality levels from the Data Governance Board
(e.g. 95% of the data meets the quality standards). ). This includes investigative
jobs to provide additional detail in detecting data patterns and
design alternatives for quality enforcement at the attribute, record and
aggregate levels and data correction jobs
to fill in missing or incomplete data and correct data values. There is also
the analysis of reports based on the findings and results of the investigation
and data correction jobs to determine if further refinements and/or
modifications are to be made.
Data Standardization and Matching
Data
standardization and matching is a set of processes that primarily consists of
the design and execution, standardizing jobs to create uniformity around
specific mandatory data elements. This includes the design
and execution of matching and de-duplicating jobs to eliminate duplicate data
and create a single version of the truth. It also includes the analysis of
reports related to errors and/or exceptions and determines if further
refinements or modifications are to be made (if required) and to assess the
readiness for data delivery to the data warehouse and ODS.
Data Transformation
Transforming data
essentially means converting data to conform to a standard established by the
Data Governance Board. Examples of data transformations include converting
nulls to specific values, gender codes that are disparate to a common set of
values or even merging multiple source fields to one data element.
Data Loading
Depending on business
requirements, the loading phase can include a total data refresh of the target
component or adding new data to the data component in a historical manner.
Loading to a staged copy of the target component enables a series of validation
exercises. This includes verification of referential integrity, data quality
and transformation rules prior to the actual data population of the DW and/or
ODS.
Design and Construction
The creation of a staging
area will usually start with the typical activities of the design of any data
environment. Tasks such as server configuration, alignment of file systems,
creating the database instances and related database objects are common
elements in the design of any infrastructure dedicate to a data environment.
However, there a number of
unique tasks that need to be completed to align the staging area to the ETL
methodology discussed in prior sections of this article.
For starters, the data
architect and the DBA will need to create separate environments for each stage
that the data goes through. This means separate database and file systems that
are dedicated to the stage that the ETL lifecycle is in.
For example, a dedicated
database instance and related file systems should be created for the data
acquisition and profiling stages. The tasks included in these stages are the
reading of every data element and record in order to generate detailed
statistical information on the source data. This means that processes involved
in the profiling effort will be using tremendous amounts of resources related
to memory and CPU and should be segregated so that other workloads are not
adversely impacted. The design of the database instance must take into
consideration the fact that with the use of federated data, there may be
implications at the database level that will cause ripple effects on the other
data objects within the database instance. Also the file systems allocated to
the containers that the database uses should be separate from the file systems
used in the data acquisition process so that there are no I/O bottleneck
issues.
Then there is the SECURITY
component! This is live production data that has highly sensitive information.
This data cannot be masked and/or scrambled as this defeats the whole purpose
of the ETL process to stage data into the data warehouse or ODS. The raw data
must be exposed in order for the ETL to be as effective in integrating,
cleansing and standardizing all data from all sources. Therefore, having a
robust security framework is an essential ingredient in this configuration.
Typically, the data steward and an appointed business analyst should be among
the chosen few that have access to some of the sensitive data elements. The ETL
developer, DBA and system administrator does not need to see any of it. There
is also the prevention of copying data. No one should be allowed to make copies
of anything for any purpose. The information hub should be able to satisfy all
requests for data access for analysis in a robustly secured environment.
The Information Hub Experience - Tales from the Data Layer
I was assigned to the
first data warehouse project at a major healthcare company. It was our first
time working with an ETL solution and all that comes with it. We successfully
installed the toolset, created the protocols to pull in the data sources and
target data warehouse components. However, it was an extremely painfully
project. Why? Because whenever the ETL processes aborted or there were hardware
failures, there were no clean ways to restart anything! The staging layer was
the sum total of several file systems allocated for ETL usage and not much else
was in place at the staging area level. In other words, we built a flimsy
foundation for the ETL component and we paid dearly for it!
At another time, having
grown from that experience, I worked at another client site as part of a team
to design and construct a data warehouse environment complete with an ETL
solution, etc. This time, I knew I would get it right! I created an information
hub that had file systems and a database, tables and views. This database had
federated objects and every kind of bell and whistle you could think of. Except
that during the data profiling process of the federated objects, the process
ran out of temporary space at the source application and aborted. The error
message generated was that the database is corrupt and all is lost. Talk about
the panic! I had that look in my eyes! Everything ground to a screeching halt
while I completed the database recovery.
The lessons here are to
design the staging layer to enable support of various ETL processing and
related methodology, recoverability and scalability.
A well-designed staging
area should enable the ETL approach, processes and services and the
facilitation of the data management activities with business analysts, data
stewards, (validation of business rules) profiling reports, quality reports and
successfully stage the data required to populate the data warehouse and the
operational data store. Failure to do that will lead to many sleepless nights,
days spent in war rooms and putting the data warehouse project in jeopardy of
not meeting milestones and deadlines. I have been on both sides and not being a
big fan of the war rooms, I now know better. You should too!
Additional Resources
TDWI Staging Area: Critical Component or Overkill?
Oracle Docs Data Warehouse Architecture (with a Staging Area)
Bi-BestPractices.com Four Ways to Build a Data Warehouse
»
See All Articles by Columnist
Denise Rogers