SHARE

Keeping The Data Warehouse Clean with a Data Quality Program

Written By

Aug 24, 2010

7 minute read

Every day, information is delivered within a business intelligence solution to facilitate decision making specific to business functions. Having confidence that the information being generated came through a rigorous data quality program is imperative.

With every enterprise data warehousing or data management initiative, there
are certain work streams or projects within each of these programs that are
quite common. They are the solutions architecture, the data dictionary build,
the enterprise data model, metadata management and data quality management.

Each one of these work efforts has a related scope with details that enables
a robust and comprehensive enterprise data warehouse and business intelligence
framework and solution to be built.

However, with data quality, the ripple effects are far reaching. How far
reaching? Think on this scenario; every day information is delivered within a
business intelligence solution to facilitate decision making specific to
business functions that can impact an organization in the marketplace, its
reputation and ability to provide answers to regulatory audits at the local and
federal government levels.

It is therefore extremely important that accurate information be provided.
So having trust and confidence in that the information being generated came
through a rigorous data quality program is huge! Conversely, not having that
faith in the data can put organizations in a tailspin trying to validate
information being delivered as legitimate. How much time and resources is being
spent here? Quite a bit! I lived through a number of those pressure filled war
rooms! Trust me; I did not look forward to going to work during those days.

Purpose of Data Quality Management

A good way to have a clear definition of data quality management is to
understand its purpose. It’s knowing what the goal is.

It is primarily the creation of an overall implementation strategy to
improve data quality with a focus on data conformance.

It is an opportunity to improve the quality of the data going into the
enterprise data warehouse and business intelligence applications.

It facilitates the integrating of data sources into a consolidated view of
the enterprise data warehouse, such that all silo-ed applications with its own
data rules now has to conform to a single version of the truth; all data rules
now must integrate and conform to one set of rules for data of specific types
and formats.

It also empowers the business areas as the data owners, through establishing
stewardship roles on the data governance board. These new roles become part of
the cultural shift of the business taking on the responsibility and
accountability of the quality of the data.

The Data Quality Lifecycle

Once the purpose of data quality management has been clearly defined, the
next step is to establish the data quality lifecycle with a specific focus on
collaboration with IT and business areas. It’s a long and extremely iterative
process, with the notion that with each iteration the number of anomalies and
errors generated are reduced to the point where the goal has been met.

So it’s best to establish long and short term goals to add value sooner
rather than later. This ensures stakeholder buy-in such that the long-term
initiative can continue with the approval and commitment of resources from
senior level management.

To create the data quality lifecycle, a team must first be assembled. The
team should include the subject matter experts from the business area (evolving
into the role of the data stewards) and experienced IT personnel such as the
ETL architect and developer, database administrator and project manager.

A great benefit to the data quality lifecycle is that it enables both the IT
and business staff to work as a cohesive unit in order to accomplish the
objective: Clean data in the enterprise data warehouse!

With the team created, the next set of activities is specific to
establishing the major components of the data quality lifecycle.

These components must include the following:

Information Gathering

Defining the requirements for data quality provides the framework for the
entire effort. It is during this phase that the entire team meets in work
sessions to establish thresholds for acceptable data quality. This phase also
defines the mandatory activities that must be completed in order for the data
quality effort to be successful. These activities include reviews of documented
business functions and/or use cases; identification of candidate data sources;
methods for handling rejected data; classification of data elements as
mandatory and optional; metrics to measure data quality and related progress.
The great part about this phase is the genesis of the business subject matter
experts morphing into the roles of the data stewards!

Data Assessments

It is in this phase that the team really gets to assess the quality of the
data that will be used to source the business intelligence applications. Using
data profiling tools as part of the ETL software suite, the team can perform
detail data analysis to gain a really good understanding of the condition of
the data. It establishes the baseline for the data quality program as well as
provides a rough estimate as to the size of the work effort. It’s the dose of
reality that the team needs to understand exactly the raw materials that they
have to work with and the amount of refinement and enrichment that the source
data will have to go through in order to create the gold copy!

Design of data quality rules

Once a complete assessment and data analysis has been completed, the team
can now start to work on defining the rules to standardize, cleanse and
transform the source data. These rules should address the majority of issues
that the incoming source data will have.

It is during this activity that one of the first tests of how well the team
works together is conducted. The business subject matter experts are the
primary authors of the rules and standards to be used. However, these rules
must be translated into a set of data quality processes that does exactly what
the rules state, so the ETL architects and other IT staff on the project team
must work closely with the data stewards to design exactly what the rules and
standards have stated.

The rules are then translated and incorporated into detail design
specifications for the ETL processes, including all the rules for handling
common occurrences of data elements containing nulls and incorrect values and
error handling for any anomalies that may be encountered.

Execution of data quality processes

The execution of the data quality processes is all server bound and there is
minimal interaction with the team at this point. It is during this phase that
time should be taken to complete the documentation required by the project
communications management to add a comprehensive set of assets to the project
library for use by other initiatives as well as creating an historical record.

It is also important to note that during this phase, depending on the
business requirements, that the data can be assigned a codified tag to
establish traceability back to the source.

Review of the Results

The review of the results includes a number of activities. The first thing
the team should do is compare the results to the initial baseline and the goal
of the data quality program as stated in the Information Gathering phase of the
project. The team should be able to articulate the improvements made in the
quality of the data and whether the objective has been accomplished.

Depending on this review, there may be further refinements of the data
quality rule sets and subsequent executions of the data quality processes until
the goal is accomplished. Yes, it is a very iterative process. This is why it
is so important to establish the scope and goal of the effort such that the
team knows exactly when the work is done. There is no waste of resources trying
to achieve 100% data quality in the enterprise data warehouse and business
intelligence solution when the goal is 90%.

It is also important to note that in the process of reviewing these results,
the team should provide feedback to the source applications to possibly correct
data errors that have surfaced during the cleansing and standardization process
that could resolve issues and improve a business function. It’s worth the
effort and it goes back to the earlier discussion of accomplishing the
short-term goals to improve the current environment, maintain resource
commitment and political capital with the senior management.

The Data Quality Program is a team effort…it really is!

A successful data quality program is based more on the cohesion of a team and
less on technology and the latest, cutting edge software. It takes people
working together towards a clearly defined goal. It takes effective
communication, comprehensive knowledge of the subject matter and senior
management commitment.

It’s hard work and long hours. However, the trust and confidence that the
information delivered through the enterprise data warehouse and the downstream
business intelligence solutions makes it extremely worthwhile and elevates data
as a very valuable asset to any organization!

Additional Resources

Melissa Data Helps Developers Improve the Quality of Business Data
Garbage Time: How to Improve the Quality of Business Data

»

See All Articles by Columnist
Denise Rogers

Denise Rogers

enise is a data architect with several years experience in the deployment of data architectures at major healthcare insurance companies and state government. She is a certified PMP that has designed and deployed a number of data solutions ranging from transactional to decision support within various architectural and project management frameworks. She has also spearheaded a number of efforts related to database environment assessments, capacity planning and performance tuning. In the past, Denise has held several user group positions including participation in International DB2 User Group (IDUG) and internal architectural groups. She has presented solutions to division heads at the within state government as well as conducted a number of company related training and information sharing sessions on database performance tuning techniques, best practices, etc . She has also mentored and coached project team members at various levels of expertise including university recruits, business users and senior IT staff. Denise graduated from Greater Hartford Community College Cum Laude in 1983 with an Associate’s degree in Management Information Systems.