Keeping The Data Warehouse Clean with a Data Quality Program
August 24, 2010
Every day, information is delivered within a business intelligence solution to facilitate decision making specific to business functions. Having confidence that the information being generated came through a rigorous data quality program is imperative.
With every enterprise data warehousing or data management initiative, there are certain work streams or projects within each of these programs that are quite common. They are the solutions architecture, the data dictionary build, the enterprise data model, metadata management and data quality management.
Each one of these work efforts has a related scope with details that enables a robust and comprehensive enterprise data warehouse and business intelligence framework and solution to be built.
However, with data quality, the ripple effects are far reaching. How far reaching? Think on this scenario; every day information is delivered within a business intelligence solution to facilitate decision making specific to business functions that can impact an organization in the marketplace, its reputation and ability to provide answers to regulatory audits at the local and federal government levels.
It is therefore extremely important that accurate information be provided. So having trust and confidence in that the information being generated came through a rigorous data quality program is huge! Conversely, not having that faith in the data can put organizations in a tailspin trying to validate information being delivered as legitimate. How much time and resources is being spent here? Quite a bit! I lived through a number of those pressure filled war rooms! Trust me; I did not look forward to going to work during those days.
Purpose of Data Quality Management
A good way to have a clear definition of data quality management is to understand its purpose. It's knowing what the goal is.
It is primarily the creation of an overall implementation strategy to improve data quality with a focus on data conformance.
It is an opportunity to improve the quality of the data going into the enterprise data warehouse and business intelligence applications.
It facilitates the integrating of data sources into a consolidated view of the enterprise data warehouse, such that all silo-ed applications with its own data rules now has to conform to a single version of the truth; all data rules now must integrate and conform to one set of rules for data of specific types and formats.
It also empowers the business areas as the data owners, through establishing stewardship roles on the data governance board. These new roles become part of the cultural shift of the business taking on the responsibility and accountability of the quality of the data.
The Data Quality Lifecycle
Once the purpose of data quality management has been clearly defined, the next step is to establish the data quality lifecycle with a specific focus on collaboration with IT and business areas. It's a long and extremely iterative process, with the notion that with each iteration the number of anomalies and errors generated are reduced to the point where the goal has been met.
So it's best to establish long and short term goals to add value sooner rather than later. This ensures stakeholder buy-in such that the long-term initiative can continue with the approval and commitment of resources from senior level management.
To create the data quality lifecycle, a team must first be assembled. The team should include the subject matter experts from the business area (evolving into the role of the data stewards) and experienced IT personnel such as the ETL architect and developer, database administrator and project manager.
A great benefit to the data quality lifecycle is that it enables both the IT and business staff to work as a cohesive unit in order to accomplish the objective: Clean data in the enterprise data warehouse!
With the team created, the next set of activities is specific to establishing the major components of the data quality lifecycle.
These components must include the following:
Defining the requirements for data quality provides the framework for the entire effort. It is during this phase that the entire team meets in work sessions to establish thresholds for acceptable data quality. This phase also defines the mandatory activities that must be completed in order for the data quality effort to be successful. These activities include reviews of documented business functions and/or use cases; identification of candidate data sources; methods for handling rejected data; classification of data elements as mandatory and optional; metrics to measure data quality and related progress. The great part about this phase is the genesis of the business subject matter experts morphing into the roles of the data stewards!
It is in this phase that the team really gets to assess the quality of the data that will be used to source the business intelligence applications. Using data profiling tools as part of the ETL software suite, the team can perform detail data analysis to gain a really good understanding of the condition of the data. It establishes the baseline for the data quality program as well as provides a rough estimate as to the size of the work effort. It's the dose of reality that the team needs to understand exactly the raw materials that they have to work with and the amount of refinement and enrichment that the source data will have to go through in order to create the gold copy!
Design of data quality rules
Once a complete assessment and data analysis has been completed, the team can now start to work on defining the rules to standardize, cleanse and transform the source data. These rules should address the majority of issues that the incoming source data will have.
It is during this activity that one of the first tests of how well the team works together is conducted. The business subject matter experts are the primary authors of the rules and standards to be used. However, these rules must be translated into a set of data quality processes that does exactly what the rules state, so the ETL architects and other IT staff on the project team must work closely with the data stewards to design exactly what the rules and standards have stated.
The rules are then translated and incorporated into detail design specifications for the ETL processes, including all the rules for handling common occurrences of data elements containing nulls and incorrect values and error handling for any anomalies that may be encountered.
Execution of data quality processes
The execution of the data quality processes is all server bound and there is minimal interaction with the team at this point. It is during this phase that time should be taken to complete the documentation required by the project communications management to add a comprehensive set of assets to the project library for use by other initiatives as well as creating an historical record.
It is also important to note that during this phase, depending on the business requirements, that the data can be assigned a codified tag to establish traceability back to the source.
Review of the Results
The review of the results includes a number of activities. The first thing the team should do is compare the results to the initial baseline and the goal of the data quality program as stated in the Information Gathering phase of the project. The team should be able to articulate the improvements made in the quality of the data and whether the objective has been accomplished.
Depending on this review, there may be further refinements of the data quality rule sets and subsequent executions of the data quality processes until the goal is accomplished. Yes, it is a very iterative process. This is why it is so important to establish the scope and goal of the effort such that the team knows exactly when the work is done. There is no waste of resources trying to achieve 100% data quality in the enterprise data warehouse and business intelligence solution when the goal is 90%.
It is also important to note that in the process of reviewing these results, the team should provide feedback to the source applications to possibly correct data errors that have surfaced during the cleansing and standardization process that could resolve issues and improve a business function. It's worth the effort and it goes back to the earlier discussion of accomplishing the short-term goals to improve the current environment, maintain resource commitment and political capital with the senior management.
The Data Quality Program is a team effort…it really is!
A successful data quality program is based more on the cohesion of a team and less on technology and the latest, cutting edge software. It takes people working together towards a clearly defined goal. It takes effective communication, comprehensive knowledge of the subject matter and senior management commitment.
It's hard work and long hours. However, the trust and confidence that the information delivered through the enterprise data warehouse and the downstream business intelligence solutions makes it extremely worthwhile and elevates data as a very valuable asset to any organization!