Operationally living in harmony with data warehouses
December 18, 2003
by Duncan Pauly
Burgeoning data volumes continue to place organisations' IT departments under greater strain than ever as they look to harness data effectively while overcoming a minefield of performance issues. The importance of the data warehouse in attaining the maximum benefit from data has grown exponentially, with its promise to provide an enterprise-wide view of business activities and increase the company's profitability through intelligent data handling. Increased sales, more effective and intelligent marketing, enhanced customer services, and streamlined business processes - the data warehouse is regarded in high esteem by organisations as being capable of paving the way towards the attainment of these business benefits.
However, ever since the building of data warehouses began and the term was first coined by industry heavyweight Bill Inmon, the perceived wisdom has been to maintain a separation between the data warehouse and the company's operational systems. Today, while this 'separatist' thinking still dominates among many data warehousing professionals, the approach is being questioned. A school of thought is emerging that challenges the division of the data warehouse from operational systems, citing the fact that the justification for separation is purely technical - as opposed to being a divide founded on the drivers of the business. The evolving wisdom, particularly acknowledging the continuous march of technological progress, is to consider alternatives to the implementation of a traditional warehouse solution, where operational systems are capable of living in harmony with the data warehouse.
Why accept a compromise?
Businesses want data warehouses to provide a complete and immediate understanding of the enterprise, offering the capability to react quickly to the marketing place and to out-manoeuvre their competitors. Business drivers include, for example, increased revenues through more effective marketing and cross selling to the existing customer base, again based on better understanding of customer activity and profiles. Similarly, by identifying inefficiencies and areas of strength, cost reductions can be achieved, while revenues can be increased. The benefits, in short, more than pay for the implementation of the data warehouse.
Then there are the technical considerations that translate to business benefits for the organisation. Today, technology is such that separate systems - with the concomitant investment and ongoing operational costs - could be unnecessary. The same is true of compromise approaches, or "halfway houses," as discussed later, where an intermediate database is used to attempt to furnish intelligent data rapidly. Not only is this approach very much a poor cousin to the data warehouse, yielding inferior data results, but it also exposes the business to greater costs in terms of setting the solution up, maintaining it and then replacing it in the future.
Disharmony and differing demands
The need to cater for disparate demands is why the data warehouse has traditionally been implemented separately from operational systems: they each have different profiles and make different demands on hardware and applications. Technically, therefore, the IT department has faced a range of conflicts between performance, operational and user requirements of the respective systems.
For example, because an operational system is usually built for a transaction processing workload, it needs to cater for multiple concurrent short-lived transactions, mixing queries with updates. The data warehouse, in contrast, supports a smaller user base and longer-lived queries. In detail, for instance, while the data warehouse typically benefits from a disk configuration optimised for high transfer rates, the operational system hardware needs to support a higher volume of individual random disk operations. While it is often a good thing that hardware and operating systems handle mixed workloads, using resources effectively, performance issues arise not with the hardware but from the hosted applications or the database. These are often in the form of contention with resources, or excessive consumption of resources, such as rollback images.
The invisible and the visible
Operational systems and data warehouse systems have different optimal schemas. The former aims to achieve performance and the maintenance of constant transactional integrity, so the schema is designed that way. Users won't be aware of the schema - it will be 'hidden', whereas the schema in a data warehouse is likely to be more visible to users. Why? Because for a data warehouse - in order to fully exploit its potential - the schema needs to be intuitive to users, offering the ability to undertake flexible queries without resorting to multi-way joins. Complex joins are difficult for users and often incur a big performance hit. In this instance, the usual is a de-normalised star schema built from a central fact table surrounded by dimension tables.
Choosing and tuning the indexes
There are issues with indexes too, as the systems require different ones. Query flexibility and the ability to perform index scans efficiently, as well as good selectivity from a combination of predicates in a where clause, means that a bit map index is useful in a data warehouse. However, for an operational system, poor concurrency of bit maps makes it unsuitable as contention and performance degradation is highly likely to occur, resulting from multiple sessions attempting to update and query the same bit map index.
As for hash clusters, they should not be used for full table scans, as often occurs with a data warehouse, because a hashed organised table is spread over more blocks than an equivalent heap organised table. Hash organised tables are, in short, for static data, while B-trees might be used more appropriately for both systems as they provide good concurrency and flexible queries. However, since they can become rapidly disk IO bound and can cause excessive database checkpoint activity, they must be used with care on heavily updated tables of significant size.
Another justification cited for separate systems is scheduling differences. The platform for the two systems is likely to be handled differently in terms of administration and upgrading. The availability requirements of an operational system may be strict, while those for the data warehouse are less so thus giving different scheduling cycles.
Implementation and potential pitfalls
Implementation of a data warehouse means facing several potential pitfalls. Aside from the technical issues, there are many commercial and people-related issues that typically arise with a major development project. From budget for the project to getting resource and skills, there are hurdles at every corner, including overcoming the politics associated with setting up a new area.
But technically too, there are many potential pitfalls. Consider integration with operational systems - a major issue to be addressed. Operational systems will load the data warehouse, necessitating integration at several levels. This is from the fundamental levels (such as network, hardware and software) to application areas - data representation, data semantics and data schema. Integration challenges must be faced up to with the application areas regardless of where the data warehouse is, but issues faced in relation to the network, hardware and software are actually highly likely to result from having chosen to separate the data warehouse from operational systems.
A costly 'half-way house'
Then there is the issue of the playoff between the much desired timeliness and completeness. For any data warehouse to be of benefit to the enterprise, it needs to satisfy both wide-ranging queries aimed at understanding long-term trends, and queries of up-to-the-minute data, which target the current short-term position of the business. It is difficult, if not impossible, to resolve these conflicting requirements within the constraints of the loading window.
This conflict has been difficult to resolve, and in order to do so, many organisations have turned to the typical solution of "the halfway house", as referenced earlier. This is the implementation of an intermediate database that provides the business with its volatile and intermediate summary business information but very much on a restricted query basis. Needless to say, to introduce such a solution introduces more complications to the technical solution - not to mention greater implementation and operational costs. If an operational system and a data warehousing system living in harmony on one server can avoid this one issue alone, apart from all the rest, surely it is worthy of consideration?