by Duncan Pauly
Burgeoning data
volumes continue to place organisations’ IT departments under greater strain
than ever as they look to harness data effectively while overcoming a minefield
of performance issues. The importance of the data warehouse in attaining the
maximum benefit from data has grown exponentially, with its promise to provide
an enterprise-wide view of business activities and increase the company’s
profitability through intelligent data handling. Increased sales, more
effective and intelligent marketing, enhanced customer services, and
streamlined business processes – the data warehouse is regarded in high esteem
by organisations as being capable of paving the way towards the attainment of
these business benefits.
However, ever
since the building of data warehouses began and the term was first coined by
industry heavyweight Bill Inmon, the perceived wisdom has been to maintain a
separation between the data warehouse and the company’s operational systems.
Today, while this ‘separatist’ thinking still dominates among many data
warehousing professionals, the approach is being questioned. A school of
thought is emerging that challenges the division of the data warehouse from
operational systems, citing the fact that the justification for separation is
purely technical – as opposed to being a divide founded on the drivers of the
business. The evolving wisdom, particularly acknowledging the continuous march
of technological progress, is to consider alternatives to the implementation of
a traditional warehouse solution, where operational systems are capable of
living in harmony with the data warehouse.
Why accept a compromise?
Businesses want
data warehouses to provide a complete and immediate understanding of the
enterprise, offering the capability to react quickly to the marketing place and
to out-manoeuvre their competitors. Business drivers include, for example,
increased revenues through more effective marketing and cross selling to the
existing customer base, again based on better understanding of customer
activity and profiles. Similarly, by identifying inefficiencies and areas of
strength, cost reductions can be achieved, while revenues can be increased. The
benefits, in short, more than pay for the implementation of the data warehouse.
Then there are
the technical considerations that translate to business benefits for the
organisation. Today, technology is such that separate systems – with the
concomitant investment and ongoing operational costs – could be unnecessary.
The same is true of compromise approaches, or “halfway houses,” as discussed
later, where an intermediate database is used to attempt to furnish intelligent
data rapidly. Not only is this approach very much a poor cousin to the data
warehouse, yielding inferior data results, but it also exposes the business to
greater costs in terms of setting the solution up, maintaining it and then
replacing it in the future.
Disharmony and differing demands
The need to cater
for disparate demands is why the data warehouse has traditionally been
implemented separately from operational systems: they each have different
profiles and make different demands on hardware and applications. Technically,
therefore, the IT department has faced a range of conflicts between
performance, operational and user requirements of the respective systems.
For example,
because an operational system is usually built for a transaction processing
workload, it needs to cater for multiple concurrent short-lived transactions, mixing
queries with updates. The data warehouse, in contrast, supports a smaller user
base and longer-lived queries. In detail, for instance, while the data
warehouse typically benefits from a disk configuration optimised for high
transfer rates, the operational system hardware needs to support a higher
volume of individual random disk operations. While it is often a good thing
that hardware and operating systems handle mixed workloads, using resources
effectively, performance issues arise not with the hardware but from the hosted
applications or the database. These are often in the form of contention with
resources, or excessive consumption of resources, such as rollback images.
The invisible and the visible
Operational
systems and data warehouse systems have different optimal schemas. The former
aims to achieve performance and the maintenance of constant transactional
integrity, so the schema is designed that way. Users won’t be aware of the
schema – it will be ‘hidden’, whereas the schema in a data warehouse is likely
to be more visible to users. Why? Because for a data warehouse – in order to
fully exploit its potential – the schema needs to be intuitive to users,
offering the ability to undertake flexible queries without resorting to
multi-way joins. Complex joins are difficult for users and often incur a big
performance hit. In this instance, the usual is a de-normalised star schema
built from a central fact table surrounded by dimension tables.
Choosing and tuning the indexes
There are issues
with indexes too, as the systems require different ones. Query flexibility and
the ability to perform index scans efficiently, as well as good selectivity
from a combination of predicates in a where clause, means that a bit map index
is useful in a data warehouse. However, for an operational system, poor
concurrency of bit maps makes it unsuitable as contention and performance
degradation is highly likely to occur, resulting from multiple sessions
attempting to update and query the same bit map index.
As for hash clusters,
they should not be used for full table scans, as often occurs with a data
warehouse, because a hashed organised table is spread over more blocks than an
equivalent heap organised table. Hash organised tables are, in short, for
static data, while B-trees might be used more appropriately for both systems as
they provide good concurrency and flexible queries. However, since they can
become rapidly disk IO bound and can cause excessive database checkpoint
activity, they must be used with care on heavily updated tables of significant
size.
Another
justification cited for separate systems is scheduling differences. The
platform for the two systems is likely to be handled differently in terms of
administration and upgrading. The availability requirements of an operational
system may be strict, while those for the data warehouse are less so thus
giving different scheduling cycles.
Implementation and potential pitfalls
Implementation of
a data warehouse means facing several potential pitfalls. Aside from the
technical issues, there are many commercial and people-related issues that
typically arise with a major development project. From budget for the project
to getting resource and skills, there are hurdles at every corner, including
overcoming the politics associated with setting up a new area.
But technically
too, there are many potential pitfalls. Consider integration with operational
systems – a major issue to be addressed. Operational systems will load the data
warehouse, necessitating integration at several levels. This is from the
fundamental levels (such as network, hardware and software) to application
areas – data representation, data semantics and data schema. Integration
challenges must be faced up to with the application areas regardless of where
the data warehouse is, but issues faced in relation to the network, hardware
and software are actually highly likely to result from having chosen to
separate the data warehouse from operational systems.
A costly
‘half-way house’
Then there is the
issue of the playoff between the much desired timeliness and completeness. For
any data warehouse to be of benefit to the enterprise, it needs to satisfy both
wide-ranging queries aimed at understanding long-term trends, and queries of
up-to-the-minute data, which target the current short-term position of the
business. It is difficult, if not impossible, to resolve these conflicting
requirements within the constraints of the loading window.
This conflict has
been difficult to resolve, and in order to do so, many organisations have
turned to the typical solution of “the halfway house”, as referenced earlier. This
is the implementation of an intermediate database that provides the business
with its volatile and intermediate summary business information but very much
on a restricted query basis. Needless to say, to introduce such a solution
introduces more complications to the technical solution – not to mention
greater implementation and operational costs. If an operational system and a
data warehousing system living in harmony on one server can avoid this one
issue alone, apart from all the rest, surely it is worthy of consideration?