Achieving the best query performance
To achieve
optimal performance for queries, the schema used by a data warehouse will be
different to that used in operational systems, and to convert from an
operational to a warehouse schema means employing a combination of
de-normalisation and over-normalisation.
De-normalisation
combines columns and or tables to avoid join operations at query time. This is
usually more intuitive for end users and significantly enhances performance.
Over-normalisation partitions tables horizontally and/or vertically to improve
performance. Partitioning opens up the possibility of parallel data loading and
parallel query execution and also provides an effective method for rolling in
and rolling out historical data. This latter aspect can be central to achieving
timely updates to the data warehouse.
Data aggregation
is another possibility. Using this approach to optimise query performance
involves pre-computing sums and counts of data across various dimensions as the
data is loaded, meaning queries not having to do this work at query evaluation
time. This can substantially improve query performance as a query may only need
to fetch a handful of rows from an aggregation table - and the cost of
performing such calculations is moved from query time to load time. If load
windows are tight, aggregation can become an issue in its own right.
The march of progress: database technology
Database
technology has moved on since the underlying architecture for data warehouses
was first established. While performance improvements have been achieved for
the separated warehouse, they have been based on unchanged fundamental
principles and assumptions, and have largely ignored the march of technological
progress in database technology. These database enhancements 'moved the goal
posts', to the degree that new technologies and database features make it quite
reasonable to unite operational and decision support functions - so that they
can at last live in harmony with one another.
We are not
suggesting that there is a nirvana where a complete methodology for
implementing disparate application on a single database exists. That needs time
to evolve. However, there are some avenues worth exploring should a full data
warehouse not be suitable or worthwhile or where the availability schedule for
the operational system is flexible.
Data replication
plays a key role here. One database serving both operational and data warehouse
systems needs to use data replication to separate the two applications, with
the operational system using its conventional transaction optimised schema
which is then replicated and mapped to a data warehouse optimised schema on the
same database. Using continual incremental synchronisation data is replicated
onto the warehousing system. The big question here is, what indexes are now
suited to the warehouse schema? Big tables with dynamic data, such as a fact table,
rule out B-Tree indexes, as well as hashed clusters. Big tables result in both
becoming disk IO bound during index updates, which is likely to seriously impact
the synchronisation process. For small or medium cardinality columns (assuming
that multiple synchronisation processes are not updating the same index) bit
map indexes may be acceptable. But the avoidance of contention between the
synchronisation process and user queries must still be considered.
Third-party indexes: a step toward harmony?
Today, there are
alternatives. Third-party indexes have been developed to integrate with the
database and they can provide the generic functionality of B-trees without
suffering the same performance impact during index updates. In addition to
being fast to update and fast to query, they can be used for fact tables that
need to be continuously synchronised as queries are simultaneously performed.
In looking to bring the data warehouse into a harmonious relationship with the
operational system, these indexes provide a compelling alternative to
conventional B-tree indexes, hash indexes and bit maps. They are key to
attaining good performance with incremental updates.
New indexing
technologies, such as Adaptive Addressing, provide a more efficient and
flexible alternative to conventional indexing techniques, reducing the
overheads and constraints traditionally associated with indexing. There are
also data aggregation techniques, which can further eliminate some of the
indexing requirements and provide fast and flexible access to data analysis.
So the approach
is worth considering, as by achieving effective incremental synchronisation,
data in the warehouse is far more up-to-date than would be possible with a
traditional warehouse, which often involves a bulk transfer of data through
file systems and staged rebuilding of indexes at infrequent intervals.
Separate schemas
mean that data warehouse tables and indexes can reside on different disks from
the operational data, which avoids disk IO contention. And separate dedicated
rollback resources can be allocated and tuned to meet the different demands of
the small transactions for operational schema and the large transactions for
the warehouse schema. It is also worth considering separate buffer pools to
guarantee that warehouse queries do not flush out any blocks cached for the
operational system.
The block size
should suit the operational system, which is fortuitous since the operational
system is likely to exist and changing the block size is not an easy option.
Setting an appropriate multi-block scan size will be necessary, however, as the
warehouse queries will require a larger transfer size for full table scans.
What are likely
to stay the same for a co-resident scheme are the decisions made about de-normalising
and over-normalising in a separate warehouse schema, which are taken to
optimise query performance and allow an efficient roll out of historical data.
Optimum implementation of the incremental synchronisation mechanism will demand
some additional changes, and attention must be paid to the synchronisation
process for transactional consistency, as queries are active while the
incremental update is in progress. To make sure that the decision support
queries are against a schema optimised for them, any decision support queries should,
of course, be confined to the warehouse schema and kept away from the
operational schema.
One for all?
Ultimately, it must
always be remembered that any organisation needs to objectively assess the
strengths and weaknesses of all approaches based on what the business actually
needs. But what must be borne in mind is that new database technology is
playing a more significant role than before, and that there are ever more
compelling arguments for achieving a state of harmony when implementing a
co-resident schema with incremental updates.
For one thing,
warehouse data is made more relevant and useful due to the timeliness of the
update. Additionally, there is likely to be a reduction of operating costs, as
deployment of one database negates the need for another hardware platform, of
course, as well as bulk data transfers between database instances and their
probable resulting complications.
In essence,
organisations need to cut their own way through the dense undergrowth of data
today and the data warehouse issue. One system will most likely give more
equitable sharing of the total resources than separated hardware platforms, but
whichever route is taken more disks, more memory and more CPUs is probable.
An important
point to remember is that the "halfway house" route may well represent a risky
option and is highly likely to result in significantly greater costs to the
business in the long term. It is doubtful whether organisations adopting this
approach will achieve the true business benefits of data warehousing - such as
the rapid access to the type of high quality, reliable and intelligent data
that can be used to drive the business forward.
Another key point
to keep high on the agenda is the impact of advancements. Today, both
traditional and accepted wisdoms are constantly under scrutiny - they are being
tested fiercely by evolving technologies and thinking. It is precisely for this
reason that co-hosting a data warehouse with an operational system must be a
serious consideration for any forward-thinking business.