Operationally living in harmony with data warehouses - Page 2
December 18, 2003
Achieving the best query performance
To achieve optimal query performance, the schema used by a data warehouse differs from that used in operational systems; converting an operational schema into a warehouse schema means employing a combination of de-normalisation and over-normalisation.
De-normalisation combines columns and/or tables to avoid join operations at query time. This is usually more intuitive for end users and significantly enhances performance. Over-normalisation partitions tables horizontally and/or vertically to improve performance. Partitioning opens up the possibility of parallel data loading and parallel query execution, and also provides an effective method for rolling historical data in and out. This latter aspect can be central to achieving timely updates to the data warehouse.
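The roll-in/roll-out of historical data described above can be sketched with a simple in-memory stand-in for range-partitioned storage. The names here (`FactStore`, `roll_out`, the monthly partitioning key) are illustrative assumptions, not any real warehouse API:

```python
from datetime import date

# Minimal sketch of horizontal (range) partitioning by month, using an
# in-memory stand-in for a partitioned fact table.  All names are
# illustrative, not a real warehouse API.
class FactStore:
    def __init__(self, retain_months: int):
        self.retain_months = retain_months
        self.partitions: dict[str, list[dict]] = {}  # "YYYY-MM" -> rows

    def load(self, row: dict) -> None:
        # Route each row to the partition covering its sale date.
        key = row["sale_date"].strftime("%Y-%m")
        self.partitions.setdefault(key, []).append(row)

    def roll_out(self) -> list[str]:
        # Drop the oldest partitions beyond the retention window -- in a
        # real system a cheap metadata operation, not row-by-row deletes.
        dropped = sorted(self.partitions)[:-self.retain_months]
        for key in dropped:
            del self.partitions[key]
        return dropped

store = FactStore(retain_months=2)
for d in (date(2003, 10, 5), date(2003, 11, 9), date(2003, 12, 1)):
    store.load({"sale_date": d, "amount": 100})
print(store.roll_out())          # ['2003-10'] -- oldest month rolled out
print(sorted(store.partitions))  # ['2003-11', '2003-12']
```

The point of the sketch is that rolling out a whole partition touches only partition metadata, which is what makes timely, repeated purges of historical data practical.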
Data aggregation is another possibility. Using this approach to optimise query performance involves pre-computing sums and counts of data across various dimensions as the data is loaded, so that queries do not have to do this work at evaluation time. This can substantially improve query performance, as a query may only need to fetch a handful of rows from an aggregation table, and the cost of performing such calculations is moved from query time to load time. If load windows are tight, however, aggregation can become an issue in its own right.
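The load-time/query-time trade-off above can be sketched as follows; the `(region, month)` dimension pair and the function names are assumptions for illustration only:

```python
from collections import defaultdict

# Minimal sketch of maintaining an aggregation table at load time,
# keyed by an assumed (region, month) dimension pair.
agg = defaultdict(lambda: {"sum": 0.0, "count": 0})

def load_fact(region: str, month: str, amount: float) -> None:
    # Pay the aggregation cost once, as each fact row is loaded...
    cell = agg[(region, month)]
    cell["sum"] += amount
    cell["count"] += 1

def avg_sale(region: str, month: str) -> float:
    # ...so a query fetches one pre-computed row instead of
    # scanning and summing the underlying fact rows.
    cell = agg[(region, month)]
    return cell["sum"] / cell["count"]

for amount in (120.0, 80.0, 100.0):
    load_fact("EMEA", "2003-12", amount)
print(avg_sale("EMEA", "2003-12"))  # 100.0
```

It also makes the closing caveat concrete: every loaded row now does extra work in `load_fact`, which is exactly what squeezes a tight load window.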
The march of progress: database technology
Database technology has moved on since the underlying architecture for data warehouses was first established. While performance improvements have been achieved for the separated warehouse, they have been based on unchanged fundamental principles and assumptions, and have largely ignored the march of technological progress in database technology. These database enhancements 'moved the goal posts', to the degree that new technologies and database features make it quite reasonable to unite operational and decision support functions - so that they can at last live in harmony with one another.
We are not suggesting that there is a nirvana where a complete methodology for implementing disparate applications on a single database exists; that needs time to evolve. However, there are some avenues worth exploring where a full data warehouse is not suitable or worthwhile, or where the availability schedule for the operational system is flexible.
Data replication plays a key role here. One database serving both operational and data warehouse systems needs to use data replication to separate the two applications: the operational system uses its conventional transaction-optimised schema, which is then replicated and mapped to a warehouse-optimised schema on the same database. Using continual incremental synchronisation, data is replicated onto the warehouse schema. The big question here is: what indexes are now suited to the warehouse schema? Big tables with dynamic data, such as a fact table, rule out B-tree indexes as well as hashed clusters, since with big tables both become disk-IO bound during index updates, which is likely to seriously impact the synchronisation process. For small- or medium-cardinality columns, bit map indexes may be acceptable (assuming that multiple synchronisation processes are not updating the same index), but the avoidance of contention between the synchronisation process and user queries must still be considered.
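To make the bit map option concrete, here is a minimal sketch of a bitmap index on a low-cardinality column, assuming integer row ids and using a Python integer as the bitmap; real implementations compress the bitmaps and the class name is hypothetical:

```python
# Minimal sketch of a bitmap index on a low-cardinality column.
# One bitmap per distinct value; bit i set means row i has that value.
class BitmapIndex:
    def __init__(self):
        self.bitmaps: dict[str, int] = {}  # value -> bitmap

    def insert(self, row_id: int, value: str) -> None:
        # Setting a single bit is cheap, which is why bitmap updates
        # can keep pace with an incremental synchronisation process.
        self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row_id)

    def rows_equal(self, value: str) -> list[int]:
        # Decode the bitmap back into matching row ids.
        bits = self.bitmaps.get(value, 0)
        return [i for i in range(bits.bit_length()) if bits >> i & 1]

idx = BitmapIndex()
for row_id, region in enumerate(["EMEA", "APAC", "EMEA", "AMER"]):
    idx.insert(row_id, region)
print(idx.rows_equal("EMEA"))  # [0, 2]
```

The cardinality caveat is visible in the structure: one bitmap per distinct value is compact for a handful of regions, but hopeless for a column with millions of distinct keys.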
Third-party indexes: a step toward harmony?
Today, there are alternatives. Third-party indexes have been developed to integrate with the database and they can provide the generic functionality of B-trees without suffering the same performance impact during index updates. In addition to being fast to update and fast to query, they can be used for fact tables that need to be continuously synchronised as queries are simultaneously performed. In looking to bring the data warehouse into a harmonious relationship with the operational system, these indexes provide a compelling alternative to conventional B-tree indexes, hash indexes and bit maps. They are key to attaining good performance with incremental updates.
New indexing technologies, such as Adaptive Addressing, provide a more efficient and flexible alternative to conventional indexing techniques, reducing the overheads and constraints traditionally associated with indexing. There are also data aggregation techniques, which can further eliminate some of the indexing requirements and provide fast and flexible access to data analysis.
So the approach is worth considering: by achieving effective incremental synchronisation, data in the warehouse is far more up-to-date than would be possible with a traditional warehouse, which often involves bulk transfers of data through file systems and staged rebuilding of indexes at infrequent intervals.
Separate schemas mean that data warehouse tables and indexes can reside on different disks from the operational data, which avoids disk IO contention. And separate dedicated rollback resources can be allocated and tuned to meet the different demands of the small transactions for operational schema and the large transactions for the warehouse schema. It is also worth considering separate buffer pools to guarantee that warehouse queries do not flush out any blocks cached for the operational system.
The block size should suit the operational system; this is convenient, since the operational system is likely to exist already and changing its block size is not an easy option. Setting an appropriate multi-block scan size will be necessary, however, as warehouse queries will require a larger transfer size for full table scans.
What are likely to stay the same for a co-resident scheme are the decisions made about de-normalising and over-normalising in a separate warehouse schema, which are taken to optimise query performance and allow an efficient roll-out of historical data. Optimum implementation of the incremental synchronisation mechanism will demand some additional changes, and attention must be paid to the transactional consistency of the synchronisation process, as queries are active while the incremental update is in progress. To make sure that decision support queries run against a schema optimised for them, they should, of course, be confined to the warehouse schema and kept away from the operational schema.
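One way to picture the consistency requirement is a copy-and-swap scheme: queries read an immutable snapshot while the synchroniser swaps in each batch atomically. This is only a sketch of the idea under assumed names; a real warehouse would rely on the database's own transaction and read-consistency mechanisms:

```python
import threading

# Minimal sketch of applying an incremental batch atomically while
# queries stay active.  All names are illustrative.
class WarehouseTable:
    def __init__(self):
        self._rows: tuple = ()
        self._lock = threading.Lock()

    def snapshot(self) -> tuple:
        # Queries read an immutable snapshot: they see either the state
        # before a batch or after it, never a half-applied batch.
        return self._rows

    def apply_batch(self, batch: list) -> None:
        with self._lock:  # one synchronisation process at a time
            # Build the new state, then swap it in as a single step.
            self._rows = self._rows + tuple(batch)

table = WarehouseTable()
table.apply_batch([("2003-12-18", 100), ("2003-12-18", 250)])
print(len(table.snapshot()))  # 2
```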
One for all?
Ultimately, it must always be remembered that any organisation needs to objectively assess the strengths and weaknesses of all approaches based on what the business actually needs. But what must be borne in mind is that new database technology is playing a more significant role than before, and that there are ever more compelling arguments for achieving a state of harmony when implementing a co-resident schema with incremental updates.
For one thing, warehouse data is made more relevant and useful by the timeliness of the updates. Additionally, there is likely to be a reduction in operating costs, as deploying a single database removes the need for a second hardware platform, as well as for bulk data transfers between database instances and their probable resulting complications.
In essence, organisations today need to cut their own path through the dense undergrowth of data and the data warehouse issue. One system will most likely give more equitable sharing of the total resources than separate hardware platforms, but whichever route is taken, more disks, more memory and more CPUs are probable.
An important point to remember is that the "halfway house" route may well represent a risky option and is highly likely to result in significantly greater costs to the business in the long term. It is doubtful whether organisations adopting this approach will achieve the true business benefits of data warehousing - such as the rapid access to the type of high quality, reliable and intelligent data that can be used to drive the business forward.
Another key point to keep high on the agenda is the impact of advancements. Today, both traditional and accepted wisdoms are constantly under scrutiny - they are being tested fiercely by evolving technologies and thinking. It is precisely for this reason that co-hosting a data warehouse with an operational system must be a serious consideration for any forward-thinking business.