A Practical Guide to Data Warehousing in Oracle, Part I
November 25, 2003
If there is one adjustment to make to an OLTP mindset when you start working with data warehouses that I would emphasize more than any other, then it is that your new Enemy Number One is the disk subsystem. Nothing will cripple a data warehouse more effectively than a failure to manage your i/o requirement vs. your i/o capacity, and there will be continual reminders of this throughout the series.
Having said that, we move on to...
A Different Look at Disk Capacity
When you are defining hardware requirements for your data warehouse, a major component that defines your system's capabilities is going to be the disk subsystem. You may have some early data volume estimations for your first year of operations, but before you multiply-by-four and fill in the purchase order there are some important considerations you should know about.
I am writing this article on a laptop computer with a 60 GB internal hard drive, and an external 120 GB hard drive on a USB2.0 port. Now you could fit a pretty respectable 120 GB data warehouse in that sort of disk capacity, but of course no-one in their right minds would try to, for two reasons.
Firstly, there is no redundancy -- a single disk failure is a total system failure.
Secondly, whatever the wonders of USB2.0 interfaces it is just too slow.
So just how much disk capacity do you need in order to run a 120 GB data warehouse successfully? Obviously, you need at least 2 x 120 GB, in order to provide redundancy, because you are not even going to think about using RAID5 for your data warehouse - you are going to Stripe And Mirror Everything (SAME). Even then, how are you going to decide what disks you need, and how many?
Let's look at an example. If I have a table of 60 GB on my 120 GB USB2.0 drive, at a theoretical maximum read rate of 30MB/sec say, it is going to take me at least 30 minutes to full scan it, plus any time for which another process interrupts my work. That might be OK for an overnight batch job on an OLTP database, but it is a finger-tappingly long time to wait for an interactive query to return a twelve-line report.
If I could spread that data over eight disks, each with a read rate of 30Mb/sec, then my uninterrupted full scan would take me a little over four minutes. Thirty-two disks would give me one minute - throw some more sessions into the mix, and maybe the timing goes back up to four minutes. If this is the performance target I want to meet, and an economical disk selection turns out to be units of 64 GB each, then I am going to be requesting 2Tb of raw disk capacity for a database that you could carry around on a single external USB drive.
However, what you are really buying is a disk subsystem with a theoretical i/o capacity close to 1Gb/sec.
Therefore, here is a major paradigm shift. For years you may have planned capacity on an OLTP system in terms of how much disk space you require for the next six, twelve, or twenty-four months to ensure that your projected growth does not exceed your available space capacity. In the data warehousing world you may find that when your data volume reaches 50% of available capacity you are severely limited in terms of how quickly you can read that data - long before you face a space availability crisis you are already in an i/o crisis. The i/o crisis is sometimes a more difficult issue to manage because it doesn't have a clear point at which the system fails, in the manner of a disk space availability crisis. Performance just steadily declines, potentially rendering the system less and less usable from the user's point of view.
Be prepared not only to accept this new paradigm yourself, but also to lay out clear justification to project managers for your purchasing requirements.
Long-story-short: You are no longer buying Gigabytes - you are buying Gigabytes per second, and this is the new metric by which you should make your purchasing decisions.
Tablespaces and Data Files...
When you are thinking about your arrangement of tablespaces and data files, there are a number of desirable characteristics to aim for.
Now I recall that one of the earliest "facts" that I learned about Oracle tablespaces was that you should separate index and table segments into different sets of tablespaces. The rationale behind this was that it "de-conflicted" the i/o requirement for tables and their indexes, thus spreading i/o more evenly across the disks.
Well, that turned out to be incorrect. As has been well discussed elsewhere there is no performance advantage to this separation, because the index and the table are not accessed simultaneously. Indeed, it is more likely to be counter-productive, as it reduces the number of disks over which a table or index segment can be spread, and thus it increases the time taken to scan the segment by forcing the operation to contend with itself for i/o resources. So much for that idea - it failed both the ease-of-management and the optimum-performance tests.
The next thing I learned was that the management burden for tablespaces was very much simplified by separating segments into different tablespaces according to their size. Small, medium and large segments would each have their own tablespace, or maybe a set of tablespaces, and uncommonly large segments would have their own dedicated tablespace. For an OLTP system this could mean only having to maintain three or four tablespaces, other than Temp, RBS, System etc.
The simplification also came from setting initial and next extent sizes to a constant and pctincrease=0 at the tablespace level. This allowed full use of the tablespace without running into free space fragmentation issues, and with locally managed tablespaces (LMT's) this became even easier, as the constant extent size could be absolutely enforced, and 100% availability of free space was guaranteed.
A further characteristic of LMT's with uniform extents is the "round-robin" extent allocation methodology that Oracle uses. This enables us to achieve that rarest of goals - a performance improvement method that is both cheap and easy. A data segment of 1Gb can be assigned to an empty tablespace with uniform extent size of 1Mb, and eight data files, and you will get 128Mb of that segment on each datafile, with no two logically consecutive extents on the same datafile.
Therefore, manageability is improved, and one of the consequences of abandoning the religious prohibition on the mixing of tables and indexes leads you to the conclusion that the data files of your tablespace might as well be spread across all your available devices. Moreover, with the simplification of managing each individual tablespace comes the realization that tablespace management is simple enough for you to start letting those tablespaces proliferate again.
How about a single table spread across multiple tablespaces? Next time we will look at situations when that might be just the ticket.
Long-story-short: Use locally managed tablespaces with uniform extent sizes. For each reasonable-sized tablespace spread the i/o burden by using multiple data files that cover every available i/o device. Do not separate tables and indexes for performance reasons. Do not be afraid of having many tablespaces - as long as each one is correctly defined to be easy to manage, the whole set will follow along.
Bonus Subject: Control Your Feature Set
This is such an important issue that I considered devoting a whole article to it, but since 1,000 words of me getting preachy on a pure matter-of-opinion would probably be too much to stomach, I will keep it short.
An OLTP system can get along just fine on features of Oracle that have been available for many years - btree indexes, clusters, heap tables. There have been improvements in the ease of writing code, in the manageability of memory, and other issues, but let me venture to say that they are not essential to the design, build and operation of a robust and performant OLTP system.
In contrast, the features of Oracle that are almost entirely dedicated to data warehousing are not only dazzling with promises of unheard of leaps in performance, but also, and this is the key point, almost all very recent additions to the Oracle feature set.
Bitmap indexes appeared back around v7.3, but there were still some pretty serious issues up to 8i - change the nullability of a column and all your bitmap indexes went unusable on every partition. That was a real doozy.
Materialized views were pretty DW-friendly in 8i, but without Partition Change Tracking (9i, I believe) maintenance of them on a typical DW fact table was, shall we say, a challenge.
In 9iR2 we can now create a Hierarchical OLAP Cube, based on a composite range-list partitioned table with compressed data segments, indexed with a mix of bitmap and btree indexes (some compressed, others not) that you could store on tablespaces with different block sizes to that of the table, with disabled/novalidated/rely constraints against which we expect Oracle to perform general query rewrite and partition change tracking.
Now I am not saying that it is not going to work perfectly, but you had better be ready to test every possible operation you will ever want to perform on that object prior to throwing it at your users. You will want to know in advance that you will not be able to add a column to a compressed table, nor can you delete a column from it. You may think you can rely on using the documented ability to compress data using "Create Table ... Compress ... As Select ..." method, but what are you going to do when/if it just does not work? Pre-create the table and use "Insert /*+ append */ Into ..."? Watch out for that ORA-28604 if you do! One of my most recent attempts to avoid this error involved populating a fact table prior to applying not null constraints. Pretty foolproof, until neither the optimizer nor the DESCRIBE command recognized that such constraints existed on the table. Ouch.
You have to remember that when you throw ten different features into the air, they can land in a great many configurations. The exact combination of features you plan on implementing may well have been rigorously tested, but has it been tested on your platform, with your combination of initiation parameters, on a table with more than 16 columns? Until you have done that series of tests, do yourself a favor and assume that it hasn't.
Long-story-short: Test everything about the new feature you plan on implementing. Search Metalink for known bugs associated with the feature, and look for comments and questions in forums that might imply the existence of unknown bugs. Only implement new features when they show clear benefits to the system.
Next time ...
Tables: Partitioning them, compressing them, and more on when to say "No" to features.