Identifying and Evaluating I/O Bottlenecks in Oracle Database
December 29, 2010
Synopsis. Oracle 11g Release 2 (11gR2) provides excellent tools that help isolate and identify potential root causes of poor I/O performance. This article – the sixth in this series – illustrates the wealth of I/O performance tuning information that Oracle 11gR2’s Automatic Workload Repository (AWR) reports provide, how to use that information to determine if any I/O performance bottlenecks actually exist, and how to form initial hypotheses to resolve those bottlenecks within the database’s underlying input/output (I/O) subsystem.
The prior article in this series continued the demonstration of how to use Quest Software’s Benchmark Factory tool in concert with Oracle Database 11gR2 to:
Now that I’ve at last successfully generated both TPC-H and TPC-E workloads, it’s time to turn my attention to how well my database’s I/O infrastructure is actually performing … and that means it’s time to pore over the resulting performance metrics that Oracle 11gR2 is quite excellent at capturing. In this article I’ll therefore concentrate on:
I/O Performance Analysis: First Steps
One of the features of Quest’s Benchmark Factory tool that I really appreciate is the ability to view in real time how the simulated application workload being generated is affecting components of my database system’s performance. Figure 6.1.1 below shows an example of the TPC-E 50-User application workload generation at just about the halfway mark of its execution cycle. As you can see, several “spikes” have occurred as the workload progressed, most notably in the average execution time of the workload and the average throughput in bytes per second.
While this is certainly interesting information, it really tells me very little about how well the database system is really handling the generated workload’s demand. To find that out, I’ll turn my attention to what I’ve found is the most valuable avenue for attacking just about any Oracle database performance issue, whether perceived or real: the Automatic Workload Repository (AWR) reports for the time period(s) when sub-optimal or questionable database response times have been reported.
(Yes, I know: AWR is a separately licensed feature of the Oracle 11gR2 Diagnostic Pack, but in my experience, it has become the de facto database performance tuning tool for most Oracle database shops. While Statspack reports will certainly provide sufficient details for I/O performance tuning, AWR reports offer significantly more detail, are easier to navigate, and are more efficient to execute because they access performance metric data that’s already been captured within AWR snapshots … and those snapshots were gathered in real time using direct memory access.)
I’ve captured the complete contents of the AWR reports for both the TPC-E 50-user workload as well as for the TPC-H 15-user workload, but for now I’ll focus on separate sections of these reports to demonstrate what specific information I normally investigate to determine if there are any real I/O performance-related issues.
Comparing AWR Reports from Different Periods. It’s not unusual for Oracle DBAs to support competing workloads at different times of the month, week, and even day – especially for “hybrid” databases that may have started out as OLTP platforms, but now encapsulate a DSS or OLAP reporting component. But trying to identify the I/O performance “footprints” that represent application workloads over different timeframes can be downright tedious! I’ve found that many Oracle DBAs aren’t aware of one of my favorite AWR features: its ability to highlight dramatic differences between two different application workloads by directly comparing their corresponding performance metrics via its Compare Periods feature.
To illustrate, I’ve created an AWR Compare Period Report that shows the difference between the TPC-E and TPC-H workloads. What I really like about this feature is that it immediately highlights the major differences between the application workloads generated during these two timeframes. For example, note the significant difference between:
I/O Performance Analysis: Top Five Wait Events
The first thing I’ll focus on is whether or not there are any I/O related wait events that need investigation, and I’ll do that by reviewing the Top Five Wait Events section in each of the AWR reports. I’ve reproduced them for the TPC-E and TPC-H application workloads in Figure 6.2.1 and Figure 6.2.2 below. (For an excellent summary of what each of these wait events actually portend, I recommend Section 10.3 of the Oracle Database 11gR2 Performance Tuning Guide.)
Figure 6.2.1. TPC-E Workload: Top Five Wait Events
Figure 6.2.2. TPC-H Workload: Top Five Wait Events
From these two report sections, it’s pretty obvious that my database is performing significant amounts of I/O during each of these application workload execution timeframes…and that’s not necessarily a bad thing. After all, if my database is spending most of its time executing SQL statements and either reading data from or writing data to my database’s datafiles, then it’s doing work rather than waiting for, say, a latch or mutex to be released. What’s important at this stage of the analysis is to realize that there may indeed be opportunities for tuning the I/O subsystem or even the database instance itself to take better advantage of what’s clearly already a pretty well-tuned set of SQL statements.
I/O Performance Analysis: I/O Profiles
I/O Profile Section. Starting in Oracle 11gR1, I/O-related performance metrics and statistics are also available, both from within Enterprise Manager for real-time I/O performance “firefighting” and within AWR snapshots. Oracle 11g breaks out this information across several dimensions:
Figure 6.3.1. TPC-E Workload: I/O Profile Within Function
Figure 6.3.2. TPC-E Workload: I/O Profile Within File Type
Figure 6.3.3. TPC-E Workload: I/O Profile Within Function and File Type
I’ve provided a link to the I/O Profile Section for the TPC-H workload’s corresponding AWR report.
I/O Performance Analysis: Tablespace and Datafile I/O Profiles
The final two sections of the AWR report that are most helpful for isolating I/O bottlenecks describe the I/O patterns for the database’s tablespaces and each tablespace’s corresponding datafiles. For the TPC-E workload, I’ve captured just the first five tablespaces that are the “hottest” in terms of combined read/write activity. This report section makes it simple to see that the TPCE_TRADE_DATA tablespace is getting pummeled the hardest. This makes sense since that tablespace contains the data segments for table TPCE.E_TRADE_DATA, and that table is most heavily accessed during the TPC-E workload:
Figure 6.4.1. TPC-E Workload: Five “Hottest” Tablespaces
The picture is quite different, of course, for the corresponding tablespaces for the TPCH schema objects. Here, just two of the several tablespaces were accessed predominantly during that workload’s execution, and most prominent was the temporary tablespace named TPCH_BIGTEMP. This is not unexpected, since the TPC-H workload is meant to simulate a DSS environment, and that means there’s going to be a fair amount of sorting and aggregation that takes place outside the database’s program global area (PGA):
Figure 6.4.2. TPC-H Workload: Five “Hottest” Tablespaces
Datafile I/O Metrics. Since this section of the AWR report tends to be among the lengthiest, I’ve provided links to the extremely granular datafile I/O information for the TPC-E workload and the TPC-H workload, respectively. I’ll use the information within these report sections in the next article in this series to probe which datafiles and temporary files might benefit most from faster I/O devices or removal of other potential I/O bottlenecks.
I/O Performance: Hypotheses For Improvement
Now that I’ve gathered these metrics, I can begin to make some intelligent decisions about where to concentrate my I/O performance tuning efforts:
I’ll begin to provide answers to the questions I’ve just proposed in the next article in this series, which will concentrate on how to:
References and Additional Reading
Before you proceed to experiment with any of these new features, I strongly suggest that you first look over the corresponding detailed Oracle documentation before trying them out for the first time. I’ve drawn upon the following Oracle Database 11g Release 2 documents for this article’s technical details:
E10881-02 Oracle Database 11gR2 New Features
E10595-05 Oracle Database 11gR2 Administrator’s Guide
E10713-03 Oracle Database 11gR2 Concepts
E10820-02 Oracle Database 11gR2 Reference
E10500-02 Oracle Database 11gR2 Storage Administrator’s Guide