Oracle 11gR2 I/O Performance Tuning: Implementing I/O Performance Improvements
April 5, 2011

Synopsis: Even though today's enterprise storage system technology has dramatically reduced the impact of I/O performance problems, an Oracle 11gR2 database can still suffer from insufficient I/O bandwidth or excessive latency. Oracle 11g Release 2 (11gR2) makes it easier than ever to isolate and identify potential root causes of poor I/O performance. This article, the seventh in this series, illustrates how an Oracle DBA and his storage administration counterparts can leverage "the right storage for the right job" from the database's underlying input/output (I/O) subsystem to significantly lessen, and possibly even completely resolve, I/O performance bottlenecks.
The previous article in this series discussed how to:
The conundrum of 'repeatable' TPC workloads

As I've used Quest Software's Benchmark Factory (BF) to generate simulated TPC-E and TPC-H application workloads throughout this article series (and for my own clients, too) over the past several years, I've noticed one rather surprising fact: TPC workloads are not really deterministic. In other words, even though I've run the identical TPC-C, TPC-E, or TPC-H BF script numerous times against the same database system, the number of transactions, physical reads, and physical writes are rarely identical from run to run. Let me stress that this apparent lack of reliable repeatability is not a deficiency of Benchmark Factory; rather, it's inherent in the concept of TPC workloads and well described in the documentation from the Transaction Processing Performance Council (TPC). In fact, the TPC standards actually mandate that workloads be variable and random to simulate a real-world scenario in which numerous users "pound away" at an OLTP, DSS, or "hybrid" database application.
What I really need to prove is that once I've applied changes to my database's storage subsystem, the database application workload actually performs better (or worse) after those changes. The only way to do that is to execute precisely the same application workload as many times as necessary, and then use the tools at hand (OS utilities, AWR and/or Statspack reports, ADDM analyses, and so forth) to obtain unequivocal proof of the storage system's performance. Fortunately, Oracle 11g provides me with exactly the right tool for this requirement: Database Replay, part of the Real Application Testing (RAT) suite. (Yes, I know it's a separately licensed feature of Oracle 11g; I've documented its potential extensively in a prior set of articles on Oracle 11g Release 1 new features.)
Generating and capturing a realistic, repeatable workload

To generate a reliable, repeatable workload, I'll adapt and upgrade the same modules I built for the prior Database Replay article series. The resulting "hybrid" workload will include a simulated OLTP application that writes several thousand rows to a relatively simple set of Accounts Payable (AP) tables, AP.INVOICES and AP.INVOICE_ITEMS; it will also include a simulated OLAP/DSS workload that joins these tables via SQL statements and views to existing tables in the SH "example" schema that's included within the standard Oracle 11gR2 "seed" database.
NOTE: I've included the complete set of upgraded sample code here; should you decide to adapt it for your demonstration purposes, you should only need to modify the Oracle database name and destinations for your development database's data files. These modules are only for demonstration of these Oracle 11g I/O tuning concepts and database features; they are not intended to be run within a production environment.
Preparing for a successful database replay

A key requirement for a successful Oracle 11gR2 Database Replay operation is the ability to return the test database to the point in time just before the application workload was captured, so I've taken the following steps to ensure this:
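The cornerstone of this preparation is a guaranteed restore point created just before workload capture begins. A minimal sketch (the restore point name is my own choice; a fast recovery area must already be configured):

```sql
-- Create a guaranteed restore point so the database can later be
-- "rewound" to exactly this point in time before each replay
CREATE RESTORE POINT before_dbr_capture
  GUARANTEE FLASHBACK DATABASE;

-- Verify that the restore point exists and is guaranteed
SELECT name, guarantee_flashback_database, time
  FROM v$restore_point;
```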
Starting workload capture

To capture the simulated workload, I'll use the code in 11gR2_DBR_Capture.sql to start the capture via Oracle 11gR2 Database Replay. This code first erases any prior like-named workload captures and then starts a new capture named DBR_CAPTURE_100.
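The full script isn't reproduced here, but the heart of 11gR2_DBR_Capture.sql looks roughly like this sketch (the directory object name DBR_DIR is my assumption):

```sql
-- Assumes a directory object (here DBR_DIR, an illustrative name)
-- pointing at the OS directory that will hold the capture files
BEGIN
  -- Erase any prior capture metadata with the same name, if present
  FOR c IN (SELECT id
              FROM dba_workload_captures
             WHERE name = 'DBR_CAPTURE_100') LOOP
    DBMS_WORKLOAD_CAPTURE.DELETE_CAPTURE_INFO(capture_id => c.id);
  END LOOP;

  -- Start the new capture; all eligible user workload is recorded
  -- until FINISH_CAPTURE is invoked
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
      name => 'DBR_CAPTURE_100'
    , dir  => 'DBR_DIR');
END;
/
```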
Generating the application workload

To generate the workload, I'll use the shell script LG-RandomWorkload.sh to kick off 100 concurrent application user sessions, split between approximately 60% OLAP/DSS processing and 40% OLTP processing. I'll use Oracle 11gR2 Enterprise Manager (EM) to monitor the workload as it progresses toward completion. One nice enhancement in this release of EM is an indicator in the top right-hand corner of each tab that shows when a workload capture is active (see Figure 7.1).
Figure 7.1: Oracle 11gR2 Enterprise Manager: Workload Capture Active Indicator
Completing workload capture

Once the workload completes, I'll conclude its capture from within EM's Database Workload Capture panel. I'll also acknowledge and accept the prompt to create a DataPump export of the AWR snapshot that encapsulates the captured workload. This permits direct comparison of the captured workload's AWR performance snapshot with any future DBR replay performance snapshot(s).
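Behind the scenes, EM's actions correspond roughly to these DBMS_WORKLOAD_CAPTURE calls (a sketch; DBR_DIR is an assumed directory object name):

```sql
-- Conclude the active workload capture
BEGIN
  DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;
END;
/

-- Export the AWR data that brackets the capture, so it can later be
-- compared against the AWR data from any replay
DECLARE
  capture_id NUMBER;
BEGIN
  capture_id := DBMS_WORKLOAD_CAPTURE.GET_CAPTURE_INFO(dir => 'DBR_DIR');
  DBMS_WORKLOAD_CAPTURE.EXPORT_AWR(capture_id => capture_id);
END;
/
```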
Finding room for improvement

As I showed in the prior article in this series, my next step is to examine the AWR reports for the workload I've just captured to determine whether there are any candidate areas on which I should focus my attention for possible improvements to my database's I/O performance.
Reviewing the captured workload's performance statistics

As I expected, the datafiles for the AP.INVOICES and AP.INVOICE_ITEMS data segments stored in the AP_DATA tablespace, and those tables' corresponding index segments stored in the AP_IDX tablespace, showed the highest utilization, with all the classic hallmarks of an intense hybrid OLTP/DSS environment. Since this simulated workload also uses data from tables in the SH schema, the EXAMPLE tablespace's datafile is also heavily utilized. The Segment Statistics section of the AWR report for the captured workload clearly shows this. In addition, the Tablespace IO Stats and File IO Stats sections of the same AWR report show that the AP_DATA, AP_IDX, and EXAMPLE tablespaces, as well as their corresponding datafiles, were among the most heavily utilized during the workload.
The first thing I can do to improve the database's I/O performance is to shift these datafiles to faster media or, barring that possibility, at least shift them to media that's isolated from the other tablespaces that participate in the workload. Fortunately, as an Oracle DBA I've got an excellent trick up my sleeve, because it's a relatively simple matter to migrate my database's "hottest" datafiles to a file system that's uniquely constructed for effective database I/O: Oracle 11gR2 Automatic Storage Management (ASM).
Modifying the storage configuration

Now that I've captured a repeatable workload using Oracle 11gR2's Real Application Testing suite, I'm ready to modify my database's underlying storage system to see if I can improve the database's overall I/O performance. The simplest change will be to migrate the datafiles for the AP_DATA, AP_IDX, and EXAMPLE tablespaces from a traditional file system (in this case, EXT3) to a file system that's specially designed for Oracle database I/O: Automatic Storage Management (ASM).
My database already had an ASM disk group, +DATA, but I've made sure that this disk group's underlying ASM disks reside on a different HDD from the one holding the original datafiles for the AP_DATA, AP_IDX, and EXAMPLE tablespaces. (Of course, if this were a production environment I'd have to enlist the help of my storage administrator to find out whether these HDD disk partitions were actually on a different physical device; but since I'm wearing both "hats" in my case, I know that these partitions were created on the proper physical device.) Before I migrate the tablespaces to ASM, however, I'll first "rewind" my database to its prior guaranteed restore point:
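Assuming a guaranteed restore point was created before the capture (the name before_dbr_capture here is illustrative), the rewind looks like this from a SYSDBA SQL*Plus session:

```sql
-- FLASHBACK DATABASE requires the database to be mounted, not open
SHUTDOWN IMMEDIATE
STARTUP MOUNT
FLASHBACK DATABASE TO RESTORE POINT before_dbr_capture;
-- A flashback always requires opening with RESETLOGS
ALTER DATABASE OPEN RESETLOGS;
```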
Migrating the tablespaces to ASM

To migrate the AP_DATA, AP_IDX, and EXAMPLE tablespaces to ASM storage, I'll first use Oracle 11gR2 Recovery Manager (RMAN) commands to create an image copy backup of each tablespace's datafiles in the +DATA disk group. Listing 7.1 shows the RMAN command script and the result of the successful backup operation. Note that these three tablespaces remained online and continued to accept transactions while they were being backed up, so there was no disruption to application user activity.
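Listing 7.1 isn't reproduced here, but the essence of the image copy backup is a sketch like this, run from an RMAN session connected to the target database:

```
# RMAN: create image copies of the three tablespaces' datafiles
# directly inside the +DATA ASM disk group; the tablespaces stay online
BACKUP AS COPY
  TABLESPACE ap_data, ap_idx, example
  FORMAT '+DATA';
```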
Once the image copy backups are created, I'll take the three tablespaces offline while I switch them to their new destination. Obviously, in a production environment, I'd need to perform this intra-database "switchover" during an off-peak period to minimize disruption to my application user community. But the good news here is that these tablespaces would be unavailable only for the time it would take RMAN to apply any pending redo log transactions for those tablespaces.
Listing 7.2 shows the resulting output from these RMAN commands, as well as the contents of my database's alert log, as these datafiles were successfully "switched" to the image copy backups on the +DATA ASM disk group and then recovered. Once the recovery was complete, I simply brought the tablespaces back online, and the database could accept application workloads once again.
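The switchover sequence in Listing 7.2 boils down to RMAN commands of roughly this shape (shown for AP_DATA only; AP_IDX and EXAMPLE follow the same pattern):

```
# RMAN, connected to the target database:
SQL 'ALTER TABLESPACE ap_data OFFLINE';
SWITCH TABLESPACE ap_data TO COPY;     # repoint to the +DATA image copy
RECOVER TABLESPACE ap_data;            # apply redo generated since the backup
SQL 'ALTER TABLESPACE ap_data ONLINE';
```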
Replaying a captured workload

Now that I've reconfigured my database for potentially better I/O performance, I'll need to prove that my changes actually produce beneficial results. The best way to do that is to play back the captured workload against the new configuration using Oracle 11gR2's Database Replay features.
To accomplish this, I'll first have to preprocess the previously recorded application workload using the DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE procedure:
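A minimal sketch of that call (DBR_DIR is again assumed to be the directory object holding the capture files); preprocessing only needs to be done once per capture directory:

```sql
BEGIN
  -- Transform the raw capture files into replay-ready format, in place
  DBMS_WORKLOAD_REPLAY.PROCESS_CAPTURE(capture_dir => 'DBR_DIR');
END;
/
```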
Once the workload has been preprocessed, I'll prepare the database for its replay through procedures INITIALIZE_REPLAY and PREPARE_REPLAY of the DBMS_WORKLOAD_REPLAY package:
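A sketch of those two calls, with an illustrative replay name and the default timing and synchronization settings made explicit:

```sql
BEGIN
  -- Associate this replay attempt with the preprocessed capture
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
      replay_name => 'DBR_REPLAY_100'   -- illustrative name
    , replay_dir  => 'DBR_DIR');

  -- Put the database into PREPARE mode; these defaults preserve the
  -- captured workload's commit ordering and original timing
  DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY(
      synchronization    => TRUE
    , connect_time_scale => 100
    , think_time_scale   => 100);
END;
/
```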
Now that the database is ready to receive the workload, I'll start up a Workload Replay Client (WRC) session that points to the directory containing the previously-recorded, preprocessed workload:
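From an OS shell on the replay client machine, the wrc invocation looks roughly like this (credentials and directory path are illustrative):

```
$ wrc system/password mode=replay replaydir=/home/oracle/DBRCapture

# Optionally, run mode=calibrate first to estimate how many replay
# clients the captured workload requires:
$ wrc mode=calibrate replaydir=/home/oracle/DBRCapture
```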
Finally, I'll start the replay by executing procedure DBMS_WORKLOAD_REPLAY.START_REPLAY:
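The call itself takes no arguments; it simply signals all connected replay clients to begin submitting the workload:

```sql
BEGIN
  DBMS_WORKLOAD_REPLAY.START_REPLAY;
END;
/
```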
I can track the status of the workload replay via Enterprise Manager. Figure 7.2 shows EM's Database Replay panel once I've started the replay:
Figure 7.2: Oracle 11gR2 Enterprise Manager: Starting the Workload Replay

As the workload replay progresses toward its conclusion, I can view how much longer (or shorter) it's taking to complete relative to the original capture via this handy panel (see Figure 7.3):
Figure 7.3: Oracle 11gR2 Enterprise Manager: Monitoring the Active Workload Replay
Evaluating the new I/O configuration's effect

Once the application workload replay has completed, I'll generate a Workload Replay activity report and an AWR report to determine whether the modifications I introduced have had a beneficial, neutral, or detrimental effect on my database's performance, and most especially on the three tablespaces I just migrated from non-ASM storage to ASM. Based on the I/O report sections of the AWR report for the replayed workload, I've been able to increase the I/O performance of the three datafiles noticeably:
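Generating the Workload Replay report from SQL*Plus can be sketched like this (the replay name matches the illustrative name used at initialization; adjust as appropriate):

```sql
SET SERVEROUTPUT ON
DECLARE
  replay_id NUMBER;
  rpt       CLOB;
BEGIN
  -- Find the most recent replay attempt with this name
  SELECT MAX(id)
    INTO replay_id
    FROM dba_workload_replays
   WHERE name = 'DBR_REPLAY_100';

  -- Generate the replay report in plain-text format
  rpt := DBMS_WORKLOAD_REPLAY.REPORT(
             replay_id => replay_id
           , format    => DBMS_WORKLOAD_REPLAY.TYPE_TEXT);

  -- Print the first 4000 characters of the report
  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(rpt, 4000, 1));
END;
/
```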
Next steps

I'm still not 100% satisfied that migrating the hottest segments' datafiles to ASM is the only solution for improving the application workload's I/O performance. So in the next article in this series, I'll continue my investigations, including:
References and additional reading

Before you experiment with any of these new features, I strongly suggest that you first look over the corresponding detailed Oracle documentation. I've drawn upon the following Oracle Database 11g Release 2 documents for this article's technical details: