Synopsis: Even though today’s enterprise storage system technology has dramatically reduced the impact of I/O performance problems, it’s not impossible that an Oracle 11gR2 database could be affected by insufficient I/O bandwidth or latency. Oracle 11g Release 2 (11gR2) makes it easier than ever to isolate and identify potential root causes of poor I/O performance. This article — the seventh in this series — illustrates how an Oracle DBA and his storage administration counterparts can leverage “the right storage for the right job” from the database’s underlying input/output (I/O) subsystem to significantly lessen and possibly even completely resolve I/O performance bottlenecks.
The previous articlein this series discussed how to:
- Evaluate AWR reports for indicators of any I/O performance bottlenecks
- Formulate initial hypotheses on how well my database’s storage infrastructure is responding to different application workloads
Since I now have relatively accurate information on what my storage system’s performance is under two different simulated application workloads, I now know exactly where to apply the appropriate leverage to improve that performance … and just as importantly, which parts of the storage system that I’d be wasting time and effort to attempt improve performance. This article will therefore investigate how to:
- Improve I/O response time via intra-database solutions
- Improve I/O response time through enhancements to the I/O subsystem
But first let me address one of the more puzzling issues I’ve encountered in my discussions with DBAs and storage administrators in the past several months as well as during my experiments throughout this article series: the apparent “unrepeatability” of TPC workloads.
The conundrum of ‘repeatable’ TPC workloads
As I’ve used Quest Software’s Benchmark Factory (BF) to generate the simulated TPC-E and TPC-H application workloads throughout this article series (and for my own clients too) for the past several years, I’ve noticed one rather surprising fact: TPC workloads are not really deterministic. In other words, even though I’ve run the identical TPC-C, TPC-E or TPC-H BF script numerous times against the same database system, I’ve often noticed that the number of transactions, physical reads and physical writes are rarely identical. Let me stress that this apparent lack of reliable repeatability is nota deficiency of Benchmark Factory; rather, it’s inherent in the concept of TPC workloads and well-described in the documentation from the Transaction Processing Council. In fact, the TPC standards actually mandate that workloads should be variable and random to simulate a real-world scenario in which numerous users “pound away” at an OLTP, DSS, or “hybrid” database application.
What I really need to prove is that once I’ve applied changes to my database’s storage subsystem to improve its performance, the database application workload is now performing better or worse after these changes. The only way to do that is to execute precisely the same application workload as many times as necessary, and then use the tools at hand — OS utilities, AWR and/or Statspack reports, ADDM analyses, and so forth — to obtain unequivocal proof on the storage system’s performance. Fortunately, Oracle 11g provides me with exactly the tools to handle this requirement: Database Replay, part of the Real Application Testing (RAT) suite. (Yes, I know it’s a separately licensed feature of Oracle 11g; I’ve documented its potential extensively in a prior set of articles on Oracle 11g Release 1 new features.)
Generating and capturing a realistic, repeatable workload
To generate a reliable, repeatable workload, I’ll adapt and upgrade the same modules I built for the prior Database Replay article series. The resulting “hybrid” workload generated will include a simulated OLTP application that writes several thousand rows to a relatively simple set of Accounts Payable (AP) tables, AP.INVOICES and AP_INVOICE_ITEMS; it will also include a simulated OLAP/DSS workload that joins these tables via SQL statements and views to existing tables in the SH “example” schema that’s included within the standard Oracle 11gR2 “seed” database.
NOTE: I’ve included the complete set of upgraded sample code here; should you decide to adapt it for your demonstration purposes, you should only need to modify the Oracle database name and destinations for your development database’s data files. These modules are only for demonstration of these Oracle 11g I/O tuning concepts and database features; they are not intended to be run within a production environment.
Preparing for a successful database replay
A key requirement for a successful Oracle 11gR2 Database Replay operation is the ability to return the test database to the point in time just before the application workload was captured, so I’ve taken the following steps to insure this:
- So that my test database can always be returned to the same state before I begin a new round of testing, I’ll leverage one of the most helpful features that Oracle added in Release 10gR2: the ability to “rewind” a database back to the same point in time using Flashback Database. This feature requires the use of flashback logs that are automatically written to the database’s Fast Recovery Area to rewind the database back to a chosen point in time. Note that the database must be in ARCHIVELOG mode to enable flashback log generation.
- To make it easier to manage the initial starting point of all testing, I’ll declare a guaranteed restore point that will quite literally guarantee that Oracle 11gR2 will retain all archived redo logs and flashback logs needed to rewind the database back to an exact point in time just before the Database Capture operation started.
- I’ll also turn off automatic generation of AWR snapshots; instead, I’ll opt to allow Database Replay to control AWR snapshot creation immediately before and just after a Capture or Replay operation.
Once I’ve got my application environment initialized I’ll issue the commands shown below to enable flashback logging, reset automatic snapshot generation, and declare a guaranteed restore point called GOLDENDB:
----- -- Modifies AWR Snapshot settings: -- 1.) Snapshots are NEVER taken automatically -- 2.) Snapshots will be retained FOREVER (MAX_RETENTION, or 100 years) -- 3.) The default number of SQL statements consuming the most resources -- (normally, 30 for high-resource Snapshots will be retained forever (MAX_RETENTION, or 100 years) --- BEGIN DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS( interval => DBMS_WORKLOAD_REPOSITORY.MAX_INTERVAL ,retention => DBMS_WORKLOAD_REPOSITORY.MAX_RETENTION ,topnsql => 'DEFAULT' ); END; / ----- -- Bring database into ARCHIVELOG mode and turn on Flashback Logging ----- SHUTDOWN IMMEDIATE; STARTUP MOUNT; ALTER DATABASE ARCHIVELOG; ALTER DATABASE FLASHBACK ON; ALTER DATABASE OPEN; ----- -- Create the GOLDENDB guaranteed restore point ----- DROP RESTORE POINT goldendb; CREATE RESTORE POINT goldendb GUARANTEE FLASHBACK DATABASE;
Starting workload capture
To capture the simulated workload, I’ll use the code in 11gR2_DBR_Capture.sql to start the capture via Oracle 11gR2 Database Replay. This code first erases any prior like-named workload captures and then starts a new capture named DBR_CAPTURE_100.
Generating the application workload
To generate the workload, I’ll use the code in shell script LG-RandomWorkload.sh to kick off 100 concurrent application user sessions, split between approximately 60% OLAP/DSS processing and 40% OLTP processing. I’ll use Oracle 11gR2 Enterprise Manager (EM) to monitor the workload as it progresses towards completion. One nice enhancement in this release of EM is an indicator in the top right hand corner of each tab (see Figure 7.1).
Figure 7.1: Oracle 11gR2 Enterprise Manager: Workload Capture Active Indicator
Completing workload capture
Once the workload completes, I’ll conclude its capture from within EM’s Database Workload Capture panel. I’ll also acknowledge and accept the prompt to create a DataPump export of the AWR snapshot that encapsulates the captured workload. This permits direct comparison of the captured workload’s AWR performance snapshot with any future DBR replay performance snapshot(s).
Finding room for improvement
As I showed in the prior article in this series, my next step is to take a look at the AWR reports for the workload that I’ve captured via Database Capture to determine if there are any candidate areas that I should focus my attention on for possible improvements to my database’s I/O performance.
Reviewing the captured workload’s performance statistics
As I expected, the datafiles for the AP.INVOICES and AP.INVOICE_ITEMS data segments stored in the AP_DATA tablespace and those tables’ corresponding index segments stored in the AP_IDX tablespace showed the highest utilization, with all the classic hallmarks of an intense hybrid OLTP / DSS environment. Since this simulated workload also uses data from tables in the SH schema, the EXAMPLE tablespace’s datafile is also heavily utilized. The Segment Statistics section of the AWR report for the captured workload clearly show this. In addition, the Tablespace IO Stats and File IO Stats sections of the same AWR report shows that the AP_DATA, AP_IDX, and EXAMPLE tablespaces as well as their corresponding datafiles are among the most utilized during the workload.
The first thing I can do to improve the database’s I/O performance is to shift these datafiles to faster media or, barring that possibility, at least shift those datafiles to media that’s isolated from the other tablespaces that participate in the workload. Fortunately, I’ve got an excellent trick up my sleeve as an Oracle DBA because it’s a relatively simple matter to migrate my database’s “hottest” data files to a file system that’s uniquely constructed for effective database I/O: Oracle 11gR2 Automatic Storage Management (ASM).
Replaying a captured workload
Now that I’ve captured a repeatable workload using Oracle 11gR2’s Real Application Testing suite, I’m ready to modify my database’s underlying storage system to see if I can beneficially improve the database’s overall I/O performance. The simplest change will be to migrate the datafiles for the AP_DATA and AP_IDX tablespaces from a traditional file system (in this case, EXT3) to a file system that’s specially designed for Oracle database I/O: Automatic Storage Management (ASM).
My database already had an ASM disk group, +DATA, but I’ve made sure that this disk group’s underlying ASM disks are on a different HDD separate from the original datafiles for the AP_DATA, AP_IDX and EXAMPLE tablespaces. (Of course, if this were a production environment I’d have to enlist the help of my storage administrator to find out if these HDD disk partitions were actually on a different physical device, but since I’m wearing both “hats” in my case, I know that these partitions were created on the proper physical device.) Before I migrate the tablespaces to ASM, however, I’ll first “rewind” my database to its prior guaranteed restore point:
SQL> SHUTDOWN IMMEDIATE; SQL> STARTUP MOUNT; SQL> FLASHBACK DATABASE TO RESTORE POINT GOLDENDB; SQL> ALTER DATABASE OPEN RESETLOGS;
Migrating the tablespaces to ASM
To migrate the two AP tablespaces to ASM storage, I’ll first use Oracle 11gR2 Recovery Manager (RMAN) commands to create an image copy backup of each tablespace’s datafiles to the +DATA disk group: Listing 7.1shows the RMAN command script and the result of the successful backup operation. Note that these three tablespaces remained online and continued to accept transactions while they were being backed up, so there is no disruption to application user activity.
Once the image copy backups are created, I’ll take the three tablespaces offline while I switch them to their new destination. Obviously, in a production environment, I’d need to perform this intra-database “switchover” during an off-peak period to minimize disruption to my application user community. But the good news here is that these tablespaces would be unavailable only for the time it would take RMAN to apply any pending redo log transactions for those tablespaces.