Synopsis. Oracle DBAs have long wished to be able to capture applications’ workloads in a production environment and then replay that workload against a test environment to determine the impact of proposed database or application changes on database performance. Oracle Database 11gR1’s new Database Replay feature enables an Oracle DBA to capture, massage, and then replay workloads either selectively or in their entirety across a wide range of database environments and platforms. This article provides a primer for using Oracle 11g Database Replay to effectively predict the performance of rapidly changing applications within increasingly fluid database environments.
If my time served in information technology has taught me anything, it’s the utter truth behind Murphy’s Law: “Anything that can go wrong will go wrong.” Over the years, I’ve also learned the veracity of several corollaries to Murphy’s Law, including the fact that “interchangeable parts won’t” and especially the rejoinder that “Murphy was actually an optimist.” I hope to add my own corollary to these grim testimonials one day with the following observation obtained at great pain: “Nothing works in production like it does in the test environment.”
All attempts at ill humor aside, Oracle DBAs face a severe challenge on a regular basis: to predict accurately how the next set of changes to the database’s application code, database patch set, or hardware configuration will affect negatively the entire database environment. When I say the entire environment, I do mean that literally: every SQL statement that any running application has executed, regardless if it’s just a simple query or a complex batch job that issues a huge volume of DML statements, must be captured. This challenge has become even more acute because today’s typical application workloads are generated across a landscape that spans myriad technologies: n-tier application servers, web farms, and even client-server legacy applications. Moreover, when an application appears to be performing poorly, the root cause of a detected performance issue is often nearly impossible to track down; it may be the indirect result of an incorrect network configuration, the application server’s configuration, or even the application client’s environment variable settings.
The only chance to accomplish this lofty yet crucial goal is via an application workload capture-and-playback product suite. This type of application is specifically designed to capture the complete workload that’s been executed against the current production database environment (P+0) and then play back precisely that same workload against the next iteration of the production environment (P+1). However, my past experience has shown this means asking upper management to open the company checkbook to purchase a relatively expensive third-party solution (e.g. HP’s LoadRunner tool). The cost of licensing this software, the cost of procuring and configuring the servers it runs upon, and most especially the manpower expense to configure and prove out the capture-and-playback suite even before workloads can be captured and replayed may easily run into the six-figure USD range. And that’s why many IT organizations have surrendered to the perception that it’s simply impossible to test for application system regression in a cost-effective manner.
Beyond Performance Regression
My prior article series about Oracle Database 11g’s new SQL statement performance tuning features – SQL Performance Analyzer (SPA) and SQL Plan Management (SPM) – discussed how Oracle 11g allows the DBA to easily isolate SQL statements whose performance has either progressed, remained unchanged, or regressed after the application environment has been changed. Any robust capture and playback suite must certainly capture and compare the differences between the original (P+0) versus the future (P+1) system, application, and database performance statistics, especially SQL statements that are now performing dramatically worse. However, there are two other types of regression that any robust capture-and-playback suite needs to address as well:
Error Regression. When a captured workload is being played back, it’s not unusual to encounter errors, and in fact this is almost certainly a desirable outcome. For example, I’d like to verify that an expected exception like the violation of referential integrity (i.e. violation of a primary key, foreign key, unique key, CHECK, or NOT NULL constraint) actually arises and is trapped correctly. Also, I’d like to insure that an appropriate and expected exception is raised whenever a crucial business rule is violated. For example, if during payroll check processing an employee’s net salary doesn’t balance to gross pay less all deductions, I’d like that exception to be trapped as an error.
Therefore, any robust capture and playback suite must be capable of monitoring three types of error regression:
- Have all expected errors occurred?
- Next, have any error conditions been raised that were not expected? This obviously indicates that severe error regression is possible because of system or application changes.
- Finally, have any expected errors not occurred? This condition is much more subtle, because it indicates that something sinister has changed within the system or application, and crucial business rules are either being misapplied or not applied at all.
Data Regression. Any robust capture and playback suite also must pay strict attention to any variance in the data itself after the playback has completed. For example, if I’m testing a mission-critical financial system, I must insure that the identical financial transactions have completed in the appropriate order and all accounts total to the same balances in the P+1 environment as they would in the P+0 environment. If I don’t receive identical results, I must assume that something in my application, database, or environment has changed to prevent a perfect playback.
Another crucial feature of the capture and playback suite: It must be able to insure that the captured workload is played back against the P+1 environment only after that environment has been reset to reflect the state of the P+0 environment at the time workload capture was initiated. Otherwise, there’s a chance that “false positive” indications of data regression will be detected that have nothing to do with changes to the application, database, or environment.
The metaphor of a player piano roll especially appeals to me when I imagine how a workload’s individual, dependent transactions need to be captured and then played back: Once the pianist has finished playing a piece of music, the piano roll has not only captured the exact notes performed, but also the exact frequency at which individual keys were struck. In essence, it gave the listener an exact duplicate that encapsulates the maestro’s playing style, including all of her performance’s subtle pauses. (Humorous aside to readers younger than 25: If you’ve never seen a player piano roll, please substitute “MP3” or even “WAV file” … or ask one of your older co-workers to explain how things were “back in the old days.”)
Database Replay: Features Summary
Thankfully, Oracle 11g’s new Database Replay (DBR) suite offers all of the features I’ve just described. DBR permits an Oracle DBA to:
- Capture a workload that’s being generated on a production database system. This includes the ability to capture concurrent execution of the same SQL statements across multiple sessions while simultaneously gathering all dependent transactions.
- “Massage” the captured workload before it’s eventually executed against a representative test system. This allows the DBA to adjust the frequency at which the workload is played back, as well as the capability to remap connections to different user sessions, different services, or – in the case of playback on an Oracle 11g Real Application Clusters (RAC) test system – one or more database instances.
- Replay the captured and “massaged” workload on a test system. The test system is configured to conform to the P+1 configuration so that the DBA can accurately determine exactly how the workload will react to any proposed system changes, including application changes, software patches, and even hardware upgrades. The target for the test system can be a test or QA database environment, or even a snapshot standby database. (More on this last item in later articles.)
- Perform regression analysis to highlight any differences that exist between the P+0 and P+1 simulated workload. DBR will automatically identify and analyze the vectors of any error regression, data regression, or SQL statement regression.
The beauty of Database Replay is that it eliminates the necessity to fashion simulated workloads to perform regression analysis. Instead, the DBA can execute precisely the same SQL statements that have been recorded, so this tends to provide a more accurate picture of system regression because other extraneous factors (e.g. network latency) are reduced or eliminated. Best of all, since all the recorded SQL statements that comprise the workload are played back, there’s virtually no chance that even seemingly insignificant or rarely executed code will be ignored as a possible vector for regression. This can be crucial in determining if an application will scale properly in a Real Application Clusters (RAC) clustered database environment.
The next four sections of this article provide a high level primer on how Database Replay features interact to accomplish their common goal: an accurate determination of just how much regression can be expected when migrating a production system from P+0 to P+1. In the next articles in this series, I’ll demonstrate how to utilize Database Replay to capture, preprocess, replay, and analyze results from a DBR operation.
Phase 1: Recording a Workload
Oracle 11g Enterprise Manager Database Control provides an intuitive interface to Database Replay features that enables capture, massage, replay, and regression analysis of a workload for performance, error, and data regression. The interface also provides excellent feedback on the status of each phase of a Database Replay task set. Figure 1 below shows the initial screen that’s displayed when a Database Replay operation is initiated by selecting the Database Replay option from the Real Application Testing section of EM’s Software and Support tab:
Figure 1: Database Replay Master Panel.
During this phase of a Database Replay operation, the complete workload as seen from the perspective of the production database is captured and recorded. The DBA simply needs to insure that there is a sufficiently “interesting” (i.e. pertinent) workload running on the production system; DBR does the rest by capturing all executing SQL statements issued from all external clients. This includes:
- SQL queries, DML statements, and DDL statements
- PL/SQL blocks and Remote Procedure Calls (RPCs)
- Object Navigation requests and OCI calls
Note that while the DBR capture operation is running, Oracle 11g doesn’t halt any running background jobs; any internal clients continue making requests as well.
DBR records the workload via a series of shadow processes. These shadow processes then filter the necessary information to precisely reproduce the system workload, and finally write this metadata into a series of XML files – one set of files for each process — for processing and playback at a later time. About the only concern an Oracle 11g DBA might have, is whether there is sufficient space on the file system for the numerous XML files that are written during this process.
Phase 2: “Massaging” the Workload
Once a DBR workload has been recorded, it almost certainly will need some slight adjustments before it’s replayed. For example, it may be necessary to remap certain external client connections so they can be replayed accurately in the P+1 environment. During this phase, DBR prepares workload-specific metadata for its eventual replay, and any parameters that might affect the outcome of the replay can be modified at this stage.
This preprocessing must occur on the same database version as the replay, but as long as the database versions match, the scrubbing operation can be performed on a production, test, or other database system. In fact, Oracle highly recommends performing this metadata “massaging“ on a host other than the production server so as not to affect the health or performance of that server.
Phase 3: Replaying the Workload
Now that the workload has been massaged, it’s time to initiate its replay. Once the metadata scrubbing described above is done, designated DBR replay client(s) can replay the workload as often as analyses are required.
Resetting the Test Environment. Before any replays are initiated, however, the Oracle DBA must first “reset” the target database and host environment being used for testing, because it’s vital that the test server matches the production server before any changes are applied; otherwise, unexpected regression may occur (a.k.a. a false positive). Fortunately, this is much simpler to perform with the advent of FLASHBACK DATABASE features in Oracle 10g. Other alternatives include normal point-in-time incomplete recovery via RMAN, or even DataPump Export and Import. Once the testing environment has been reset properly, the Oracle DBA next applies all proposed changes to the production system on the test server so that it is now in state P+1, and then she transfers the previously captured workload to the P+1 server.
Replaying the Workload via the Replay Driver. When it’s finally time to replay the previously captured workload against the P+1 server, a special application called the Replay Driver begins sending requests to the target RDBMS. Since the Replay Driver is client-agnostic, it makes no difference to Oracle 11g as to what type of client was sending the requests initially. The Replay Driver consumes the recorded workload and sends appropriate requests to the P+1 system so that it behaves as if external clients were actually issuing the requests.
Since it will distribute all workload capture streams between all replay clients, the Replay Driver can take into account network bandwidth, CPU, and memory capacity. The Replay Driver can also take advantage of remapped connection strings so that they correlate one-to-one (i.e. single-instance to single-instance) or many-to-one (i.e. single node to several Real Application Cluster nodes), and that means that connection load balancing can be taken into account. Just as important, the Replay driver will ignore any activity that was originally generated by internal clients (e.g. EM Database Control) and simply not replay that activity. It will also ignore any recorded activity that utilized connections to external databases via database links or accessed directory objects.
One other intriguing advantage of using Database Replay: A captured workload can be played back in either synchronous or asynchronous replay mode. In synchronous mode, each transaction is replayed in exactly the order that it was recorded; however, DBR can also replay a workload back in asynchronous mode, i.e. without regard to transaction synchronicity, so that a heavier-than-recorded workload can be generated. This is an especially useful feature when attempting to perform a “test to destruction” of a new or modified database environment.
DBR Workload Replay Scope. Oracle 11gR1 Database Replay can accurately evaluate the following types of changes to a database environment during the workload replay phase:
- Database upgrades
- Database patches
- Changes to database schemas
- Changes to initialization parameters
- Modifications to one or more Real Application Cluster (RAC) nodes as well as their interconnect configurations
- Upgrades to the operating system via OS patches and upgrades
- OS platform modifications, including transitioning from 32-bit to 64-bit environments
- Changes to server memory or CPU configurations
- Modifications to the database’s storage configurations, including migration of database files between a named file system (e.g. EXT3, NTFS), ASM storage, and/or raw storage
DBR Workload Replay Limits. Database Replay’s simulation abilities do have some noteworthy (and justifiable) limitations:
- SQL*Loader direct path loading cannot be replayed; however, conventional path SQL*Loader operations are indeed replayable.
- Import and export operations, whether via traditional or DataPump methods, are not replayable.
- Oracle Shared Server sessions cannot be traced.
- Flashback Database recoveries and Flashback Query operations are not replayable.
- Oracle Streams, including Advanced Queuing (AQ) operations that aren’t PL/SQL based, can’t be replayed.
- Distributed transaction processing, including remote COMMIT operations, is only replayable as local transactions.
- Oracle Call Interface (OCI)-based object navigation isn’t replayable.
For the most part, these limitations do make sense. For example, a Flashback Database operation is essentially an incomplete recovery of the database, so it’s not part of normal transaction processing, and I certainly wouldn’t be concerned about whether its performance has regressed. While the limitation against Shared Server sessions does make sense, there are still a few database shops out there that do utilize Shared Server for connection pooling, so this is a minor disappointment.
Phase 4: Regression Analysis
After the workload replay is completed, Database Replay provides several different analyses of how the replayed workload’s performance varied in the P+1 environment versus its original performance in the P+0 environment. As I discussed at the beginning of this article, any good regression testing suite has the capability to trap and analyze performance regression, data regression, and error regression, and DBR definitely doesn’t disappoint on these features.
For example, DBR can detect immediately any performance differences at a summary level via its set of Capture and Replay reports. From these reports, it’s possible to drill down into detailed analysis of database performance stored within Automatic Database Diagnostic Monitor (ADDM) runs, Automatic Workload Repository (AWR) reports, and Active Session History (ASH) reports.
Whatever the source of the divergence, DBR post-replay analysis can identify and handle two different flavors of divergence:
- Online divergence may indicate that the database replay has badly malfunctioned, and should probably be halted, as the results of the replay are possibly less than meaningful.
- Offline divergence is actually an expected result of a successful database replay operation; this type of divergence is typically detected and measured after the replay is completed.
Enough theory! Any tool this elegant deserves a significant evaluation, so in the next article in this series, I’ll demonstrate how to:
- Capture a simple workload on a single-instance Oracle 11gR1 database
- Massage (i.e. pre-process) that workload for eventual replay
- Replay the massaged workload on a two-node Oracle 11gR1 Real Application Cluster (RAC) clustered database
- Identify possible issues that might arise during transition to a similar target environment
References and Additional Reading
While I’m hopeful that I’ve given you a thorough grounding in the technical aspects of the features I’ve discussed in this article, I’m also sure that there may be better documentation available since it’s been published. I therefore strongly suggest that you take a close look at the corresponding Oracle documentation on these features to obtain crystal-clear understanding before attempting to implement them in a production environment. Please note that I’ve drawn upon the following Oracle Database 11gR1 documentation for the deeper technical details of this article:
B28274-01 Oracle Database 11gR1 Performance Tuning Guide
B28279-02 Oracle Database 11gR1 New Features Guide