Oracle Database 11g: Database Replay, Part 1
April 30, 2008
Synopsis. Oracle DBAs have long wished to be able to capture applications workloads in a production environment and then replay that workload against a test environment to determine the impact of proposed database or application changes on database performance. Oracle Database 11gR1s new Database Replay feature enables an Oracle DBA to capture, massage, and then replay workloads either selectively or in their entirety across a wide range of database environments and platforms. This article provides a primer for using Oracle 11g Database Replay to effectively predict the performance of rapidly changing applications within increasingly fluid database environments.
If my time served in information technology has taught me anything, its the utter truth behind Murphys Law: Anything that can go wrong will go wrong. Over the years, Ive also learned the veracity of several corollaries to Murphys Law, including the fact that interchangeable parts wont and especially the rejoinder that Murphy was actually an optimist. I hope to add my own corollary to these grim testimonials one day with the following observation obtained at great pain: Nothing works in production like it does in the test environment.
All attempts at ill humor aside, Oracle DBAs face a severe challenge on a regular basis: to predict accurately how the next set of changes to the databases application code, database patch set, or hardware configuration will affect negatively the entire database environment. When I say the entire environment, I do mean that literally: every SQL statement that any running application has executed, regardless if its just a simple query or a complex batch job that issues a huge volume of DML statements, must be captured. This challenge has become even more acute because todays typical application workloads are generated across a landscape that spans myriad technologies: n-tier application servers, web farms, and even client-server legacy applications. Moreover, when an application appears to be performing poorly, the root cause of a detected performance issue is often nearly impossible to track down; it may be the indirect result of an incorrect network configuration, the application servers configuration, or even the application clients environment variable settings.
The only chance to accomplish this lofty yet crucial goal is via an application workload capture-and-playback product suite. This type of application is specifically designed to capture the complete workload thats been executed against the current production database environment (P+0) and then play back precisely that same workload against the next iteration of the production environment (P+1). However, my past experience has shown this means asking upper management to open the company checkbook to purchase a relatively expensive third-party solution (e.g. HPs LoadRunner tool). The cost of licensing this software, the cost of procuring and configuring the servers it runs upon, and most especially the manpower expense to configure and prove out the capture-and-playback suite even before workloads can be captured and replayed may easily run into the six-figure USD range. And thats why many IT organizations have surrendered to the perception that its simply impossible to test for application system regression in a cost-effective manner.
Beyond Performance Regression
My prior article series about Oracle Database 11gs new SQL statement performance tuning features SQL Performance Analyzer (SPA) and SQL Plan Management (SPM) discussed how Oracle 11g allows the DBA to easily isolate SQL statements whose performance has either progressed, remained unchanged, or regressed after the application environment has been changed. Any robust capture and playback suite must certainly capture and compare the differences between the original (P+0) versus the future (P+1) system, application, and database performance statistics, especially SQL statements that are now performing dramatically worse. However, there are two other types of regression that any robust capture-and-playback suite needs to address as well:
Error Regression. When a captured workload is being played back, its not unusual to encounter errors, and in fact this is almost certainly a desirable outcome. For example, Id like to verify that an expected exception like the violation of referential integrity (i.e. violation of a primary key, foreign key, unique key, CHECK, or NOT NULL constraint) actually arises and is trapped correctly. Also, Id like to insure that an appropriate and expected exception is raised whenever a crucial business rule is violated. For example, if during payroll check processing an employees net salary doesnt balance to gross pay less all deductions, Id like that exception to be trapped as an error.
Therefore, any robust capture and playback suite must be capable of monitoring three types of error regression:
Data Regression. Any robust capture and playback suite also must pay strict attention to any variance in the data itself after the playback has completed. For example, if Im testing a mission-critical financial system, I must insure that the identical financial transactions have completed in the appropriate order and all accounts total to the same balances in the P+1 environment as they would in the P+0 environment. If I dont receive identical results, I must assume that something in my application, database, or environment has changed to prevent a perfect playback.
Another crucial feature of the capture and playback suite: It must be able to insure that the captured workload is played back against the P+1 environment only after that environment has been reset to reflect the state of the P+0 environment at the time workload capture was initiated. Otherwise, theres a chance that false positive indications of data regression will be detected that have nothing to do with changes to the application, database, or environment.
The metaphor of a player piano roll especially appeals to me when I imagine how a workloads individual, dependent transactions need to be captured and then played back: Once the pianist has finished playing a piece of music, the piano roll has not only captured the exact notes performed, but also the exact frequency at which individual keys were struck. In essence, it gave the listener an exact duplicate that encapsulates the maestros playing style, including all of her performances subtle pauses. (Humorous aside to readers younger than 25: If youve never seen a player piano roll, please substitute MP3 or even WAV file or ask one of your older co-workers to explain how things were back in the old days.)
Database Replay: Features Summary
Thankfully, Oracle 11gs new Database Replay (DBR) suite offers all of the features Ive just described. DBR permits an Oracle DBA to:
The beauty of Database Replay is that it eliminates the necessity to fashion simulated workloads to perform regression analysis. Instead, the DBA can execute precisely the same SQL statements that have been recorded, so this tends to provide a more accurate picture of system regression because other extraneous factors (e.g. network latency) are reduced or eliminated. Best of all, since all the recorded SQL statements that comprise the workload are played back, theres virtually no chance that even seemingly insignificant or rarely executed code will be ignored as a possible vector for regression. This can be crucial in determining if an application will scale properly in a Real Application Clusters (RAC) clustered database environment.
The next four sections of this article provide a high level primer on how Database Replay features interact to accomplish their common goal: an accurate determination of just how much regression can be expected when migrating a production system from P+0 to P+1. In the next articles in this series, Ill demonstrate how to utilize Database Replay to capture, preprocess, replay, and analyze results from a DBR operation.
Phase 1: Recording a Workload
Oracle 11g Enterprise Manager Database Control provides an intuitive interface to Database Replay features that enables capture, massage, replay, and regression analysis of a workload for performance, error, and data regression. The interface also provides excellent feedback on the status of each phase of a Database Replay task set. Figure 1 below shows the initial screen thats displayed when a Database Replay operation is initiated by selecting the Database Replay option from the Real Application Testing section of EMs Software and Support tab:
Figure 1: Database Replay Master Panel.
During this phase of a Database Replay operation, the complete workload as seen from the perspective of the production database is captured and recorded. The DBA simply needs to insure that there is a sufficiently interesting (i.e. pertinent) workload running on the production system; DBR does the rest by capturing all executing SQL statements issued from all external clients. This includes:
Note that while the DBR capture operation is running, Oracle 11g doesnt halt any running background jobs; any internal clients continue making requests as well.
DBR records the workload via a series of shadow processes. These shadow processes then filter the necessary information to precisely reproduce the system workload, and finally write this metadata into a series of XML files one set of files for each process -- for processing and playback at a later time. About the only concern an Oracle 11g DBA might have, is whether there is sufficient space on the file system for the numerous XML files that are written during this process.
Phase 2: Massaging the Workload
Once a DBR workload has been recorded, it almost certainly will need some slight adjustments before its replayed. For example, it may be necessary to remap certain external client connections so they can be replayed accurately in the P+1 environment. During this phase, DBR prepares workload-specific metadata for its eventual replay, and any parameters that might affect the outcome of the replay can be modified at this stage.
This preprocessing must occur on the same database version as the replay, but as long as the database versions match, the scrubbing operation can be performed on a production, test, or other database system. In fact, Oracle highly recommends performing this metadata massaging on a host other than the production server so as not to affect the health or performance of that server.
Phase 3: Replaying the Workload
Now that the workload has been massaged, its time to initiate its replay. Once the metadata scrubbing described above is done, designated DBR replay client(s) can replay the workload as often as analyses are required.
Resetting the Test Environment. Before any replays are initiated, however, the Oracle DBA must first reset the target database and host environment being used for testing, because its vital that the test server matches the production server before any changes are applied; otherwise, unexpected regression may occur (a.k.a. a false positive). Fortunately, this is much simpler to perform with the advent of FLASHBACK DATABASE features in Oracle 10g. Other alternatives include normal point-in-time incomplete recovery via RMAN, or even DataPump Export and Import. Once the testing environment has been reset properly, the Oracle DBA next applies all proposed changes to the production system on the test server so that it is now in state P+1, and then she transfers the previously captured workload to the P+1 server.
Replaying the Workload via the Replay Driver. When its finally time to replay the previously captured workload against the P+1 server, a special application called the Replay Driver begins sending requests to the target RDBMS. Since the Replay Driver is client-agnostic, it makes no difference to Oracle 11g as to what type of client was sending the requests initially. The Replay Driver consumes the recorded workload and sends appropriate requests to the P+1 system so that it behaves as if external clients were actually issuing the requests.
Since it will distribute all workload capture streams between all replay clients, the Replay Driver can take into account network bandwidth, CPU, and memory capacity. The Replay Driver can also take advantage of remapped connection strings so that they correlate one-to-one (i.e. single-instance to single-instance) or many-to-one (i.e. single node to several Real Application Cluster nodes), and that means that connection load balancing can be taken into account. Just as important, the Replay driver will ignore any activity that was originally generated by internal clients (e.g. EM Database Control) and simply not replay that activity. It will also ignore any recorded activity that utilized connections to external databases via database links or accessed directory objects.
One other intriguing advantage of using Database Replay: A captured workload can be played back in either synchronous or asynchronous replay mode. In synchronous mode, each transaction is replayed in exactly the order that it was recorded; however, DBR can also replay a workload back in asynchronous mode, i.e. without regard to transaction synchronicity, so that a heavier-than-recorded workload can be generated. This is an especially useful feature when attempting to perform a test to destruction of a new or modified database environment.
DBR Workload Replay Scope. Oracle 11gR1 Database Replay can accurately evaluate the following types of changes to a database environment during the workload replay phase:
DBR Workload Replay Limits. Database Replays simulation abilities do have some noteworthy (and justifiable) limitations:
For the most part, these limitations do make sense. For example, a Flashback Database operation is essentially an incomplete recovery of the database, so its not part of normal transaction processing, and I certainly wouldnt be concerned about whether its performance has regressed. While the limitation against Shared Server sessions does make sense, there are still a few database shops out there that do utilize Shared Server for connection pooling, so this is a minor disappointment.
Phase 4: Regression Analysis
After the workload replay is completed, Database Replay provides several different analyses of how the replayed workloads performance varied in the P+1 environment versus its original performance in the P+0 environment. As I discussed at the beginning of this article, any good regression testing suite has the capability to trap and analyze performance regression, data regression, and error regression, and DBR definitely doesnt disappoint on these features.
For example, DBR can detect immediately any performance differences at a summary level via its set of Capture and Replay reports. From these reports, its possible to drill down into detailed analysis of database performance stored within Automatic Database Diagnostic Monitor (ADDM) runs, Automatic Workload Repository (AWR) reports, and Active Session History (ASH) reports.
Whatever the source of the divergence, DBR post-replay analysis can identify and handle two different flavors of divergence:
Enough theory! Any tool this elegant deserves a significant evaluation, so in the next article in this series, Ill demonstrate how to:
References and Additional Reading
While Im hopeful that Ive given you a thorough grounding in the technical aspects of the features Ive discussed in this article, Im also sure that there may be better documentation available since its been published. I therefore strongly suggest that you take a close look at the corresponding Oracle documentation on these features to obtain crystal-clear understanding before attempting to implement them in a production environment. Please note that Ive drawn upon the following Oracle Database 11gR1 documentation for the deeper technical details of this article:
B28274-01 Oracle Database 11gR1 Performance Tuning Guide
B28279-02 Oracle Database 11gR1 New Features Guide