Synopsis.
One of the best ways to minimize unexpected issues with any Oracle database is to know exactly what effect changes to existing application code or database objects will have on the production environment.
I spent the weekend between Christmas and New Year’s Day last year converting our production databases from Oracle 8i to Oracle 9i Release 2. I’d done everything to ensure the conversion would be successful, including making several “dry runs” against an almost-identical copy of the production environment’s hardware and software. The database shut down perfectly; the new 9iR2 software was installed without a hitch; and it was time to restart the database in STARTUP MIGRATE mode to complete the conversion.
Except the database wouldn’t start! The server kept insisting that there was insufficient memory to start the instance. Since I had originally experimented with a slightly higher value for the Windows NT AWE_WINDOW_MEMORY setting to get the instance to start during my “dry runs,” I assumed the solution was finding the right combination of AWE_WINDOW_MEMORY, SHARED_POOL_SIZE, and the size of the database buffer cache. I experimented with several values for all three parameters, all without success.
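For context, here is a minimal sketch of how this kind of memory configuration looked on that platform. The parameter names are real, but the values are illustrative only, not recommendations:

    # init.ora sketch for an AWE-enabled 9i instance on Windows (illustrative)
    use_indirect_data_buffers = true   # lets the buffer cache use AWE memory above 4GB
    db_block_buffers = 524288          # with AWE, the cache is sized in blocks, not bytes
    shared_pool_size = 209715200       # 200MB shared pool
    # AWE_WINDOW_MEMORY itself is not an init.ora parameter; on Windows it is
    # a registry value under the Oracle home key (e.g., HKLM\SOFTWARE\ORACLE\HOME0).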
Luckily, my boss was on hand to help me think through the problem. His first question was, “Well, what’s different between the database server you used for the dry runs and the production server?” My immediate answer: “Nothing! We’ve got an exact copy of the database schema; the pool sizes are identical; the database buffer cache is the same size; and we’ve got identical amounts of memory (8GB) on the server. And I swear this worked during yesterday’s test run, Chief, I swear it!”
After two hours of fruitless attempts, I took a coffee break. As I slowly walked back to my desk, replaying the issues in my head, it hit me. My boss looked up and saw the sheepish grin on my face. “I believe I’ve figured it out,” I said. “There is just one difference between our production and test environments: the number of CPUs!” The production server had 8 CPUs, while the test server had 4. So I doubled the size of AWE_WINDOW_MEMORY, lowered the size of the database buffer cache to accommodate it, and the instance started. The conversion was completed only a few hours late, but well within our expected maintenance window.
It turns out that there’s a hidden parameter (_DB_BLOCK_LRU_LATCHES) that under Oracle 9iR2 is computed automatically by the database, and its default value scales with the number of CPUs. This parameter determines the minimum size of the “working set” memory required for the instance, which is why the 8-CPU production server demanded a larger AWE_WINDOW_MEMORY than the 4-CPU test server ever did. A few days later I had a complete explanation from Oracle Technical Support on how to tune this parameter so that I could resize the database buffer cache, and now everything’s copacetic.
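Incidentally, hidden parameters like this one can be inspected with a well-known query against the X$ structures. Nothing here is specific to my incident, but note that it must be run as SYS, since ordinary users cannot see the X$ views:

    -- Show the current value of the hidden LRU latch parameter (run as SYS).
    SELECT a.ksppinm  AS parameter,
           b.ksppstvl AS current_value
      FROM x$ksppi a, x$ksppcv b
     WHERE a.indx = b.indx
       AND a.ksppinm = '_db_block_lru_latches';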
However, the proper setting for AWE_WINDOW_MEMORY isn’t the point here. This incident reminded me of a much more important DBA “fact of life”: It’s impossible to predict with certainty how any application software will perform in production unless the environment used to evaluate that software matches production identically. If only I’d had the same number of CPUs in my quality assurance server, I seriously doubt that I would have missed the problem.
What would the perfect environment for evaluating software changes before they are run against the production Oracle database look like? Here are my suggestions based on my experience as a DBA and application developer over the past 20 years:
Development database(s)
How do developers test their changes, and how should DBAs experiment with new features, without affecting production? I have found that setting up at least one “sandbox” database for development works extremely well. It is helpful if the data stored in this “sandbox” is just large enough for evaluating application features and database changes, but not so large that it takes forever to be refreshed from production data sources if necessary.
Note that this “sandbox” doesn’t contain a complete copy of the production database; it may be as small as a 1% sample. And I’ve made sure my developers know that since this database is used only for software evaluation, it is not backed up for recovery in case someone deletes someone else’s data or objects – it’s their sandbox, and it’s their responsibility to keep copies of data they need or database objects they’re going to migrate to the next stage of development.
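As a sketch of how small that sample can be, Oracle’s SAMPLE clause makes the extract a one-liner; the sandbox and prod schema names below are hypothetical:

    -- Copy roughly 1% of a production table into the sandbox schema.
    CREATE TABLE sandbox.orders AS
      SELECT * FROM prod.orders SAMPLE (1);

One caveat worth remembering: SAMPLE picks rows from each table independently, so related parent and child tables usually need to be extracted by key instead if referential integrity matters to the application being tested.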
Regression testing (RT) database
How do you ensure that upcoming application software changes don’t negatively affect existing production processes? One of the best ways is to test the latest version of the application against a representative sample of test cases. And that’s where a regression testing (RT) database is invaluable. Ideally, the RT database would be relatively small, since only the necessary mission-critical test cases would be stored here.
And yes, someone – ideally, the QA department of your IT team, if you have one – has to keep the regression testing data up to date to match the mission-critical test cases. Finally, mechanisms to reset the RT data for the next run of regression tests have to be established. (I’ve found that restoring a “cold” backup of the database containing the most recent set of regression testing data is the easiest way to “reset.”)
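Here is a rough sketch of that reset cycle; the file-copy step belongs to whatever OS scripting your site uses, and the details will vary:

    -- Reset the RT database to its cold-backup baseline (sketch only).
    SHUTDOWN IMMEDIATE
    -- At the OS level, copy the saved datafiles, control files, and online
    -- redo logs from the cold backup back over the current files.
    STARTUP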
Quality assurance (QA) database and server
This is where testing is performed against all changes to application software and database objects; no code should be moved to the production environment until it’s been tested against the QA database and server. This environment should represent “Production+1”: in other words, it should reflect exactly how the database will be configured when the next release of the application rolls out, or when the next Oracle database patch is applied. This makes it possible to isolate the root causes of software and hardware problems before they reach production.
Ideally, this database should contain as much of the production database’s data as feasible, and the QA server’s hardware should match the production database server’s as closely as possible. There’s an added benefit, especially for large queries and long-running database operations: because the optimizer sees essentially the same data on essentially the same hardware, it generates virtually identical query plans, so accurate timings can be measured in advance.
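For example, SQL*Plus by itself is enough to capture both the query plan and the elapsed time of a long-running statement on the QA server for comparison against production expectations. The table name below is hypothetical, and AUTOTRACE assumes the PLUSTRACE role and a PLAN_TABLE are in place:

    -- Capture the plan and execution statistics without displaying the rows.
    SET TIMING ON
    SET AUTOTRACE TRACEONLY EXPLAIN STATISTICS
    SELECT COUNT(*) FROM app.sales_history;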
Of course, in the real world, business constraints do exist. It may not be cost-effective to maintain a quality assurance server that mimics the production server, especially if the required disk space or server capacity becomes cost-prohibitive. Configuration management is not trivial either: the proper version of each database object must be promoted from development to the RT and QA databases. On the other hand, the costs of setting up and maintaining this environment can often be justified easily by weighing them against the cost of an application software failure or, in the worst case, a database crash requiring complete disaster recovery of the database and/or server.
Jim Czuprynski is an Oracle DBA for a telecommunications company in Schaumburg, IL. He can be contacted at jczuprynski@zerodefectcomputing.com.