Don't Just Do Something, Stand There! Avoiding Junior DBA Mistakes - Page 2
April 15, 2004
Why Aren't The Shared Servers Sharing?
My final example deals with what at first appeared to be a catastrophic database failure. Like most American companies, our transaction processing volumes vary throughout the month. Our peak usually occurs during the two days before and after the end of the month when our Accounting department completes its billing cycles for most customers.
Late in the afternoon in the midst of a recent month-end billing cycle, our production database's alert log began to spew out messages relating to failed shared servers, and response time for both OLTP and decision support applications began to degrade seriously. My immediate reaction was to check the hardware in the server room. I found no unexpected problems - all storage media was online, CPU utilization was moderately elevated, and memory utilization was relatively normal.
However, that did not explain the slowdown, so I re-examined the alert and user dump logs. I found that in only five minutes, there had been two deadlocks, both caused by the same pair of users. When we checked these two users' workstations, we found a fatal error message on one of their terminals - one of the users had left it on his screen and simply gone on break without informing anyone of the problem. It turned out that the application he was using had a fatal flaw that had gone unnoticed until then: It was allowing him to post payments for invoices that were still being generated by his teammate.
The end result of the two deadlocks? Oracle brought online several dozen UNDO segments to handle the rollback of the massive transactions that had failed. Since we use automatic UNDO management, Oracle handled this without DBA intervention. However, it took some time to bring all those UNDO segments online, and Oracle diverted almost all of the server's processing power for approximately two minutes to handle this.
So, what caused the shared server failures? We ascertained that our Sales department had just finished a major sales contest, and several dozen salespeople were rabidly attempting to run reports on their month-end sales figures. Unfortunately, I found out later that most of these reports had been written against "live" database tables - the same tables that were affected by the aforementioned deadlock. In addition, the reporting application that the salespeople were running used shared server connections to run these reports.
A final piece of the puzzle fell into place the next day. It turned out a large number of reports had been migrated from our order entry application (which uses dedicated database connections) to their rightful place in our decision-support application (which uses shared server connections), and that migration had happened just before the monthly billing period had commenced. We found out that many of the migrated reports had not been fully tested in shared server mode and contained horrendous query plans that used large hash joins to return result sets. Since Oracle Shared Server manages UGA memory differently than for dedicated connections and usually allocates relatively small memory areas for sorts and hash joins, the reports were taking an enormous time to complete.
In the midst of all this turmoil, one of my colleagues kept insisting that we should just shut down the database server to clear up all these issues and pick up the pieces later. I was able to convince him that while yes, response time is slower, and yes, some of our internal clients are affected, what we really needed to do was to monitor, measure, and selectively investigate for the root cause. Moreover, had I not done so, I would have never gotten the data I needed to forestall the next failure. I am happy to report the end result: We corrected the application logic error that caused the deadlocks, moved some of the reporting application users to dedicated connections instead of shared, and sharpened our measuring tools.
I have woven these tales of DBA combat not to frighten or impress anyone, but merely to illustrate what I have learned during my moments of sheer terror. Here is a quick summation:
Know your hardware's capabilities as intimately as possible. In my second example, I would have saved myself an enormous amount of anxiety if I had understood exactly how my hardware worked. If your shop allows it, I strongly recommend getting an understanding of the hardware and its capabilities. Since media failures are the most vexing of any data recovery effort, I tend to concentrate on the media storage side of the equation, but understanding all aspects are just as important to the survivability, recoverability, and restartability of the database server.
Review and test your disaster recovery plan. As often as possible, review your disaster recovery plan, especially when you have either just experienced a combat moment or whenever there have been significant changes to your database - structural, size, and so forth. (You don't have a disaster recovery plan? No problem. See my previous article on Becoming A Master Of Disaster for what you need to know about disaster recovery plans.) And don't forget that a disaster recovery plan that is untested is just a checklist.
Use "failures" to sharpen your failure detection tools - and trust those tools. One of my close friends is a private pilot, and he has been in some tough situations while flying - bad weather, engine problems, and even getting lost late at night. When I asked him what the most important thing to remember during situations, he smiled and said: "Trust your instruments." No matter what tool set you use to monitor your databases - home-grown SQL queries, Oracle Enterprise Manager, or even those expensive third-party packages - be sure that you understand what those "instruments" are telling you, especially during a system slowdown or an impending failure. Above all, resist the urge to pull back on the stick, even if you hear a passenger screaming - the airplane may not be heading for the ground after all!