Why Aren't The Shared Servers Sharing?
My final example deals with what at first appeared to be
a catastrophic database failure. Like most American companies, our transaction
processing volumes vary throughout the month. Our peak usually occurs during
the two days before and after the end of the month when our Accounting
department completes its billing cycles for most customers.
Late in the afternoon in the midst of a recent month-end
billing cycle, our production database's alert log began to spew out messages
relating to failed shared servers, and response time for both OLTP and decision
support applications began to degrade seriously. My immediate reaction was to
check the hardware in the server room. I found no unexpected problems - all
storage media was online, CPU utilization was moderately elevated, and memory
utilization was relatively normal.
However, that did not explain the slowdown, so I
re-examined the alert and user dump logs. I found that in only five minutes,
there had been two deadlocks, both caused by the same pair of
users. When we checked these two users' workstations, we found a fatal error
message on one of their terminals - one of the users had left it on his screen
and simply gone on break without informing anyone of the problem. It turned out
that the application he was using had a fatal flaw that had gone unnoticed
until then: It was allowing him to post payments for invoices that were still
being generated by his teammate.
The end result of the two deadlocks? Oracle brought
online several dozen UNDO segments to handle the rollback of the massive
transactions that had failed. Since we use automatic UNDO management, Oracle
handled this without DBA intervention. However, it took some time to bring all
those UNDO segments online, and Oracle diverted almost all of the
server's processing power for approximately two minutes to handle this.
So, what caused the shared server failures? We
ascertained that our Sales department had just finished a major sales contest,
and several dozen salespeople were rabidly attempting to run reports on their
month-end sales figures. Unfortunately, I found out later that most of these
reports had been written against "live" database tables - the same tables that
were affected by the aforementioned deadlock. In addition, the reporting
application that the salespeople were running used shared server connections to
run these reports.
A final piece of the puzzle fell into place the next day.
It turned out a large number of reports had been migrated from our order entry
application (which uses dedicated database connections) to their rightful place
in our decision-support application (which uses shared server connections), and
that migration had happened just before the monthly billing period had
commenced. We found out that many of the migrated reports had not been fully
tested in shared server mode and contained horrendous query plans that used
large hash joins to return result sets. Since Oracle Shared Server manages UGA
memory differently than for dedicated connections and usually allocates
relatively small memory areas for sorts and hash joins, the reports were taking
an enormous time to complete.
In the midst of all this turmoil, one of my colleagues
kept insisting that we should just shut down the database server to clear up
all these issues and pick up the pieces later. I was able to convince him that
while yes, response time is slower, and yes, some of our internal clients are
affected, what we really needed to do was to monitor, measure, and
selectively investigate for the root cause. Moreover, had I not done so, I
would have never gotten the data I needed to forestall the next failure. I am
happy to report the end result: We corrected the application logic error that
caused the deadlocks, moved some of the reporting application users to
dedicated connections instead of shared, and sharpened our measuring tools.
Lessons Learned
I have woven these tales of DBA combat not to frighten or
impress anyone, but merely to illustrate what I have learned during my moments
of sheer terror. Here is a quick summation:
Know your hardware's capabilities as intimately as
possible. In my second example, I would have saved myself an enormous
amount of anxiety if I had understood exactly how my hardware worked. If your
shop allows it, I strongly recommend getting an understanding of the hardware
and its capabilities. Since media failures are the most vexing of any data
recovery effort, I tend to concentrate on the media storage side of the
equation, but understanding all aspects are just as important to the
survivability, recoverability, and restartability of the database server.
Review and test your disaster recovery plan. As
often as possible, review your disaster recovery plan, especially when you have
either just experienced a combat moment or whenever there have been significant
changes to your database - structural, size, and so forth. (You don't have a
disaster recovery plan? No problem. See my previous article on Becoming
A Master Of Disaster for what you need to know about disaster recovery
plans.) And don't forget that a disaster recovery plan that is untested is just
a checklist.
Use "failures" to sharpen your failure detection tools
- and trust those tools. One of my close friends is a private pilot,
and he has been in some tough situations while flying - bad weather, engine
problems, and even getting lost late at night. When I asked him what the most
important thing to remember during situations, he smiled and said: "Trust
your instruments." No matter what tool set you use to monitor your
databases - home-grown SQL queries, Oracle Enterprise Manager, or even those
expensive third-party packages - be sure that you understand what those
"instruments" are telling you, especially during a system slowdown or an
impending failure. Above all, resist the urge to pull back on the stick, even
if you hear a passenger screaming - the airplane may not be heading for the
ground after all!
»
See All Articles by Columnist Jim Czuprynski