Expert One-on-One: Oracle: Pt. 2

I was working on a project recently where they decided to use
only the latest, greatest technologies: everything was coded in
Java with EJBs. The client application would talk to the database
server using beans — no Net8. They would not be passing SQL
back and forth between client and server, just EJB calls using
Remote Method Invocation (RMI) over Internet Inter-Orb Protocol
(IIOP).

If you are interested in the details of RMI over
IIOP you can refer to


http://java.sun.com/products/rmi-iiop/
.

This is a perfectly valid approach. This functionality works and
can be extremely scalable. The people responsible for the
architecture understood Java, EJBs, the protocols involved
— all of that stuff. They felt they were in a strong
position to successfully build such a project. When their
application would not scale beyond a couple of users they decided
that the database was at fault and severely doubted Oracle’s
claim to be the ‘most scaleable database ever’.

The problem was not the database but a lack of knowledge of how
the database worked — a lack of knowledge that meant that
certain key decisions were made at design time that doomed this
particular application to failure. In order to deploy EJBs in the
database Oracle must be configured to run in MTS mode rather than
dedicated server mode. What the team did not understand,
crucially, was how using MTS with EJBs in the database would
affect them. Without this understanding, and without a general
knowledge of how Oracle worked, two key decisions were made:

  • We will run some stored procedures that take 45 seconds or
    longer (much longer at times) in our beans.
  • We will not support the use of bind variables. All of our
    queries will hard code the constant values in the predicate. All
    inputs to stored procedures will use strings. This is ‘easier’
    than coding bind variables.

These two seemingly minor decisions guaranteed that the project
would fail — utterly guaranteed it. They made it so that a
highly scalable database would literally fall over and fail with
a very small user load. A lack of knowledge of how the database
worked more then overwhelmed their intricate knowledge of Java
beans and distributed processing. If they had taken time to learn
a bit more about the way Oracle worked, and consequently followed
the following two simple guidelines, then their project would
have had a much better chance of success the first time out.

Do not run Long Transactions Under MTS

The decision to run 45+ second transactions under MTS betrayed a
lack of understanding of what MTS was designed to do and how it
works in Oracle. Briefly, MTS works by having a shared pool of
server processes that service a larger pool of end users. It is
very much like connection pooling — since process creation
and management are some of the most expensive operations you can
ask an operating system to perform, MTS is very beneficial in a
large-scale system. So, I might have 100 users but only five or
ten shared servers.

When a shared server gets a request to run an update, or execute
a stored procedure, then that shared server is dedicated to that
task until completion. No one else will use that shared server
until that update completes or that stored procedure finishes
execution. Thus, when using MTS your goal must be to have very
short statements. MTS is designed to scale up On-Line Transaction
Processing (OLTP) systems — a system characterized by
statements that execute with sub-second response times. You’ll
have a single row update, insert a couple of line items, and
query records by primary key. You won’t (or shouldn’t) run a
batch process that takes many seconds or minutes to complete.

If all of our statements execute very rapidly, then MTS works
well. We can effectively share a number of processes amongst a
larger community of users. If, on the other hand, we have
sessions that monopolize a shared server for extended periods of
time then we will see apparent database ‘hangs’. Say we
configured ten shared servers for 100 people. If, at some point,
ten people simultaneously execute the process that takes 45
seconds or longer then every other transaction (including new
connections) will have to wait. If some of the queued sessions
want to run that same long process, then we have a big problem
— the apparent ‘hang’ won’t last 45 seconds, it will appear
to last much longer for most people. Even if we only have a few
people wanting to execute this process simultaneously rather than
ten, we will still observe what appears to be a large degradation
in performance from the server. We are taking away, for an
extended period of time, a shared resource and this is not a good
thing. Instead of having ten shared servers processing quick
requests on a queue, we now have five or six (or less).
Eventually the system will be running at some fraction of its
capability, solely due to this resource being consumed.

The ‘quick and dirty’ solution was to start up more shared
servers, but the logical conclusion to this is that you need a
shared server per user and this is not a reasonable conclusion
for a system with thousands of users (as this system was). Not
only would that introduce bottlenecks into the system itself (the
more servers you have to manage — the more processing time
spent managing), but also it is simply not the way MTS was
designed to work.

The real solution to this problem was simple: do not execute long
running transactions under MTS. Implementing this solution was
not. There was more then one way to implement this and they all
required fundamental architectural changes. The most appropriate
way to fix this issue, requiring the least amount of change, was
to use Advanced Queues (AQ).

AQ is a message-oriented middleware hosted in the
Oracle database. It provides the ability for a client session to
enqueue a message into a database queue table. This message is
later, typically immediately after committing, ‘dequeued’ by
another session and the content of the message is inspected. This
message contains information for the other session to process. It
can be used to give the appearance of lightening fast response
times by decoupling the long running process from the interactive
client.

So, rather than execute a 45-second process, the bean would place
the request, along with all its inputs, on a queue and execute it
in a loosely coupled (asynchronous) rather than tightly coupled
(synchronous) fashion. In this way, the end user would not have
to wait 45 seconds for a response — the system would
apparently be much more responsive.

While this approach sounds easy — just drop in ‘AQ’ and the
problem is fixed — there was more to it than that. This 45-
second process generated a transaction ID that was required by
the next step in the interface in order to join to other tables
— as designed, the interface would not work without it. By
implementing AQ, we were not waiting for this transaction ID to
be generated here — we were just asking the system to do it
for us at some point. So, the application was stuck. On the one
hand, we could not wait 45 seconds for the process to complete,
but on the other hand, we needed the generated ID in order to
proceed to the next screen and we could only get that after
waiting 45 seconds. To solve this problem, what we had to do was
to synthesize a pseudo-transaction ID, modify the long running
process to accept this generated pseudo ID and have it update a
table when it was done, by which mechanism the real transaction
ID was associated with the pseudo id. That is, instead of the
transaction ID being an output of the long running process, it
would be an input to it. Further, all ‘downstream’ tables would
have to use this pseudo-transaction ID — not the real one
(since the real one would not be generated for a while). We also
had to review the usage of this transaction ID in order to see
what impact this change might have on other modules and so on.

Another consideration was the fact that, while we were running
synchronously, if the 45-second process failed then the end user
was alerted right away. They would fix the error condition (fix
the inputs, typically) and resubmit the request. Now, with the
transaction being processed asynchronously under AQ, we don’t
have that ability. New functionality had to be added in order to
support this delayed response. Specifically, we needed some
workflow mechanism to route any failed transaction to the
appropriate person.

The upshot of all this is that we had to undertake major changes
in the database structure. New software had to be added (AQ). New
processes had to be developed (workflows and such). On the plus
side, the removal of 45 seconds of lag time from an interactive
process not only solved the MTS architecture issue, it enhanced
the user experience — it meant that the end user got much
faster ‘apparent’ response times. On the down side, all of this
delayed the project considerably because none of it was detected
until immediately before deployment, during scalability testing.
It is just too bad that it was not designed the right way from
the beginning. With knowledge of how MTS worked physically, it
would have been clear that the original design would not scale
very well.

Latest Articles