Typically, an article like this one would start out with the technical word
"scaling". Unfortunately, as with health care reform, not everyone
agrees on what the word means, or even what the goal is. So I deliberately chose not to use it, opting instead for non-technical words we can all agree on. When our database is slowing down, we
want it to be faster, stronger, bigger, and better!
With that in mind, I'm going to discuss some of the various ways to get there, and
hopefully put some of the technology options in perspective. This will
help you survey the landscape and plan for your future needs. The first
part of the article discusses query tuning and hardware changes, while the
second part discusses adding additional servers, and the application changes
needed to make that work.
I have been consulting independently for over twelve years now. Over that time,
the most common trouble spot has been errant SQL. SQL is the language you
use to communicate with the database, but it is also the frame in which you
lay out your problem and feed it to a computer for solving.
Simplify the SQL
Often, in the rush to get something out the door, we write SQL that asks for
more than it needs. Perhaps we need two rows, but we ask for the whole
table. During development, the table had only ten rows, so the code was
very fast. Once it's rolled out to production, we find it getting slower
and slower. It turns out this little bit of code is on the main page of the
website, and every new visitor to the site hits it. Now the table has ten
million rows, and it is indeed a showstopper. We fire up our monitoring
tools and identify this query as a very slow, repeat offender. By adjusting
the SQL to read just the two rows our application wants, we still deliver
what the application needs, but tax the database one five-millionth as
much. That's a big savings: the database now has time to do all
the other things its users are asking of it, and still has time to rest.
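As a rough sketch of this fix, the example below uses SQLite to stand in for the database; the table name and row counts are made up for illustration. The point is the same on any platform: let the database return only the rows the page needs, rather than filtering in the application.

```python
import sqlite3

# Hypothetical stand-in for the slow front-page query, using SQLite for brevity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, headline TEXT)")
conn.executemany("INSERT INTO news (headline) VALUES (?)",
                 [("story %d" % i,) for i in range(10000)])

# Wasteful: pull the whole table across, then keep two rows in the application.
all_rows = conn.execute("SELECT id, headline FROM news ORDER BY id DESC").fetchall()
latest_two_wasteful = all_rows[:2]

# Better: ask the database for exactly the two rows the page needs.
latest_two = conn.execute(
    "SELECT id, headline FROM news ORDER BY id DESC LIMIT 2").fetchall()

# Same answer for the application, a tiny fraction of the work for the database.
assert latest_two == latest_two_wasteful
```

On a ten-row development table the difference is invisible; on a ten-million-row production table, the LIMIT version is the difference between a showstopper and a non-event.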
Add indexes or adjust tables
Imagine trying to find 230
Fifth Avenue in Manhattan, but with the addresses in random
order. You'd have to start at the beginning of Fifth Avenue and check every single building
to see whether it was the address you were looking for. This would take hours or
days *and* be a lot of work besides. You'd be tired out! That's
what happens to our database if we don't index our data correctly. The
query will instruct the database to find a record (like our address above) but won't
have an easy path to find it. Identifying these queries and fixing them
can speed them up dramatically, freeing the database to do lots of other work.
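The Fifth Avenue analogy can be made concrete with a small sketch; again SQLite stands in for the database, and the table and index names are invented. The query plan shows the switch from a full scan to an index lookup.

```python
import sqlite3

# Toy version of the Fifth Avenue analogy; table and index names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (street TEXT, number INTEGER, resident TEXT)")
conn.executemany("INSERT INTO addresses VALUES (?, ?, ?)",
                 [("Fifth Avenue", n, "resident %d" % n) for n in range(1, 1001)])

query = "SELECT resident FROM addresses WHERE street = ? AND number = ?"

# Without an index the database walks the whole street: a full table scan.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query,
                           ("Fifth Avenue", 230)).fetchall()

# With an index on (street, number) it can go straight to the building.
conn.execute("CREATE INDEX idx_street_number ON addresses (street, number)")
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query,
                          ("Fifth Avenue", 230)).fetchall()

print(plan_before[0][-1])   # a SCAN of the table
print(plan_after[0][-1])    # a SEARCH using idx_street_number
```

MySQL's `EXPLAIN` serves the same purpose: run it on your slow queries and look for full table scans that an index could eliminate.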
Rewrite the SQL
Query tuning in this case involves looking at the query plan, i.e., how the
database will find your data. If there is sorting, can we remove
it? Can we join two tables in a different order to make the query
faster? Can we rewrite IN clauses as EXISTS, since they often perform
better? These types of rewrites are specific to your database platform
and version, so if you've just upgraded your database, your queries may perform
differently (faster or slower).
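The IN-to-EXISTS rewrite looks like the sketch below; SQLite stands in for the database and the tables are invented, but the two forms are the standard equivalent shapes. Whether EXISTS actually wins depends on your platform and version, so always verify with the query plan.

```python
import sqlite3

# Hypothetical customers/orders schema to show the two equivalent query shapes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cho');
    INSERT INTO orders VALUES (10, 1), (11, 3);
""")

# IN form: the subquery result is built, then each customer id is probed against it.
in_form = conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders)
    ORDER BY name""").fetchall()

# EXISTS form: a correlated probe that can stop at the first matching order,
# which some optimizers handle better than the IN form.
exists_form = conn.execute("""
    SELECT name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY name""").fetchall()

assert in_form == exists_form   # same rows either way
```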
Rethink the problem
This is probably the most difficult to envision. It involves thinking outside the
proverbial box. I had a client some years ago whose application
synced data from the field via PDAs. Since their infancy as a company,
they had always done fairly instant uploads of data; users were immediately
able to report and summarize their results. However, as the business
grew, so did the servers required to provide this type of premium service. My
suggestion was to provide gold, silver, and bronze levels of service, where
users would get instant, hourly, and daily uploads of data respectively, and
would pay a premium if they needed instant access. In the end, most users
were fine with the silver or bronze level of data uploads, and the ones who needed
gold were willing to pay more, thus funding the higher level of servers
required to fill that need. This arrangement also lent itself nicely to
partitioning users and allocating hardware.
Hardware is usually the first place people look to get more bang for their buck.
If I add a bigger engine to my car, or more farmers to a farm, I go faster and
get more work done. This intuitive notion carries over to computing. Unfortunately,
it doesn't always hold.
In the parlance of computer scientists, hardware additions will typically get you only
one order of magnitude of speedup, that is, 2x, 4x, or 8x, but usually less than
10x. That's not to say it's not worth considering; however, we need
to balance the cost against the gain.
When reviewing your hardware, the first stop should be the I/O subsystem. Are
you using RAID 5? If so, switch it out for RAID 10. How many disks
are in your RAID array? Only four or six? Can you get an array
that can handle 10 or more disks? Such an array will be much faster
because you have more spindles: more tiny little heads on more platters working
together to read your data faster. Also, verify that you don't have
any network bottlenecks associated with your disk subsystem. These can
hamper the throughput of your array, because data has to move back and forth
between the array and main memory.
Next, consider how much main memory you have. More main memory can mean larger
buffer caches, enabling your database to keep more index and data blocks
in memory longer, or to do larger sorts in main memory. Not only will you be
able to allocate more memory to the database kernel and caches, but you'll also be able
to give more memory to your sessions, or allow more sessions overall coming
from your webservers, and thus support more webservers.
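As a rough sketch, the extra memory translates into settings like the my.cnf fragment below. The variable names are real MySQL options, but the values are arbitrary examples, not recommendations; tune them to your workload and total RAM.

```ini
# Illustrative my.cnf fragment -- values are examples only.
[mysqld]
# Larger buffer caches keep more index and data blocks in memory longer.
innodb_buffer_pool_size = 2G
key_buffer_size         = 256M     # MyISAM index cache
# Per-session memory: larger in-memory sorts, but the cost multiplies per connection.
sort_buffer_size        = 4M
# More RAM also lets you carry more concurrent sessions from your webservers.
max_connections         = 500
```

Remember that per-session buffers are allocated per connection, so raising them together with max_connections can consume memory quickly.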
What about CPUs? Typically, a 2-CPU SMP box is better than a single-CPU box,
4 CPUs are better than 2, and so on. That's because of Unix's multi-process
architecture, in which each CPU can handle a different process. This
all works great for multi-process applications, except for
one tiny little problem: MySQL has a single-process, multi-threaded
architecture. So adding more CPUs will help, but only to a point.
The MySQL development team is looking at these concurrency issues, and the
architecture will evolve to scale better on large SMP boxes. For the time
being, however, the recommended method is to run multiple MySQL instances on
different ports on the same box. This can give you better concurrency; however,
those separate instances cannot mount the same database (except with the NDB
storage engine), so it may complicate the application. We'll
get to that soon enough.
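A minimal sketch of the multiple-instance setup uses mysqld_multi, which reads numbered [mysqldN] groups from my.cnf; the ports and paths below are examples, not a prescription.

```ini
# Sketch of a my.cnf for mysqld_multi -- ports and paths are examples.
[mysqld1]
port    = 3306
socket  = /tmp/mysql1.sock
datadir = /var/lib/mysql1

[mysqld2]
port    = 3307
socket  = /tmp/mysql2.sock
datadir = /var/lib/mysql2
```

With a configuration like this, `mysqld_multi start 1,2` brings up both instances. Note that each instance has its own datadir: the data must be partitioned between them, which is where the application changes come in.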
Your application uses SQL to talk to the database, but also to frame the problem to
be solved. This is the often-overlooked area to tune, and it often provides the
largest bang for the buck. You may also look over your hardware, from the
disk I/O subsystem to CPUs and main memory, to get some big improvements there as
well. In part two of this article, we'll talk about scaling by adding additional
servers: when and where it is appropriate, and how it works.
See All Articles by Columnist Sean Hull