Sizing Up the SimpleDB Feature Set
The SimpleDB API exposes a
limited set of features. Here is a list of what you get:
- You
can create named domains within your account. At the time of this writing, the
initial allocation allows you to create up to 100 domains. You can request a
larger allocation on the AWS website.
- You can delete an
existing domain at any time without first deleting the data stored in it.
- You can store a
data item for the first time or for subsequent updates using a call to PutAttributes. When you
issue an update, you do not need to pass the full item; you can pass just the
attributes that have changed.
- There is a batch
call that allows you to put up to 25 items at once.
- You can retrieve
the data with a call to GetAttributes.
- You can query
for items based on criteria on multiple attributes of an item.
- You can store
any type of data. SimpleDB treats it all as string data, and you are free to
format it as you choose.
- You
can store different types of items in the same domain, and items of the same
type can vary in which attributes have values.
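As a small illustration of the batch limit in the list above, the following sketch splits a collection of items into groups of at most 25 before each group is handed to a batch put call. The function name and the shape of the items are assumptions made for illustration, and the actual call to SimpleDB is deliberately left out.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

BATCH_LIMIT = 25  # items per batch put, per the feature list above


def batches(items: Sequence[T], size: int = BATCH_LIMIT) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Each yielded group could then be passed to whatever batch call your client library provides.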
Benefits of Using SimpleDB
When you use SimpleDB, you
give up some features you might otherwise have, but as a trade-off, you gain
some important benefits, as follows:
- Availability: When you store your data in SimpleDB, it
is automatically replicated across multiple storage nodes and across multiple data
centers in the same region.
- Simplicity: There are not a lot of knobs or dials,
and there are not any configuration parameters. This makes it a lot harder to
shoot yourself in the foot.
- Scalability: The service is designed for scalability
and concurrent access.
- Flexibility: Store the data you need to store now,
and if the requirements change, store it differently without changing the
database.
- Low latency within the same region: Access to
SimpleDB from an EC2 instance in the same region has the latency of a typical
LAN.
- Low maintenance: Most of the administrative burden
is transferred to Amazon. They maintain the hardware and the database software.
Database Features SimpleDB Doesn't Have
There are a number of common
database features noticeably absent from Amazon SimpleDB. Programs based on
relational database products typically rely on these features. You should be
aware of what you will not find in SimpleDB, as follows:
- Full SQL support: A query language similar to SQL is
supported for queries only. However, it only applies to select statements,
and there are some syntax differences and other limitations.
- Joins: You can issue queries, but there are no
foreign keys and no joins.
- Auto-incrementing primary keys: You have to create
your own primary keys in the form of an item name.
- Transactions: There are no explicit transaction
boundaries that you can mark or isolation levels that you can define. There is
no notion of a commit or a rollback. There is some implicit support for atomic
writes, but it only applies within the scope of each individual item being
written.
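Because there is no auto-incrementing key, the client has to supply every item name itself. One common approach, sketched here as an assumption rather than anything SimpleDB prescribes, is to build the name from a random UUID. The helper name and the prefix are hypothetical.

```python
import uuid


def new_item_name(prefix: str = "order") -> str:
    """Return a client-generated item name such as 'order-<random uuid>'.

    uuid4() is random, so independent clients can create names without
    coordinating through a shared counter.
    """
    return f"{prefix}-{uuid.uuid4()}"
```

A name built this way is short and stays well within the 1024-byte item name limit listed under Service Limits.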
Higher-Level Framework Functionality
The simplicity of what SimpleDB offers on the server side is
matched by the simplicity of what AWS provides in officially supported SimpleDB
clients. There is a one-to-one mapping of service features to client calls. There
is a lot of functionality that can be built atop the basic SimpleDB primitives.
In addition, the inclusion of these advanced features has already begun with a number
of third-party SimpleDB clients. Popular persistence frameworks used as an abstraction
layer above relational databases are prime candidates for this.
Some features normally included
within the database server can be written into SimpleDB clients for automatic
handling. Third-party client software is constantly improving, and some of the
following features may already be present, or you may have to write them
yourself:
- Data formatting: Integers, floats, and dates require
special formatting in some cases.
- Object mapping: It can be convenient to map
programming language objects to SimpleDB attributes.
- Sharding: The domain is the basic unit of horizontal
scalability in SimpleDB. However, there is no explicit support for automatically
distributing data across domains.
- Cache integration: Caching is an important aspect of
many applications, and caching popular data objects is a well-understood
optimization. Configurable caching that is well integrated with a SimpleDB
client is an important feature.
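To make the sharding item above more concrete, here is a minimal sketch of how a client might spread items across several domains by hashing the item name. The domain naming scheme, the shard count, and the function name are all assumptions made for illustration. SimpleDB itself provides nothing like this, and a real client would also need a plan for changing the number of shards later.

```python
import hashlib


def domain_for(item_name: str, base: str = "orders", shard_count: int = 4) -> str:
    """Map an item name to one of `shard_count` domains, e.g. 'orders_2'.

    A stable hash (not Python's per-process hash()) is used so that every
    client computes the same domain for the same item name.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(item_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base}_{index}"
```

Reads and writes for a single item then go to the one domain returned by the function. A query that spans items has to be issued against every domain, with the results merged by the client.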
Service Limits
There are quite a few limitations
on what you are allowed to do with SimpleDB. Most of these are size and
quantity restrictions. There is an underlying philosophy that small and quickly
serviced units of work provide the greatest opportunity for load balancing and
maintaining uniform service levels. AWS maintains a current listing of the
service limitations within the latest online SimpleDB Developer Guide at the
AWS website. At the time of this writing, the limits are as follows:
- Max storage per domain: 10GB
- Max attribute values per domain: 1 billion
- Initial max domains per account: 100
- Max attribute values per item: 256
- Max length of item name, attribute name, or value: 1024 bytes
- Max query execution time: 5 seconds
- Max query results: 2500
- Max query response size: 1MB
- Max comparisons per query: 20
These limits may seem restrictive when compared to the
unlimited nature of data sizes you can store in other database offerings.
However, there are two things to keep in mind about these limits. First, SimpleDB
is not a general-purpose data store suitable for everything. It is specifically
designed for storing small chunks of data. For larger data objects that you
want to store in the cloud, you are advised to use Amazon S3. Secondly,
consider the steps that need to be taken with a relational database at higher
loads when performance begins to degrade. Typical recommendations often include
offloading processing from the database, reducing long-running queries, and
applying selective de-normalization of the data. These limits are what help
enable efficient and automatic background replication and high concurrency and
availability. Some of these limits can be worked around to a degree, but no
workarounds exist for you to make SimpleDB universally appropriate for all data
storage needs.
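Some of these limits can be checked on the client before a request is ever sent. The sketch below tests an item against two of the limits listed above: the 1024-byte length limit and the 256 attribute values per item. The function name and the dictionary layout (attribute name mapped to a list of string values) are assumptions made for illustration, not part of any SimpleDB client.

```python
from typing import Dict, List

MAX_BYTES = 1024           # item name, attribute name, or value
MAX_VALUES_PER_ITEM = 256  # attribute values per item


def check_item(item_name: str, attributes: Dict[str, List[str]]) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    def too_long(text: str) -> bool:
        return len(text.encode("utf-8")) > MAX_BYTES

    problems = []
    if too_long(item_name):
        problems.append("item name exceeds 1024 bytes")
    total_values = 0
    for name, values in attributes.items():
        if too_long(name):
            problems.append(f"attribute name too long: {name[:20]}...")
        for value in values:
            total_values += 1
            if too_long(value):
                problems.append(f"value too long for attribute {name[:20]}")
    if total_values > MAX_VALUES_PER_ITEM:
        problems.append("more than 256 attribute values in one item")
    return problems
```

Values that fail the length check are candidates for storage in Amazon S3, with only a reference kept in SimpleDB.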
Abandoning the Relational Model?
There have been many recent products and services offering
data storage but rejecting the relational model. This trend has been dubbed by
some as the NoSQL movement. There is a fair amount of enthusiasm both for and
against this trend. A few of those in the against column argue that databases
without schemas, type checking, normalization, and so on are throwing away 40
years of database progress. Likewise, some proponents are quick to dispense the
hype about how a given NoSQL solution will solve your problems. The aim of this
section is to present a case for the value of a service like SimpleDB that
addresses legitimate criticism and avoids hype and exaggeration.
A Database Without a Schema
One of the primary areas of contention around SimpleDB and
other NoSQL solutions centers on the lack of a database schema. Database
schemas turn out to be very important in the relational model. The formalism of
predefining your data model into a schema provides a number of specific
benefits, but it also imposes restrictions.
SimpleDB has no notion of a schema at all. Many of the
structures defined in a typical database schema do not even exist in SimpleDB.
This includes things such as stored procedures, triggers, relationships, and
views. Other elements of a database schema like fields and types do exist in
SimpleDB but are flexible and are not enforced on the server. Still other
features, like indexes, require no formal definition because the SimpleDB
service creates and manages them behind the scenes.
However, the lack of a schema requirement in SimpleDB does not
prevent you from gaining the benefits of a schema. You can create your own
schema for whatever portion of your data model is appropriate. This allows
you to cherry-pick the benefits that are helpful to your application without
the unneeded restrictions.
One of the most important things you gain from codifying your
data layout is a separation between it and the application. This is an enabling
feature for tools and application plug-ins. Third-party tools can query your
data, convert your data from one format to another, and analyze and report on
your data based solely on the schema definition. The alternative is less
attractive. Tools and extensions are more limited in what they can do without
knowledge of the formats. For example, you cannot compute the sum of values in
a numeric column if you do not know the format of that column. In the
degenerate case, developers must search through your source code to infer data
types.
In SimpleDB, many of the most common database features are not
available. Query, however, is one important feature that is present and has
some bearing on your data formatting. Because all the data you store in
SimpleDB is variable length character data, you must apply padding to numeric
data in order for queries to work properly. For example, if you want to store
an attribute named price with a value of 269.94, you must first add leading
zeros to make it 00000269.94. This is required because greater-than and
less-than comparisons within SimpleDB compare each character from left to
right. Padding with zeros allows you to line up the decimal point so the
comparisons will be correct for all possible values of that attribute.
Relational database products handle this for you behind the scenes when you
declare a column as a numeric type like int.
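The effect of padding is easy to see in a few lines of code. The following sketch zero-pads a non-negative price to a fixed width so that plain string comparison agrees with numeric comparison. The function name and the chosen widths (eight integer digits and two decimal places, matching the example above) are assumptions. Negative numbers need additional treatment that is not shown here.

```python
from decimal import Decimal


def pad_price(value: str, int_digits: int = 8, frac_digits: int = 2) -> str:
    """Zero-pad a non-negative decimal so string order matches numeric order."""
    number = Decimal(value)
    if number < 0:
        raise ValueError("this sketch handles non-negative values only")
    width = int_digits + 1 + frac_digits  # digits, decimal point, digits
    text = f"{number:0{width}.{frac_digits}f}"
    if len(text) > width:
        raise ValueError("value does not fit in the chosen width")
    return text


# Unpadded strings compare character by character, so "9.5" sorts after "10.25".
assert "9.5" > "10.25"
# With padding, the decimal points line up and the comparison is correct.
assert pad_price("9.5") < pad_price("10.25")
assert pad_price("269.94") == "00000269.94"
```

The width has to be chosen up front and be large enough for every value the attribute will ever hold, because changing it later means rewriting the stored data.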
This is a case in SimpleDB where a schema is beneficial. The
code that initially imports records into SimpleDB, the code that writes records
as your app runs, and any code that uses a numeric attribute in a query all
need to use the exact same format. Explicitly storing the schema externally is
a much less error-prone approach than implicitly defining the format in
duplicated code across various modules.
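One way to keep the formats in a single place is a small schema table that every module imports. The sketch below is only an illustration: the attribute names, the formats, and the helper functions are hypothetical, and a real project might keep the schema in a configuration file instead.

```python
# Hypothetical external schema: attribute name -> (encode, decode).
# Import code, runtime writes, and query builders all share this one table.
SCHEMA = {
    "price": (lambda v: f"{v:011.2f}", float),   # e.g. 00000269.94
    "quantity": (lambda v: f"{v:06d}", int),     # e.g. 000003
    "name": (str, str),                          # stored unchanged
}


def encode_value(attribute, value):
    """Format one value the way it is stored and compared in SimpleDB."""
    return SCHEMA[attribute][0](value)


def encode_item(item):
    """Convert typed Python values to the string form described by SCHEMA."""
    return {name: encode_value(name, value) for name, value in item.items()}


def decode_item(stored):
    """Convert stored strings back to typed Python values."""
    return {name: SCHEMA[name][1](text) for name, text in stored.items()}
```

Code that builds a query would call encode_value("price", 100) for the comparison value instead of formatting the number on its own, so a change to the format only has to be made in the table.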
Another benefit of the predefined schema in the relational
model is that it forces you to think through the data relationships and make
unambiguous decisions about your data layout. Sometimes, however, the data is
simple, there are no relationships, and creating a data model is overkill.
Sometimes you may still be in the process of defining the data model. SimpleDB
can be used as part of the prototyping process, enabling you to evolve your
schema dynamically as issues surface that may not otherwise have become known
so quickly. You may be migrating from a different database with an existing
data model. The important thing to remember is that SimpleDB is simple by
design. It can be useful in a variety of situations and does not prevent you
from creating your own schema external to SimpleDB.
Areas Where Relational Databases Struggle
Relational databases have been around for some time. There
are many robust and mature products available. Modern database products offer a
multitude of features and a host of configuration options.
One area where difficulty arises is with database features that
you do not need or that you should not use for a particular application. Applications
that have simple data storage requirements do not benefit from the myriad of available
options. In fact, they can be detrimental in a couple of different ways. If you need
to learn the intricacies of a particular database product before you can make good
use of it, the time spent learning takes away from time you could have spent on
your application. Knowledge of how database products work is good to have. It would
be hard to argue that you wasted your time by learning it because that information
could serve you well far into the future. Similarly, if there is a much simpler
solution that meets your needs, you could choose that instead. If you had no immediate
requirement to gain product-specific database expertise, it would be hard to insist
that you made the wrong choice. It is a tough sell to argue that the more time-consuming,
yet educational, route is always better than the simple and direct route. This is
a challenge faced by databases today, when the simple problems are not met with
simple solutions.
Another pain point with relational databases is horizontal
scaling. It is easy to scale a database vertically by beefing up your server
because memory and disk drives are inexpensive. However, scaling a database
across multiple servers can be extremely difficult. There is a whole spectrum
of options available for horizontal scaling that includes basic master-slave
replication as well as complicated sharding strategies. These solutions each
require a different, and sometimes considerable, amount of expertise.
Nevertheless, they all have one thing in common when compared to vertical
scaling solutions. On top of the implementation difficulty, each additional
server results in an additional increase in ongoing maintenance responsibility.
Moreover, it is not merely the additional server maintenance of having more
servers. I am referring to the actual database administration tasks of managing
additional replicas, backups, and log shipping. It also includes the tasks of
rolling out schema changes and new indexes to all servers in the cluster.
If you are in a situation where you want a simple database
solution or you want horizontal scaling, SimpleDB is definitely a service to
consider. However, you may need to be prepared to defend your decision.
Scalability Isn't Your Problem
Around every corner, you can find people who will challenge
your efforts to scale horizontally. Beyond the cost and difficulty, there is a
degree of resistance to products and services that seek to solve these
problems.
The typical, and now clichéd, advice tends to be that
scalability is not your problem, and trying to solve scalability at the outset
is a case of premature optimization. This is followed by a discussion of how
many daily page views a single high-performance database server can support. Finally,
it ends by noting that it is really just a problem for when you reach the scale
of Google or Amazon.
The premise of the argument is actually solid, although not
applicable to all situations. The premise is that when you are building a site
or service that nobody has heard of yet, you are often more concerned about handling
loads of people than about making the site remarkable, when it should be the
other way around. It is good advice for
these situations. Moreover, it is especially timely considering that there is a
small but religious segment of Internet commentators who eagerly chime, "X
doesn't scale," where X is any alternative to the solution the commenter uses.
Among programmers, there is a general preoccupation with performance
optimization that seems somewhat out of balance.
The fact is that for many projects, scalability really is not your
problem, but availability can be. Distributing your data store across servers from
the outset is not a premature optimization when you can quantify the cost of
downtime. If a couple of hours of downtime will have an impact on your business, then availability
is something worth thinking about. For the IT department delivering a mission-critical
application, availability is important. Even if only 20 users will use it during
normal business hours, when it provides a competitive advantage, it is important
to maintain availability through expected outages. When you have a product launch,
and your credibility is at stake as much as your revenue, you are not putting the
cart before the horse when you protect yourself against hardware failures.
There are many situations where availability is an important
system quality. Look at how common it is for a multi-server web cluster to host
one website. Before you can add a second web server, you must first solve a
small set of known problems. User sessions have to be managed properly, and load
balancing has to be in place to route around unresponsive servers. However,
web server clusters are useful for more than high-traffic load handling. They
are also beneficial because we know that hardware will fail, and we want to
maintain service during the failure. We can add another web server because it
is neither costly nor difficult, and it improves the availability. With the
advent of systems designed to provide higher database availability that are neither
costly nor hard, availability becomes worth pursuing for less-critical projects.
Avoiding the SimpleDB Hype
There are many different application scenarios where
SimpleDB is an interesting option. That said, some people have overstated the
benefits of using SimpleDB specifically and hosted NoSQL databases in general.
The reasoning seems to be that services running on the infrastructure of
companies like Amazon, Google, or Microsoft will undoubtedly have nearly
unlimited automatic scalability. Although there is nothing wrong with
enthusiasm for products and services that you like, it is good to base that
enthusiasm on reality.
Do not be fooled into thinking that any of these new databases
is going to be a panacea. Make sure you educate yourself about the pros and
cons of each solution as you evaluate it. The majority of services in this space
have a free usage tier, and all the open-source alternatives are completely
free to use. Take advantage of it, and try them out for yourself. We live in an
amazing time in history where the quantity of information available at our
fingertips is unprecedented. Access to web-based services and open-source
projects is a huge opportunity. The tragedy is that in a time when it has never
been easier to gain personal experience with new technology, all too often we
are tempted to adopt the opinions of others instead of taking the time to form
our own opinions. Do not believe the hype; find out for yourself.
Putting the DBA Out of Work
One of the stated goals of SimpleDB is allowing customers to
outsource the time and effort associated with managing a web-scale database.
Managing the database is traditionally the world of the DBA. Some people have
assumed that advocating the use of SimpleDB amounts to advocating a world where
the DBA diminishes in importance. However, this is not the case at all.
One of the things that have come about from the widespread
popularity of EC2 has been a change in the role of system administrators. What
we have found is that managing EC2 virtual instances is less work than managing
a physical server instance. However, the result has not been a rash of system
administrator firings. Instead, the result has been that system administrators
are able to become more productive by managing larger numbers of servers than
they otherwise could. The ease of acquisition and the low cost to acquire and
release the computing power have led, in many cases, to a greater and more
dynamic use of the servers. In other words, organizations are using more server
instances because the various levels of the organization can handle it, from a
cost, risk, and labor standpoint.
SimpleDB and its cohorts seem to facilitate a similar change
but on a smaller scale. First, SimpleDB has less general applicability than
EC2. It is a suitable solution for a much smaller set of problems. AWS fully
advocates the use of existing relational database products. SimpleDB is an
additional option, not a replacement. Moreover, SimpleDB finds good usage in
some areas where a relational database might not normally be used, as in the
case of storing web user session data. In addition, for those projects that
choose to use SimpleDB instead of, or along with, a relational database, it does
not mean that there is no role for the DBA. Some tasks remain similar to EC2,
which can result in a greater capacity for IT departments to create solutions.
Dodging Copies of C.J. Date
There are database purists who wholeheartedly try to
dissuade people from using any type of non-relational database on principle
alone. Not only that, but they also go to great lengths to advocate the proper
use of relational databases and lament the fact that no current database
products correctly implement the relational model. Having found the one true
data storage paradigm, they believe that the relational model is right and is
the only one that will last. The purists are not wrong in their appreciation
for the relational model and for SQL. The relational model is the cornerstone
of the database field, and more than that, an invaluable contribution to the
world of computing. It is one of the two best things to come out of 1969.
Invented by a mathematician and considered a branch of mathematics itself,
there is a solid theoretical rigor that underlies its principles. Even though
it is not a complete or finished branch, the work to date has been sound.
The world of mathematics and academic research is an
interesting place. When you have spent large quantities of your life and career
there, you are highly qualified to make authoritative comments on topics like
correctness and provability. Nevertheless, being either a relational model
expert or merely someone who holds them in high regard does not say anything
about your ability to deliver value to users. It is clearly true that modeling
your data correctly can provide measurable benefits and that making mistakes
in your model can lead to certain classes of problems. However, you can still
provide significant user value with a flawed model, and correctness is no
guarantee of success.
It is like perfectly generated XHTML that always validates. It
is like programming with a functional style (in any programming language) that
lets you prove your programs are correct. It is like maintaining unit tests
that provide 100% test coverage for every line of code you write. There is
nothing inherently bad you can say about these things. In fact, there are
plenty of good things to say about them. The problem is not a technical
problem; it is a people problem. The problem is when people become hyper-focused
on narrow technological aspects to the exclusion of the broader issues of the
application's purpose.
The people conducting database research and the ones who take
the time to help educate the computing industry deserve our respect. If you
have a degree in computer science, chances are you studied C.J. Date's work in
your database class. Among professional programmers, there is no good excuse
for not knowing data and relational fundamentals. However, the person in the
next row of cubicles who is only contributing condescending criticism to your
project is no C.J. Date. In addition, the user with 50 times your
stackoverflow.com reputation who ridicules the premise of your questions
without providing useful suggestions is no E.F. Codd. Understanding the theory
is of great importance. Knowing how to deliver value to your users is of
greater importance. In the end, avoid vociferous ignorance and don't let anyone
kick copies of C.J. Date in your face.