The Complete Guide to Building Cloud Computing Solutions with Amazon SimpleDB. ‘Introducing Amazon SimpleDB’ is extracted from ‘A Developer’s Guide to Amazon SimpleDB’, published by Addison-Wesley Professional.
A Developer’s Guide to Amazon SimpleDB, by Mocky Habeeb. Published Feb 9, 2010 by Addison-Wesley Professional. Part of the Developer’s Library series. ISBN-10: 0-321-68597-0. ISBN-13: 978-0-321-68597-1.
Amazon has been offering its customers computing infrastructure via Amazon Web
Services (AWS) since 2006. AWS aims to use its own infrastructure to provide
building blocks that other organizations can use. The Elastic Compute Cloud
(EC2) is an AWS offering that enables you to spin up virtual servers as you
need the computing power and shut them off when you are done. Amazon Simple
Storage Service (S3) provides fast and unlimited file storage for the web.
Amazon SimpleDB is a service designed to complement EC2 and S3, but the concept
is not as easy to grasp as “extra servers” and “extra storage.” This chapter
will cover the concepts behind SimpleDB and discuss how it compares to other
services.
What Is SimpleDB?
SimpleDB is a web service providing structured data storage
in the cloud and backed by clusters of Amazon-managed database servers. The
data requires no schema and is stored securely in the cloud. There is a query
function, and all the data values you store are fully indexed. In keeping with
Amazon’s other web services, there is no minimum charge, and you are only
billed for your actual usage.
What SimpleDB Is Not
The name “SimpleDB” might lead you to believe that it is just
like relational database management systems (RDBMS), only simpler to use. In some
respects, this is true, but it is not just about making simplistic database usage
simpler. SimpleDB aims to simplify the much harder task of creating and managing
a database cluster that is fault-tolerant in the face of multiple failures, replicated
across data centers, and delivers high levels of availability.
One misconception that seems to be very common among people
just learning about SimpleDB is the idea that migrating from an RDBMS to
SimpleDB will automatically solve your database performance problems.
Performance certainly is an important part of the equation when you seek to
evaluate databases. Unfortunately, for some people, speed is the beginning and
the end of the thought process. It can be tempting to view any of the new
hosted database services as a silver bullet when offered by a mega-company like
Microsoft, Amazon, or Google. But the fact is that SimpleDB is not going to
solve your existing speed issues. The service exists to solve an entirely
different set of problems. Reads and writes are not blazingly fast. They are
meant to be “fast enough.” It is entirely possible that AWS may increase
performance of the service over time, based on user feedback. But SimpleDB is
never going to be as speedy as a standalone database running on fast hardware.
SimpleDB has a different purpose.
A robust database cluster that replicates data across multiple data
centers is not a data storage solution that is typically easy to throw
together. It is a time-consuming and costly undertaking. Even in organizations
that have the database administrator (DBA) expertise and are using multiple
data centers, it is still time-consuming. It is costly enough that you would
not do it unless there were a quantifiable business need for it. SimpleDB offers
data storage with these features on a pay-as-you-go basis.
Of course, taking advantage of these features is not without a
downside. SimpleDB is a moderately restrictive environment, and it is not
suitable for many types of applications. There are various restrictions and
limitations on how much data can be stored and transferred and how much network
bandwidth you can consume.
Schema-Less Data
SimpleDB differs from relational databases, where you must
define a schema for each database table before you can use it and where you
must explicitly change that schema before you can store your data differently.
In SimpleDB, there is no schema requirement. Although you still have to consider
the format of your data, this approach has the benefit of freeing you from the
time it takes to manage schema modifications.
The lack of schema means that there are no data types; all data
values are treated as variable length character data. As a result, there is
literally nothing extra to do if you want to add a new field to an existing
database. You just add the new field to whichever data items require it. There
is no rule that forces every data item to have the same fields.
The drawbacks of a schema-less database include the lack of automatic
integrity checking in the database and an increased burden on the application to
handle formatting and type conversions. Detailed coverage of the impact of schema-less
data on queries appears in Chapter 4, “A Closer Look at Select,” along with a discussion
of the formatting issues.
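Because every value is stored as character data, comparisons between values are string comparisons, and the application has to format numbers so that string order matches numeric order. The following is a minimal sketch of one possible convention, zero-padding non-negative integers to a fixed width. The function names and the default width of 10 are assumptions made for illustration; they are not part of SimpleDB.

```python
def encode_int(value: int, width: int = 10) -> str:
    """Format a non-negative integer as a fixed-width, zero-padded string."""
    if value < 0:
        raise ValueError("this sketch handles only non-negative integers")
    text = str(value)
    if len(text) > width:
        raise ValueError("value does not fit in the chosen width")
    return text.zfill(width)


def decode_int(text: str) -> int:
    """Convert a zero-padded string back to an integer."""
    return int(text)


# Plain strings compare character by character, so "9" sorts after "10".
assert "9" > "10"
# Padded strings sort in the same order as the numbers they represent.
assert encode_int(9) < encode_int(10)
```

The width has to be chosen up front and applied to every value of that attribute, since mixing padded and unpadded values brings the ordering problem back.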
Stored Securely in the Cloud
The data that you store in SimpleDB is available both from the
Internet and (with less latency) from EC2. The security of that data is of great
importance for many applications, while the security of the underlying web services
account should be important to all users.
To protect that data, all access to SimpleDB, whether read or
write, is protected by your account credentials. Every request must bear the
correct and authorized digital signature or else it is rejected with an error
code. Security of the account, data transmission, and data storage is the
subject of Chapter 8, “Security in SimpleDB-Based Applications.”
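The general idea of a signed request can be sketched with an HMAC: the client and the service share a secret key, the client signs a canonical form of the request, and the service recomputes the signature and rejects any mismatch. The code below is a generic illustration of that idea and should not be read as the exact AWS signing procedure. The canonicalization rules, the host name, and the key are assumptions.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def canonical_query(params: dict) -> str:
    """Sort parameters by name and percent-encode names and values."""
    pairs = sorted(params.items())
    return "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}" for k, v in pairs
    )


def sign(secret_key: str, method: str, host: str, path: str, params: dict) -> str:
    """Return a base64-encoded HMAC-SHA256 signature over the request."""
    string_to_sign = "\n".join(
        [method, host.lower(), path, canonical_query(params)]
    )
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def verify(secret_key: str, method: str, host: str, path: str,
           params: dict, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret_key, method, host, path, params)
    return hmac.compare_digest(expected, signature)


# Hypothetical request: the host and the secret key are placeholders.
params = {"Action": "GetAttributes", "DomainName": "products"}
signature = sign("example-secret", "GET", "sdb.example.com", "/", params)
assert verify("example-secret", "GET", "sdb.example.com", "/", params, signature)
```

Changing any parameter after signing, or signing with a different key, produces a different signature, which is what lets the service reject a forged or altered request.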
Billed Only for Actual Usage
In keeping with the AWS philosophy of pay-as-you-go,
SimpleDB has a pricing structure that includes charges for data storage, data
transfer, and processor usage. There are no base fees and there are no
minimums. At the time of this writing, Amazon’s monthly billing for SimpleDB
has a free usage tier that covers the first gigabyte (GB) of data storage, the
first GB of data transfer, and the first 25 hours of processor usage each
month. Data transfer costs beyond the free tier have historically been on par with
S3 pricing, whereas storage costs have always been somewhat higher. Consult the
AWS website at https://aws.amazon.com/simpledb/ for current pricing information.
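The free tier described above reduces to simple arithmetic: only the usage beyond each allowance is billable. The sketch below computes billable quantities only. It deliberately leaves out prices, which change over time, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Free-tier allowances as described in the text (at the time of writing).
FREE_STORAGE_GB = 1.0
FREE_TRANSFER_GB = 1.0
FREE_PROCESSOR_HOURS = 25.0


@dataclass
class Usage:
    storage_gb: float
    transfer_gb: float
    processor_hours: float


def billable(usage: Usage) -> Usage:
    """Return the portion of monthly usage that exceeds the free tier."""
    return Usage(
        storage_gb=max(0.0, usage.storage_gb - FREE_STORAGE_GB),
        transfer_gb=max(0.0, usage.transfer_gb - FREE_TRANSFER_GB),
        processor_hours=max(0.0, usage.processor_hours - FREE_PROCESSOR_HOURS),
    )
```

For example, a month with 3.5 GB stored, 0.2 GB transferred, and 40 processor hours would leave 2.5 GB of storage and 15 processor hours billable, with no charge for transfer.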
Domains, Items, and Attribute Pairs
The top level of data storage in SimpleDB is the domain. A domain
is roughly analogous to a database table. You can create and delete domains as needed.
There are no configuration options to set on a domain; the only parameter you can
set is the name of the domain.
All the data stored in a SimpleDB domain takes the form of
name-value attribute pairs. Each attribute pair is associated with an item,
which plays the role of a table row. The attribute name is similar to a
database column name, but unlike database rows, which must all have identical
columns, SimpleDB items can each contain different attribute names. This gives
you the freedom to store different data in some items without changing the
layout of other items that do not have that data. It also allows the painless
addition of new data fields in the future.
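One way to picture this layout is as nested mappings: a domain maps item names to items, and each item maps attribute names to string values. The following is an in-memory sketch of that picture only. The item names and attributes are made up, and for simplicity this model holds a single value per attribute name.

```python
# A domain maps item names to items; an item maps attribute names to values.
products = {
    "item-001": {"name": "kettle", "rating": "04", "color": "red"},
    "item-002": {"name": "toaster", "rating": "02"},
}

# A new attribute can be added to one item without touching the others.
products["item-002"]["warranty-months"] = "12"


def attribute_names(domain: dict) -> set:
    """Collect every attribute name used by any item in the domain."""
    names = set()
    for item in domain.values():
        names.update(item)
    return names
```

Note that "item-001" never gains a "warranty-months" attribute and "item-002" never has a "color"; nothing in the structure requires the two items to match.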
Multi-Valued Attributes
It is possible for each attribute to have not just one
value, but an array of values. For example, an application that allows user
tagging can use a single attribute named “tags” to hold as many or as few tags
as needed for each item. You do not need to change a schema definition to
enable multi-valued attributes. All you need to do is add another attribute to
an item and use the same attribute name with a different value. This provides
you with flexibility in how you store your data.
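The tagging example can be sketched as an item whose attributes each hold a collection of values. The class below is an illustrative in-memory model, not a SimpleDB client. Storing the values as a set and exposing a replace flag are assumptions of the sketch.

```python
from collections import defaultdict


class Item:
    """An item whose attributes each hold a set of string values."""

    def __init__(self):
        self.attributes = defaultdict(set)

    def put(self, name: str, value: str, replace: bool = False) -> None:
        """Add a value under an attribute name, optionally replacing old values."""
        if replace:
            self.attributes[name] = {value}
        else:
            self.attributes[name].add(value)

    def get(self, name: str) -> set:
        """Return a copy of all values stored under an attribute name."""
        return set(self.attributes.get(name, set()))


photo = Item()
photo.put("tags", "sunset")
photo.put("tags", "beach")  # same attribute name, different value
assert photo.get("tags") == {"sunset", "beach"}
```

Adding a second tag is the same operation as adding the first, which is the point: there is no separate step that turns a single-valued attribute into a multi-valued one.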
Queries
SimpleDB is primarily a key-value
store, but it also has useful query functionality. A SQL-style query language is
used to issue queries over the scope of a single domain. A subset of the SQL select
syntax is recognized. The following is an example SimpleDB select statement:
SELECT * FROM products WHERE rating > '03' ORDER BY rating LIMIT 10
You put a domain name—in this
case, products—in the FROM clause where a table name would normally be. The WHERE
clause recognizes a dozen or so comparison operators, but an attribute name
must always be on the left side of the operator and a literal value must always
be on the right. There is no relational comparison between attributes allowed
here. So, the following is not valid:
SELECT * FROM users WHERE creation-date = last-activity-date
All the data stored in SimpleDB is treated as plain string
data. There are no explicit indexes to maintain; each value is automatically
indexed as you add it.
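Because values are plain strings, the comparison in the WHERE clause is a string comparison. The following in-memory sketch mimics the example select statement above to show the effect. It is not how SimpleDB evaluates queries, the function name and sample data are hypothetical, and it assumes one value per attribute.

```python
def select_where_greater(domain: dict, attribute: str,
                         literal: str, limit: int) -> list:
    """Return up to `limit` (item_name, item) pairs whose attribute value
    compares greater than the literal as a string, ordered by that value."""
    matches = [
        (name, item)
        for name, item in domain.items()
        if attribute in item and item[attribute] > literal
    ]
    matches.sort(key=lambda pair: pair[1][attribute])
    return matches[:limit]


products = {
    "a": {"rating": "05"},
    "b": {"rating": "02"},
    "c": {"rating": "10"},
    "d": {"name": "unrated"},  # no rating attribute, so it cannot match
    "e": {"rating": "4"},      # stored without zero-padding
}

# Roughly: SELECT * FROM products WHERE rating > '03' ORDER BY rating LIMIT 10
rows = select_where_greater(products, "rating", "03", 10)
assert [name for name, _ in rows] == ["a", "c", "e"]
```

The unpadded "4" sorts after "10" because the comparison is character by character, which is why the example query writes the literal as '03' and why values need consistent formatting.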
High Availability
High availability is an important benefit of using SimpleDB.
There are many types of failures that can occur with a database solution that
will affect the availability of your application. When you run your own
database servers, there is a spectrum of different configurations you can
employ.
To help quantify the availability benefits that you get
automatically with SimpleDB, let’s consider how you might achieve the same
results using replication for your own database servers. At the easier end of
the spectrum is a master-slave database replication scheme, where the master
database accepts client updates and a second database acts as a slave and pulls
all the updates from the master. This eliminates the single point of failure.
If the master goes down, the slave can take over. Managing these failures (when
not using SimpleDB) requires some additional work for swapping IP addresses or
domain name entries, but it is not very difficult.
Moving toward the more difficult end of the self-managed replication
spectrum allows you to maintain availability during failure that involves more than
a single server. There is more work to be done if you are going to handle two servers
going down in a short period, or a server problem and a network outage, or a problem
that affects the whole data center.
Creating a database solution that maintains uptime during these
more severe failures requires a certain level of expertise. It can be
simplified with cloud computing services like EC2 that make it easy to start
and manage servers in different geographical locations. However, when there are
many moving parts, the task remains time-consuming. It can also be expensive.
When you use SimpleDB, you get high availability with your data
replicated to different geographic locations automatically. You do not need to
do any extra work or become an expert on high availability or the specifics of
replication techniques for one vendor’s database product. This is a huge
benefit not because that level of expertise is not worth attaining, but because
there is a whole class of applications that previously could not justify that
effort.
Database Consistency
One of the consequences of replicating database updates
across multiple servers and data centers is the need to decide what kind of
consistency guarantees will be maintained. A database running on a single
server can easily maintain strong consistency. With strong consistency, after
an update occurs, every subsequent database access by every client reflects the
change and the previous state of the database is never seen.
This can be a problem for a database cluster if the purpose of
the cluster is to improve availability. If there is a master database
replicating updates to slave databases, strong consistency requires the slaves
to accept the update at the same time as the master. All access to the database
would then be strongly consistent. However, in the case of a problem preventing
communication between the master and a slave, the master would be unable to
accept updates because doing so out of sync with a slave would break the
consistency guarantee. If the database rejects updates during even simple
problem scenarios, it defeats the purpose of improving availability. In practice, replication is
often not done this way. A common solution to this problem is to allow only the
master database to accept updates and do so without direct contact with any
slave databases. After the master commits each transaction, slaves are sent the
update in near real-time. This amounts to a relaxing of the consistency
guarantee. If clients only connect to the
slave when the master goes down, then the weakened consistency only applies to
this scenario.
SimpleDB sports the option of either eventual consistency or
strong consistency for each read request. With eventual consistency, when you
submit an update to SimpleDB, the database server handling your request will
forward the update to the other database servers where that domain is
replicated. The full update of all replicas does not happen before your update
request returns. The replication continues in the background while other
requests are handled. The period of time it takes for all replicas to be
updated is called the eventual consistency window. The eventual consistency
window is usually small. AWS does not offer any guarantees about this window,
but it is frequently less than one second.
A couple of things can make the consistency window larger. One is
a high request load. If the servers hosting a given SimpleDB domain are under
heavy load, the time it takes for full replication is increased. Additionally, a
network or server failure can block replication until it is resolved. Consider
a network outage between data centers hosting your data. If the SimpleDB
load-balancer is able to successfully route your requests to both data centers,
your updates will be accepted at both locations. However, replication will fail
between the two locations. The data you fetch from one will not be consistent
with updates you have applied to the other. Once the problem is fixed, SimpleDB
will complete the replication automatically.
Using a consistent read eliminates the consistency window for that
request. The results of a consistent read will reflect all previous writes. In
the normal case, a consistent read is no slower than an eventually consistent
read. However, it is possible for consistent read requests to display higher
latency and lower bandwidth on occasion.
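The difference between the two read modes can be illustrated with a toy model in which one replica accepts a write immediately and the others catch up in a later background step. This is a sketch for intuition only. The text does not describe how SimpleDB implements consistent reads, so draining the replication queue before reading is a stand-in, and all names here are hypothetical.

```python
from collections import deque


class ReplicatedDomain:
    """Toy model: one replica accepts writes, the others catch up later."""

    def __init__(self, replica_count: int = 3):
        self.replicas = [dict() for _ in range(replica_count)]
        self.pending = deque()  # updates not yet applied to every replica

    def put(self, key: str, value: str) -> None:
        """Apply the write to replica 0 and queue it for the others."""
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def replicate(self) -> None:
        """Background step: apply all queued updates to the other replicas."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica[key] = value

    def get(self, key: str, replica: int = 0, consistent: bool = False):
        """Read from one replica; a consistent read first drains the queue."""
        if consistent:
            self.replicate()
        return self.replicas[replica].get(key)


domain = ReplicatedDomain()
domain.put("color", "red")
# Inside the consistency window, another replica has not seen the write yet.
assert domain.get("color", replica=1) is None
# A consistent read reflects all previous writes.
assert domain.get("color", replica=1, consistent=True) == "red"
```

In this model the consistent read does extra work before answering, which gives some intuition for why such reads can occasionally show higher latency.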