Database Journal

Posted Aug 18, 2010

Introducing Amazon SimpleDB - Page 2

By DatabaseJournal.com Staff

Sizing Up the SimpleDB Feature Set

The SimpleDB API exposes a limited set of features. Here is a list of what you get:

  • You can create named domains within your account. At the time of this writing, the initial allocation allows you to create up to 100 domains. You can request a larger allocation on the AWS website.
  • You can delete an existing domain at any time without first deleting the data stored in it.
  • You can store a data item for the first time or for subsequent updates using a call to PutAttributes. When you issue an update, you do not need to pass the full item; you can pass just the attributes that have changed.
  • There is a batch call that allows you to put up to 25 items at once.
  • You can retrieve the data with a call to GetAttributes.
  • You can query for items based on criteria on multiple attributes of an item.
  • You can store any type of data. SimpleDB treats it all as string data, and you are free to format it as you choose.
  • You can store different types of items in the same domain, and items of the same type can vary in which attributes have values.
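The data model described in the list above can be pictured with a small in-memory stand-in. This is only a toy sketch, not an AWS client: the class name `ToyDomain`, its method signatures, and the `replace` flag are assumptions made for illustration, loosely modeled on the PutAttributes and GetAttributes calls named above.

```python
from collections import defaultdict


class ToyDomain:
    """In-memory stand-in for one domain (illustration only, not an AWS client)."""

    def __init__(self):
        # item name -> attribute name -> set of string values
        self.items = defaultdict(lambda: defaultdict(set))

    def put_attributes(self, item_name, attributes, replace=False):
        """Create or update an item; only the attributes passed in are touched."""
        item = self.items[item_name]
        for name, value in attributes.items():
            if replace:
                item[name] = set()      # drop the old values for this attribute
            item[name].add(str(value))  # everything is stored as string data

    def get_attributes(self, item_name):
        """Return the item's attributes as a dict of sorted value lists."""
        stored = self.items.get(item_name, {})
        return {name: sorted(values) for name, values in stored.items()}


domain = ToyDomain()
# First write of an item.
domain.put_attributes("item-001", {"title": "Widget", "price": "00000269.94"})
# Later update: pass only the attribute that changed.
domain.put_attributes("item-001", {"price": "00000249.94"}, replace=True)
# A different kind of item, with different attributes, in the same domain.
domain.put_attributes("user-042", {"email": "a@example.com"})
```

The second call changes only `price`; `title` is left as it was, which mirrors the partial-update behavior described above.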

Benefits of Using SimpleDB

When you use SimpleDB, you give up some features you might otherwise have, but as a trade-off, you gain some important benefits, as follows:

  • Availability—When you store your data in SimpleDB, it is automatically replicated across multiple storage nodes and across multiple data centers in the same region.
  • Simplicity—There are not a lot of knobs or dials, and there are not any configuration parameters. This makes it a lot harder to shoot yourself in the foot.
  • Scalability—The service is designed for scalability and concurrent access.
  • Flexibility—Store the data you need to store now, and if the requirements change, store it differently without changing the database.
  • Low latency within the same region—Access to SimpleDB from an EC2 instance in the same region has the latency of a typical LAN.
  • Low maintenance—Most of the administrative burden is transferred to Amazon. They maintain the hardware and the database software.

Database Features SimpleDB Doesn’t Have

There are a number of common database features noticeably absent from Amazon SimpleDB. Programs based on relational database products typically rely on these features. You should be aware of what you will not find in SimpleDB, as follows:

  • Full SQL support—A query language similar to SQL is supported for queries only. However, it only applies to “select” statements, and there are some syntax differences and other limitations.
  • Joins—You can issue queries, but there are no foreign keys and no joins.
  • Auto-incrementing primary keys—You have to create your own primary keys in the form of an item name.
  • Transactions—There are no explicit transaction boundaries that you can mark or isolation levels that you can define. There is no notion of a commit or a rollback. There is some implicit support for atomic writes, but it only applies within the scope of each individual item being written.
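Because there are no auto-incrementing keys, the application has to supply each item name itself. One possible approach, shown here only as a sketch, is to generate a random UUID on the client. The function name `new_item_name` and the type prefix are hypothetical conventions, not part of SimpleDB.

```python
import uuid


def new_item_name(prefix):
    """Build a unique item name on the client side.

    The prefix is a hypothetical convention for marking the item type;
    the random UUID makes collisions between writers extremely unlikely
    without any coordination with the server.
    """
    return f"{prefix}-{uuid.uuid4().hex}"


first = new_item_name("order")
second = new_item_name("order")
```

A natural key that already exists in your data, such as a SKU or a user name, can serve the same purpose when it is guaranteed to be unique.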

Higher-Level Framework Functionality

The simplicity of what SimpleDB offers on the server side is matched by the simplicity of the officially supported SimpleDB clients that AWS provides. There is a one-to-one mapping of service features to client calls. A lot of functionality can be built atop the basic SimpleDB primitives, and a number of third-party SimpleDB clients have already begun to include these advanced features. Popular persistence frameworks used as an abstraction layer above relational databases are prime candidates for this.

Some features normally included within the database server can be written into SimpleDB clients for automatic handling. Third-party client software is constantly improving, so some of the following features may already be present in your client, or you may have to write them yourself:

  • Data formatting—Integers, floats, and dates require special formatting in some cases.
  • Object mapping—It can be convenient to map programming language objects to SimpleDB attributes.
  • Sharding—The domain is the basic unit of horizontal scalability in SimpleDB. However, there is no explicit support for automatically distributing data across domains.
  • Cache integration—Caching is an important aspect of many applications, and caching popular data objects is a well-understood optimization. Configurable caching that is well integrated with a SimpleDB client is an important feature.
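As an illustration of the sharding item above, a client can pick a domain for each item by hashing the item name. This is a minimal sketch under stated assumptions: the base name `products`, the shard count of 4, and the function names are all hypothetical, and the approach shown (hash modulo shard count) is just one simple strategy.

```python
import hashlib


def domain_for(item_name, base_name="products", shard_count=4):
    """Map an item name to one of shard_count domains.

    A cryptographic digest is used instead of Python's built-in hash(),
    because hash() on strings is randomized per process and would send
    the same item to different domains on different runs.
    """
    digest = hashlib.sha256(item_name.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_name}_{shard}"


def all_domains(base_name="products", shard_count=4):
    """List every shard domain, for queries that must visit all of them."""
    return [f"{base_name}_{i}" for i in range(shard_count)]
```

Reads and writes for a single item go to `domain_for(item_name)`. A query across all items has to be issued once per domain in `all_domains()` and merged on the client. Changing the shard count later moves items between domains, so it is worth choosing with some headroom.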

Service Limits

There are quite a few limitations on what you are allowed to do with SimpleDB. Most of these are size and quantity restrictions. There is an underlying philosophy that small and quickly serviced units of work provide the greatest opportunity for load balancing and maintaining uniform service levels. AWS maintains a current listing of the service limitations within the latest online SimpleDB Developer Guide at the AWS website. At the time of this writing, the limits are as follows:

  • Max storage per domain: 10GB
  • Max attribute values per domain: 1 billion
  • Initial max domains per account: 100
  • Max attribute values per item: 256
  • Max length of item name, attribute name, or value: 1024 bytes
  • Max query execution time: 5 seconds
  • Max query results: 2500
  • Max query response size: 1MB
  • Max comparisons per query: 20
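Some of these limits can be checked on the client before a write is attempted. The sketch below covers only two of them, the 256 attribute values per item and the 1024-byte length limit. The function name `check_item` is hypothetical, and measuring length as UTF-8 bytes is an assumption of this example.

```python
MAX_VALUES_PER_ITEM = 256        # "Max attribute values per item" above
MAX_NAME_OR_VALUE_BYTES = 1024   # "Max length of item name, attribute name, or value"


def byte_length(text):
    # Assumption: lengths are measured in UTF-8 bytes.
    return len(text.encode("utf-8"))


def check_item(item_name, attributes):
    """Return a list of limit violations for one item.

    attributes is a list of (name, value) string pairs, so an attribute
    with several values simply appears several times.
    """
    problems = []
    if byte_length(item_name) > MAX_NAME_OR_VALUE_BYTES:
        problems.append("item name is longer than 1024 bytes")
    if len(attributes) > MAX_VALUES_PER_ITEM:
        problems.append("item has more than 256 attribute values")
    for name, value in attributes:
        if byte_length(name) > MAX_NAME_OR_VALUE_BYTES:
            problems.append(f"attribute name too long: {name[:20]}...")
        if byte_length(value) > MAX_NAME_OR_VALUE_BYTES:
            problems.append(f"value too long for attribute {name}")
    return problems
```

An empty list means the item passes these two checks. The other limits, such as storage per domain or query execution time, cannot be verified this way.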

These limits may seem restrictive when compared to the unlimited nature of data sizes you can store in other database offerings. However, there are two things to keep in mind about these limits. First, SimpleDB is not a general-purpose data store suitable for everything. It is specifically designed for storing small chunks of data. For larger data objects that you want to store in the cloud, you are advised to use Amazon S3. Second, consider the steps that need to be taken with a relational database at higher loads when performance begins to degrade. Typical recommendations often include offloading processing from the database, reducing long-running queries, and applying selective de-normalization of the data. The SimpleDB limits enforce similar discipline from the start, and they are what help enable efficient and automatic background replication and high concurrency and availability. Some of these limits can be worked around to a degree, but no workarounds exist for you to make SimpleDB universally appropriate for all data storage needs.

Abandoning the Relational Model?

There have been many recent products and services offering data storage but rejecting the relational model. This trend has been dubbed by some as the NoSQL movement. There is a fair amount of enthusiasm both for and against this trend. A few of those in the “against” column argue that databases without schemas, type checking, normalization, and so on are throwing away 40 years of database progress. Likewise, some proponents are quick to dispense the hype about how a given NoSQL solution will solve your problems. The aim of this section is to present a case for the value of a service like SimpleDB that addresses legitimate criticism and avoids hype and exaggeration.

A Database Without a Schema

One of the primary areas of contention around SimpleDB and other NoSQL solutions centers on the lack of a database schema. Database schemas turn out to be very important in the relational model. The formalism of predefining your data model into a schema provides a number of specific benefits, but it also imposes restrictions.

SimpleDB has no notion of a schema at all. Many of the structures defined in a typical database schema do not even exist in SimpleDB. This includes things such as stored procedures, triggers, relationships, and views. Other elements of a database schema like fields and types do exist in SimpleDB but are flexible and are not enforced on the server. Still other features, like indexes, require no formal definition because the SimpleDB service creates and manages them behind the scenes.

However, the lack of a schema requirement in SimpleDB does not prevent you from gaining the benefits of a schema. You can create your own schema for whatever portion of your data model is appropriate. This allows you to cherry-pick the benefits that are helpful to your application without the unneeded restrictions.

One of the most important things you gain from codifying your data layout is a separation between it and the application. This is an enabling feature for tools and application plug-ins. Third-party tools can query your data, convert your data from one format to another, and analyze and report on your data based solely on the schema definition. The alternative is less attractive. Tools and extensions are more limited in what they can do without knowledge of the formats. For example, you cannot compute the sum of values in a numeric column if you do not know the format of that column. In the degenerate case, developers must search through your source code to infer data types.

In SimpleDB, many of the most common database features are not available. Query, however, is one important feature that is present and has some bearing on your data formatting. Because all the data you store in SimpleDB is variable-length character data, you must apply padding to numeric data in order for queries to work properly. For example, if you want to store an attribute named “price” with a value of “269.94,” you must first add leading zeros to make it “00000269.94.” This is required because greater-than and less-than comparisons within SimpleDB compare each character from left to right. Padding with zeros allows you to line up the decimal point so the comparisons will be correct for all possible values of that attribute. Relational database products handle this for you behind the scenes when you declare a column as a numeric type such as int.
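The effect of zero padding is easy to see with ordinary string comparison. The sketch below assumes the same layout as the example above (11 characters, two decimal places). The function name `pad_price` and the domain name `products` are hypothetical, and negative numbers are deliberately rejected because they need a separate offset scheme.

```python
def pad_price(value, width=11, decimals=2):
    """Format a non-negative number as a fixed-width, zero-padded string."""
    if value < 0:
        raise ValueError("negative values need a separate offset scheme")
    text = f"{value:0{width}.{decimals}f}"
    if len(text) > width:
        raise ValueError("value does not fit the padded width")
    return text


# Unpadded strings compare character by character, so "9.50" sorts
# after "10.00" because "9" is greater than "1".
unpadded_is_correct = "9.50" < "10.00"

# Padded strings line up the decimal point, so the comparison is correct.
padded_is_correct = pad_price(9.5) < pad_price(10.0)

# The same padding has to be applied to any literal used in a query.
query = f"select * from products where price > '{pad_price(100.0)}'"
```

Here `pad_price(269.94)` produces the `00000269.94` value from the text, and the literal in `query` is padded the same way so that it compares correctly against the stored values.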

This is a case in SimpleDB where a schema is beneficial. The code that initially imports records into SimpleDB, the code that writes records as your app runs, and any code that uses a numeric attribute in a query all need to use the exact same format. Explicitly storing the schema externally is a much less error-prone approach than implicitly defining the format in duplicated code across various modules.
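One way to keep the format in a single place is a small schema table that every module imports. This is a minimal sketch: the attribute names, widths, and the `encode_item` helper are hypothetical choices made for illustration.

```python
from datetime import date

# Hypothetical external schema: one formatter per attribute, shared by
# the import script, the application code, and anything that builds queries.
SCHEMA = {
    "price": lambda v: f"{v:011.2f}",     # zero-padded, two decimal places
    "quantity": lambda v: f"{v:06d}",     # zero-padded integer
    "released": lambda v: v.isoformat(),  # ISO dates sort correctly as strings
}


def encode_item(raw):
    """Encode values through the schema; attributes it does not list
    are stored as plain strings."""
    return {name: SCHEMA.get(name, str)(value) for name, value in raw.items()}


encoded = encode_item({
    "price": 269.94,
    "quantity": 12,
    "released": date(2010, 8, 18),
    "title": "Widget",
})
```

If the price format ever needs another digit, the change is made once in `SCHEMA` rather than in every module that touches the attribute. Existing stored values would still need to be rewritten in the new format.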

Another benefit of the predefined schema in the relational model is that it forces you to think through the data relationships and make unambiguous decisions about your data layout. Sometimes, however, the data is simple, there are no relationships, and creating a data model is overkill. Sometimes you may still be in the process of defining the data model. SimpleDB can be used as part of the prototyping process, enabling you to evolve your schema dynamically as issues surface that may not otherwise have become known so quickly. You may be migrating from a different database with an existing data model. The important thing to remember is that SimpleDB is simple by design. It can be useful in a variety of situations and does not prevent you from creating your own schema external to SimpleDB.

Areas Where Relational Databases Struggle

Relational databases have been around for some time. There are many robust and mature products available. Modern database products offer a multitude of features and a host of configuration options.

One area where difficulty arises is with database features that you do not need or that you should not use for a particular application. Applications that have simple data storage requirements do not benefit from the myriad of available options. In fact, the options can be detrimental in a couple of different ways. If you need to learn the intricacies of a particular database product before you can make good use of it, the time spent learning takes away from time you could have spent on your application. Knowledge of how database products work is good to have. It would be hard to argue that you wasted your time by learning it because that information could serve you well far into the future. Similarly, if there is a much simpler solution that meets your needs, you could choose that instead. If you had no immediate requirement to gain product-specific database expertise, it would be hard to insist that you made the wrong choice. It is a tough sell to argue that the more time-consuming, yet educational, route is always better than the simple and direct route. This is a challenge faced by databases today, when the simple problems are not met with simple solutions.

Another pain point with relational databases is horizontal scaling. It is easy to scale a database vertically by beefing up your server because memory and disk drives are inexpensive. However, scaling a database across multiple servers can be extremely difficult. There is a whole spectrum of options available for horizontal scaling that includes basic master-slave replication as well as complicated sharding strategies. These solutions each require a different, and sometimes considerable, amount of expertise. Nevertheless, they all have one thing in common when compared to vertical scaling solutions. On top of the implementation difficulty, each additional server results in an additional increase in ongoing maintenance responsibility. Moreover, it is not merely the additional server maintenance of having more servers. I am referring to the actual database administration tasks of managing additional replicas, backups, and log shipping. It also includes the tasks of rolling out schema changes and new indexes to all servers in the cluster.

If you are in a situation where you want a simple database solution or you want horizontal scaling, SimpleDB is definitely a service to consider. However, you may need to be prepared to defend your decision.

Scalability Isn’t Your Problem

Around every corner, you can find people who will challenge your efforts to scale horizontally. Beyond the cost and difficulty, there is a degree of resistance to products and services that seek to solve these problems.

The typical, and now clichéd, advice tends to be that scalability is not your problem, and trying to solve scalability at the outset is a case of premature optimization. This is followed by a discussion of how many daily page views a single high-performance database server can support. Finally, it ends by noting that it is really just a problem for when you reach the scale of Google or Amazon.

The premise of the argument is actually solid, although not applicable to all situations. The premise is that when you are building a site or service that nobody has heard of yet, you are more concerned about handling loads of people than about making the site remarkable. It is good advice for these situations. Moreover, it is especially timely considering that there is a small but religious segment of Internet commentators who eagerly chime, “X doesn’t scale,” where X is any alternative to the solution the commenter uses. Among programmers, there is a general preoccupation with performance optimization that seems somewhat out of balance.

The fact is that for many projects, scalability really is not your problem, but availability can be. Distributing your data store across servers from the outset is not a premature optimization when you can quantify the cost of downtime. If a couple hours of downtime will have an impact on your business, then availability is something worth thinking about. For the IT department delivering a mission-critical application, availability is important. Even if only 20 users will use it during normal business hours, when it provides a competitive advantage, it is important to maintain availability through expected outages. When you have a product launch, and your credibility is at stake as much as your revenue, you are not putting the cart before the horse when you protect yourself against hardware failures.

There are many situations where availability is an important system quality. Look at how common it is for a multi-server web cluster to host one website. Before you can add a second web server, you must first solve a small set of known problems. User sessions have to be managed properly, and load balancing has to be in place and must route around unresponsive servers. However, web server clusters are useful for more than high-traffic load handling. They are also beneficial because we know that hardware will fail, and we want to maintain service during the failure. We can add another web server because it is neither costly nor difficult, and it improves the availability. With the advent of systems designed to provide higher database availability that are neither costly nor hard to use, availability becomes worth pursuing for less-critical projects.

Avoiding the SimpleDB Hype

There are many different application scenarios where SimpleDB is an interesting option. That said, some people have overstated the benefits of using SimpleDB specifically and hosted NoSQL databases in general. The reasoning seems to be that services running on the infrastructure of companies like Amazon, Google, or Microsoft will undoubtedly have nearly unlimited automatic scalability. Although there is nothing wrong with enthusiasm for products and services that you like, it is good to base that enthusiasm on reality.

Do not be fooled into thinking that any of these new databases is going to be a panacea. Make sure you educate yourself about the pros and cons of each solution as you evaluate it. The majority of services in this space have a free usage tier, and all the open-source alternatives are completely free to use. Take advantage of it, and try them out for yourself. We live in an amazing time in history where the quantity of information available at our fingertips is unprecedented. Access to web-based services and open-source projects is a huge opportunity. The tragedy is that in a time when it has never been easier to gain personal experience with new technology, all too often we are tempted to adopt the opinions of others instead of taking the time to form our own opinions. Do not believe the hype—find out for yourself.

Putting the DBA Out of Work

One of the stated goals of SimpleDB is allowing customers to outsource the time and effort associated with managing a web-scale database. Managing the database is traditionally the world of the DBA. Some people have assumed that advocating the use of SimpleDB amounts to advocating a world where the DBA diminishes in importance. However, this is not the case at all.

One of the things that have come about from the widespread popularity of EC2 has been a change in the role of system administrators. What we have found is that managing EC2 virtual instances is less work than managing a physical server instance. However, the result has not been a rash of system administrator firings. Instead, the result has been that system administrators are able to become more productive by managing larger numbers of servers than they otherwise could. The ease of acquisition and the low cost to acquire and release the computing power have led, in many cases, to a greater and more dynamic use of the servers. In other words, organizations are using more server instances because the various levels of the organization can handle it, from a cost, risk, and labor standpoint.

SimpleDB and its cohorts seem to facilitate a similar change but on a smaller scale. First, SimpleDB has less general applicability than EC2. It is a suitable solution for a much smaller set of problems. AWS fully advocates the use of existing relational database products. SimpleDB is an additional option, not a replacement. Moreover, SimpleDB finds good usage in some areas where a relational database might not normally be used, as in the case of storing web user session data. In addition, for those projects that choose to use SimpleDB instead of, or along with, a relational database, it does not mean that there is no role for the DBA. As with EC2, some tasks remain, and the reduced workload can result in a greater capacity for IT departments to create solutions.

Dodging Copies of C.J. Date

There are database purists who wholeheartedly try to dissuade people from using any type of non-relational database on principle alone. Not only that, but they also go to great lengths to advocate the proper use of relational databases and lament the fact that no current database products correctly implement the relational model. Having found the one true data storage paradigm, they believe that the relational model is “right” and is the only one that will last. The purists are not wrong in their appreciation for the relational model and for SQL. The relational model is the cornerstone of the database field, and more than that, an invaluable contribution to the world of computing. It is one of the two best things to come out of 1969. Invented by a mathematician and considered a branch of mathematics itself, it has a solid theoretical rigor underlying its principles. Even though it is not a complete or finished branch, the work to date has been sound.

The world of mathematics and academic research is an interesting place. When you have spent large quantities of your life and career there, you are highly qualified to make authoritative comments on topics like correctness and provability. Nevertheless, being either a relational model expert or merely someone who holds the model in high regard does not say anything about your ability to deliver value to users. It is clearly true that modeling your data “correctly” can provide measurable benefits and that making mistakes in your model can lead to certain classes of problems. However, you can still provide significant user value with a flawed model, and correctness is no guarantee of success.

It is like perfectly generated XHTML that always validates. It is like programming with a functional style (in any programming language) that lets you prove your programs are correct. It is like maintaining unit tests that provide 100% test coverage for every line of code you write. There is nothing inherently bad you can say about these things. In fact, there are plenty of good things to say about them. The problem is not a technical problem—it is a people problem. The problem is when people become hyper-focused on narrow technological aspects to the exclusion of the broader issues of the application’s purpose.

The people conducting database research and the ones who take the time to help educate the computing industry deserve our respect. If you have a degree in computer science, chances are you studied C.J. Date’s work in your database class. Among professional programmers, there is no good excuse for not knowing data and relational fundamentals. However, the person in the next row of cubicles who is only contributing condescending criticism to your project is no C.J. Date. In addition, the user with 50 times your stackoverflow.com reputation who ridicules the premise of your questions without providing useful suggestions is no E.F. Codd. Understanding the theory is of great importance. Knowing how to deliver value to your users is of greater importance. In the end, avoid vociferous ignorance and don’t let anyone kick copies of C.J. Date in your face.
