Amazon SimpleDB

Amazon SimpleDB - Technical Overview

Structured storage was one of the missing pieces in Amazon's cloud services jigsaw puzzle (the other has to be the ability to host a site completely on EC2 without using dynamic DNS hacks) and Amazon is plugging that hole today with the launch of SimpleDB.

There is an Information Week article and the official site is at http://aws.amazon.com/simpledb.

Here are the highlights as I see them:

  • Structured storage service in the same model as EC2 and S3.
  • Launching in a limited beta in a few weeks (I've signed up on the waiting list)
  • $0.14 per machine hour, $0.10 per GB of data transferred in. The data 'out' pricing is a bit interesting - $0.18 per GB for the first 10 TB in a month, $0.16 per GB for the next 40 TB and $0.13 per GB for all data after that.
  • Apart from this, there is a cost for storing data - $1.50 per GB per month. Note that you can store data on S3 at cheaper rates and transfer it to SimpleDB for free.
  • 10 GB maximum per domain (think 'table') and a limit of 100 such domains. Amazon indicates that these restrictions will be loosened soon.
  • The data model is similar to that of a spreadsheet, except that each 'cell' can have multiple values. The screenshot below (stolen from their documentation) shows off the model very well. The individual worksheets are 'domains' in SimpleDB terminology, rows are 'items' and columns are 'attributes'. However, each item can have more than one value for an attribute (a car could be both 'blue' and 'black') and attributes are optional. In this respect, the model is similar to Google's BigTable. The two are pretty dissimilar in other respects - BigTable stores previous versions of all values, while SimpleDB supports a limited subset of SQL.
image_thumb.png

There were a bunch of things which caught my eye

Pricing

The pricing for storing data on SimpleDB is much higher than on S3. Storing 1 GB of data on S3 for a month is going to cost you $0.15, while the same on SimpleDB is going to set you back $1.50. This points to Amazon using pretty different hardware for the two services.

I'm also fascinated by the idea of 'box usage'. For every query, Amazon returns the amount of 'machine time' used to execute that query. Since these queries are almost surely getting distributed over a variety of nodes, I'm curious to know how this 'machine time' is calculated.
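To make these numbers concrete, here is a rough back-of-the-envelope estimator using only the prices quoted in this post. It is a sketch under some assumptions of mine: the 'out' tiers are per-GB rates, 1 TB is counted as 1024 GB, BoxUsage values are reported in machine hours, and all the function names are made up for illustration.

```python
# Prices quoted in this post (launch-time figures; they may well change).
MACHINE_HOUR_USD = 0.14          # per machine hour of 'box usage'
TRANSFER_IN_USD_PER_GB = 0.10
STORAGE_USD_PER_GB_MONTH = 1.50
GB_PER_TB = 1024                 # assumption: the post does not say 1000 vs 1024

# (tier size in GB, price per GB); None means "everything after that".
# Assumption: the tier prices are per GB transferred out.
TRANSFER_OUT_TIERS = [
    (10 * GB_PER_TB, 0.18),
    (40 * GB_PER_TB, 0.16),
    (None, 0.13),
]


def transfer_out_cost(gb_out):
    """Walk the tiers, charging each slice of traffic at its own rate."""
    remaining, total = gb_out, 0.0
    for tier_size, price in TRANSFER_OUT_TIERS:
        if remaining <= 0:
            break
        billed = remaining if tier_size is None else min(remaining, tier_size)
        total += billed * price
        remaining -= billed
    return total


def box_usage_cost(box_usage_values):
    """Sum per-request BoxUsage values (assumed to be machine hours)."""
    return sum(box_usage_values) * MACHINE_HOUR_USD


def monthly_cost(gb_stored, gb_in, gb_out, box_usage_values):
    """Storage + transfer in + tiered transfer out + machine time."""
    return (gb_stored * STORAGE_USD_PER_GB_MONTH
            + gb_in * TRANSFER_IN_USD_PER_GB
            + transfer_out_cost(gb_out)
            + box_usage_cost(box_usage_values))


if __name__ == "__main__":
    # A million queries, each as cheap as the one in the sample response
    # from the docs (BoxUsage 0.0000219907).
    usage = [0.0000219907] * 1_000_000
    print(round(monthly_cost(gb_stored=5, gb_in=20, gb_out=100,
                             box_usage_values=usage), 2))
```

Plugging in your own numbers should show quickly whether storage or machine time dominates your bill.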

Data Model and APIs

I love the data model for SimpleDB. I've never been a fan of relational tables and SQL and prefer data structures where everything's just one huge hashtable. Though SimpleDB's data model is not exactly a hashtable, it is pretty close. There are several things to like here if your programming loyalty lies with the dynamic side:

  • Optional attributes
  • No types - you can stick any value into any attribute.
  • Multiple values per attribute (think lists or tuples)

All these things are possible using standard databases but would require quite a bit of work. And changing table schemas once you've piled up a decent amount of data is definitely not fun.
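To show what "almost a hashtable" means, here is a toy in-memory model of a domain. This is only a sketch of the data model as I understand it, not the SimpleDB API: the `Domain` class, its methods and the sample attribute names are all invented for illustration.

```python
from collections import defaultdict


class Domain:
    """Toy stand-in for a SimpleDB domain (hypothetical, not the real API)."""

    def __init__(self):
        # item name -> attribute name -> set of string values
        self.items = defaultdict(lambda: defaultdict(set))

    def put(self, item_name, attribute, value):
        # No schema and no types: every value is stored as a string.
        self.items[item_name][attribute].add(str(value))

    def get(self, item_name):
        # Return plain copies so callers cannot mutate the store by accident.
        attrs = self.items.get(item_name, {})
        return {name: set(values) for name, values in attrs.items()}


if __name__ == "__main__":
    cars = Domain()
    cars.put("car1", "Color", "blue")
    cars.put("car1", "Color", "black")   # second value for the same attribute
    cars.put("car2", "Year", 2007)       # car2 has no Color at all
    print(cars.get("car1"))
    print(cars.get("car2"))
```

Every item is just a small hashtable of attribute names to sets of strings, which is why optional attributes and multiple values come for free.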

To program against the data, you have a choice of REST APIs (some very clean ones!) and SOAP APIs. Here's a sample REST request and response pair from the docs:

Sample Request

    https://sdb.amazonaws.com/
    ?Action=Query
    &AWSAccessKeyId=[valid access key id]
    &DomainName=MyDomain
    &MaxNumberOfItems=3
    &NextToken=[valid next token]
    &QueryExpression=%5B%27Color%27%3D%27Blue%27%5D
    &SignatureVersion=1
    &Timestamp=2007-06-25T15%3A03%3A09-07%3A00
    &Version=2007-11-07
    &Signature=2wVXB1x0NSWWETwLylZPVP%2FtqXQ%3D

Sample Response

    <QueryResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07">
    <QueryResult>
    <ItemName>eID001</ItemName>
    <ItemName>eID002</ItemName>
    <ItemName>eID003</ItemName>
    </QueryResult>
    <ResponseMetadata>
    <RequestId>c74ef8c8-77ff-4d5e-b60b-097c77c1c266</RequestId>
    <BoxUsage>0.0000219907</BoxUsage>
    </ResponseMetadata>
    </QueryResponse>
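Here is a small sketch of handling both halves of that exchange from Python using only the standard library. It percent-encodes the query expression the way the sample request does and pulls the item names and BoxUsage out of the sample response. It does not sign or send anything; computing the Signature parameter is left out, and the function names are mine.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

# Namespace taken from the sample response above.
NS = {"sdb": "http://sdb.amazonaws.com/doc/2007-11-07"}

SAMPLE_RESPONSE = """\
<QueryResponse xmlns="http://sdb.amazonaws.com/doc/2007-11-07">
<QueryResult>
<ItemName>eID001</ItemName>
<ItemName>eID002</ItemName>
<ItemName>eID003</ItemName>
</QueryResult>
<ResponseMetadata>
<RequestId>c74ef8c8-77ff-4d5e-b60b-097c77c1c266</RequestId>
<BoxUsage>0.0000219907</BoxUsage>
</ResponseMetadata>
</QueryResponse>
"""


def encode_param(value):
    # safe="" forces characters such as [ ] ' = and : to be percent-encoded.
    return quote(value, safe="")


def parse_query_response(xml_text):
    """Return (list of item names, BoxUsage as a float)."""
    root = ET.fromstring(xml_text)
    names = [el.text for el in root.findall("sdb:QueryResult/sdb:ItemName", NS)]
    box_usage = float(root.findtext("sdb:ResponseMetadata/sdb:BoxUsage",
                                    namespaces=NS))
    return names, box_usage


if __name__ == "__main__":
    print(encode_param("['Color'='Blue']"))
    print(parse_query_response(SAMPLE_RESPONSE))
```

Encoding `['Color'='Blue']` this way reproduces the QueryExpression value in the sample request, which is a handy sanity check when a query is rejected.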

Eventual Consistency

This is going to surprise a lot of SimpleDB users (and probably cause a lot of hard-to-find bugs). A read issued immediately after a write may not reflect the latest update. SimpleDB relaxes the 'C' in ACID and doesn't promise that you'll instantly see your updates, because each write has to propagate to all the copies of your data. Amazon may not have a choice here (see the CAP Conjecture), but I don't think this is going to be popular with a lot of programmers.

Dare talks about this extensively and, as someone who writes code for a high-traffic website with lots of data flowing around, I shudder at the prospect of not being able to rely on data always being up to date. For SimpleDB developers, this is going to mean some extensive coding to make their apps resistant to stale data - something programmers traditionally never had to worry about.

Another possibility is that frameworks could take away the pain of doing this checking - this is definitely going to be an interesting place to watch.
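As a rough idea of what such a helper might look like, here is a sketch of a "re-read until fresh" loop. `LaggingStore` is a toy I made up to imitate a stale replica (it has nothing to do with how SimpleDB actually works), and `read_until` is a hypothetical helper, not part of any real library.

```python
import time


class LaggingStore:
    """Toy store whose reads lag a write by a fixed number of read calls."""

    def __init__(self, lag_reads=2):
        self.lag_reads = lag_reads
        self.visible = {}   # what readers currently see
        self.pending = {}   # key -> [stale reads left, new value]

    def write(self, key, value):
        self.pending[key] = [self.lag_reads, value]

    def read(self, key):
        if key in self.pending:
            self.pending[key][0] -= 1
            if self.pending[key][0] < 0:
                # The write has finally "propagated" to this reader.
                self.visible[key] = self.pending.pop(key)[1]
        return self.visible.get(key)


def read_until(read_fn, is_fresh, attempts=5, delay=0.01):
    """Re-read until is_fresh(value) holds, or give up after `attempts`."""
    for attempt in range(attempts):
        value = read_fn()
        if is_fresh(value):
            return value
        time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    raise TimeoutError("value still stale after %d attempts" % attempts)


if __name__ == "__main__":
    store = LaggingStore(lag_reads=2)
    store.write("color", "blue")
    print(store.read("color"))  # stale: the write is not visible yet
    print(read_until(lambda: store.read("color"), lambda v: v == "blue"))
```

The catch is that the caller has to know what "fresh" looks like (a version number, a timestamp, the value just written), which is exactly the extra bookkeeping that makes this painful.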

Query language

Unlike Google's BigTable, which eschews any and all forms of querying (probably in favor of a map-reduce type paradigm), SimpleDB supports a simple set of query operators: =, !=, <, >, <=, >=, STARTS-WITH, AND, OR, NOT, INTERSECTION and UNION. Also, queries can only execute for a maximum of 5 seconds.

There are several interesting properties here

  • Since the data is split across several nodes, these queries must be getting parallelized across several machines. There's been a ton of academic work on database query parallelism and you can be sure Amazon is making full use of it here.
  • The 'type-less' data model is going to lead to some peculiar query problems. For example, when comparing '13-5-2007' and '24-4-2006', do you compare them as a couple of dates or as a couple of strings? The docs talk about this problem and ask developers to convert all dates to ISO 8601 format. This is probably another example of Amazon choosing implementation simplicity even if it means that developers have to do a bit more work. This is not necessarily a bad thing - I've seen products (non-Microsoft and Microsoft) which try to do the opposite and they're not always successful.
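The date problem is easy to demonstrate. Here is a quick sketch, assuming the two example values are in day-month-year order and that everything is compared as plain strings; `to_iso` is just a helper name I picked.

```python
from datetime import datetime


def to_iso(day_month_year):
    """Convert 'D-M-YYYY' (the format of the examples above) to ISO 8601."""
    return datetime.strptime(day_month_year, "%d-%m-%Y").date().isoformat()


if __name__ == "__main__":
    later, earlier = "13-5-2007", "24-4-2006"

    # Compared as strings, '1...' sorts before '2...', so the 2007 date
    # wrongly looks like the smaller (earlier) one.
    print(later < earlier)

    # ISO 8601 puts the most significant field first and zero-pads, so
    # string order and chronological order agree.
    print(to_iso(later), to_iso(earlier))
    print(to_iso(later) > to_iso(earlier))
```

Once every date is stored in ISO 8601 form, a plain string comparison gives the same answer as a real date comparison.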

The ecosystem and the competition

Amazon has built a good ecosystem around their services. Their services all work together (the same AWS keys can be used, the same X.509 certificate system, etc.). The only thing missing is the ability to statically host a site completely on Amazon. What's even more surprising to me is that these are the sort of services that you would expect Google to release, given their much talked about infrastructure. As far as Microsoft goes, the only current service I can think of that comes close is Astoria (something which I should definitely spend more time digging into).

If Amazon does as good a job with this as they did with S3 and EC2, startups are going to love this service. Instead of having to shell out a ton of money up front and having to worry about dedicated hosting and colos, you now have a pay-as-you-go database in the sky.

Update #1

This post says that SimpleDB is built on Erlang. Interesting!

Update #2

See the TechCrunch post and the Techmeme discussion here.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License