Friday, February 08, 2008

With Amazon SimpleDB, the world's a string, and I'm feeling a little frayed



Everything in Amazon SimpleDB is stored as a string and compared as a string. The consequences of this are explored in some detail in the new article on Amazon's pages.

What's the reason for doing this? In another article the rationale is explained:
This provides application designers with the flexibility of not predefining different data types for their attributes, but rather changing them dynamically, as the application requires. A good example of why this flexibility is useful is “1984” – should it be treated as an integer, a title of a book, or a date? What happens if the application decides to store it as an integer, only to later realize that it was meant to be the title of a book and should be treated as a string? Amazon SimpleDB provides the flexibility of storing all data in one format, allowing developers to make data type decisions in the application layer without the data store enforcing constraints.
Some of the consequences are pretty stunning, especially when you're working with numbers.
The first step for representing the number ranges is to ensure that every number in the dataset is positive. This can be easily achieved by choosing an offset such that it is larger than the absolute value of the smallest expected negative number in your dataset. For example, if the smallest expected number in your dataset is -12,000, choosing offset = 100,000 may be safe.
Then they follow with an example:
  • Original dataset: {654, -12000, 3610, 0, -23}
  • Negative number offset: 100,000
  • Dataset with offset applied: {100654, 88000, 103610, 100000, 99977}
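In code, that first step is about as simple as it sounds. Here's a quick sketch (the 100,000 offset is just the value from their example; SimpleDB doesn't pick it for you):

  # A minimal sketch of the offset step.
  OFFSET = 100000

  def apply_offset(n):
      # Shift every value so the whole dataset becomes non-negative.
      return n + OFFSET

  dataset = [654, -12000, 3610, 0, -23]
  print([apply_offset(n) for n in dataset])
  # [100654, 88000, 103610, 100000, 99977]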
Then there is zero-padding:

Once all the numbers in the dataset are positive, there is another step necessary to ensure that values are properly represented for lexicographical comparisons. For example, the numbers 2 and 10, if converted directly into the strings "2" and "10", will not compare properly, as "10" comes before "2" in lexicographical order. However, if we zero-pad the number 2 so it is represented as "02", the comparison will execute as expected. But what happens if we decide to add the number 200 to the dataset at a later point? Zero-padding with a single "0" will not work anymore. Therefore, application designers may follow these steps for zero-padding:

  1. Determine the largest possible number in your dataset. Remember that if you are using offsetting to take care of the negative number conversions, you have to take that into account as well by ensuring that your largest possible number is determined after you have added the offset to all the values.
  2. Based on the largest possible number, determine the maximum number of digits before the decimal point that you may have in your dataset.
  3. Convert all the numbers in your dataset to the proper string representation by adding as many "0"s as necessary in front of each number, so that the total number of characters representing the portion of the number before the decimal point matches the maximum number of digits determined in Step 2 above.

Example:

  • Original dataset: {14.58, -12536.791, 20071109, 655378.34, -23}
  • Negative number offset: 100,000
  • Dataset with offset applied: {100014.58, 87463.209, 20171109, 755378.34, 99977}
  • Zero-padded dataset representation: {00100014.58, 00087463.209, 20171109, 00755378.34, 00099977}
  • Original query: ['attribute' > '500']
  • Converted query: ['attribute' > '00100500']
Simple, right?
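Here is roughly what that whole conversion looks like in code. This is just a sketch, and it bakes in the same two guesses as Amazon's example: an offset of 100,000 and a maximum of 8 digits before the decimal point.

  # Offset first, then zero-pad the integer part so strings sort like numbers.
  OFFSET = 100000
  INT_DIGITS = 8

  def encode(n):
      shifted = n + OFFSET
      int_part, dot, frac_part = str(shifted).partition(".")
      return int_part.zfill(INT_DIGITS) + dot + frac_part

  dataset = [14.58, -12536.791, 20071109, 655378.34, -23]
  print([encode(n) for n in dataset])
  # Floating-point quirks aside, that prints:
  # ['00100014.58', '00087463.209', '20171109', '00755378.34', '00099977']

  # And every query value has to go through the same conversion:
  # ['attribute' > '500']  becomes  ['attribute' > '00100500']  (that's encode(500))

Notice that encode() only works if both guesses hold for every number you will ever store or query.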

What if you don't happen to know what your maximum or minimum value will be for the lifetime of your application? Oh well, you gave it your best shot, right? And what are the consequences if you guess wrong? Some of your queries will quietly return wrong results. No biggie...

I'm not religiously attached to the relational data model and the theory behind it. I can see the value in having "loosely typed" data, which is what SimpleDB is doing. The argument is that this way you don't have to make assumptions about your data early on.

But the problem is, you actually do have to make assumptions. You have to know (or guess) how big and how small the numbers may get.

And because you are applying padding and shifting, you really are making a firm decision early on that the value is a number, and you really are never planning to treat it as anything else.

So the flexibility argument doesn't hold water for me. I start wondering if it's the constraints of their architecture, and not user flexibility, that is driving this model. It's as if SimpleDB is simple because it has a simple data model (key/value pairs stored as strings), not because it is simple to use.

I remember that at Sybase, one of its selling points was that with the relational model, and with stored procedures and triggers, you can centralize your business rules in the database. This means that regardless of what applications, old or new, happen to touch your data, they can't mess up the integrity of your data. This was, and is, very popular with larger companies trying to manage hundreds of applications, built by as many developers across the globe (experienced and otherwise), all touching their databases.

With SimpleDB, the responsibility for maintaining business rules and data integrity lies with the application tier. If you have lots of applications touching your data, that seems to me a recipe for confusion at the least and data corruption at the worst.

So, if I were building an architecture on top of SimpleDB, I would seriously consider putting a very solid, unencroachable layer between my application code and SimpleDB to enforce the structure and rules I need to keep the data clean and accurate.
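Just to make that concrete, here's the kind of thing I have in mind, sketched in Python. Everything in it is hypothetical: sdb stands in for whatever SimpleDB client library you end up using, and the method names on it are made up for illustration. The point is that the encoding rules and the validation live in exactly one class, not in every application that touches the data.

  class OrderStore:
      PRICE_OFFSET = 100000   # the up-front guesses live in one place
      PRICE_DIGITS = 8

      def __init__(self, sdb, domain):
          self._sdb = sdb
          self._domain = domain

      def _encode_price(self, price):
          # Validate, then apply the offset-and-pad conversion.
          if not isinstance(price, (int, float)) or price < -self.PRICE_OFFSET:
              raise ValueError("price is outside the range this schema can store")
          int_part, dot, frac = str(price + self.PRICE_OFFSET).partition(".")
          return int_part.zfill(self.PRICE_DIGITS) + dot + frac

      def put_order(self, order_id, price):
          # The only code that knows prices are stored as padded strings.
          self._sdb.put_attributes(self._domain, order_id,
                                   {"price": self._encode_price(price)})

      def orders_over(self, price):
          return self._sdb.query(self._domain,
                                 "['price' > '%s']" % self._encode_price(price))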

As I continue to blog, I am hearing and learning from readers that there are many, many perspectives and requirements, and usually the answer for any architectural choice is "it depends."

So, what do you think? How would you approach these aspects of SimpleDB? What's to like, what's not to like?
