Friday, July 16, 2010

Apache Cassandra: materialized queries (one table per query please)

Max Grinev has done an excellent job giving a quick overview of Cassandra here, and then has a followup blog post that discusses how to accomplish various SQL-style operations such as select, join, and group by.  Very helpful, thanks, Max!

Note that in every example, you don't write a query against the existing data.  Instead, you create a new column family to represent the query.  It's as if you write your query as a data structure.   It's basically a materialized view.   Max argues that this is OK for most applications: "However, typically in Web applications and enterprise OLTP applications queries are well known in advance, few in number, and do not change often."

OK, fair enough, but it is definitely something to keep in mind: if you want to support ad-hoc queries in your application, then Cassandra is probably not the right choice.

It also looks like it is your job as a user of Cassandra to update all your various views when data changes.   I agree with Maxim that denormalization and "push on change" rather than "pull on demand" is probably the right approach for highly intertwined systems like Twitter and Facebook.  But it would be nice if the system helped you maintain consistency across all these denormalized copies.

For example, CouchDB has a very similar model of materialized views, but their views are defined as map/reduce operations on the primary document(s), and CouchDB takes care of keeping the views in sync as you update your data.  That saves me a lot of time and worry.

Perhaps something like that exists in Cassandra and I don't know about it?

Anyway, I've been very interested in Cassandra, and Max's blogs were quite helpful in  helping me get my head around it.  I highly recommend reading them if you'd like to learn more about Cassandra, or what a distributed column-oriented key/value store looks like (Cassandra is modeled after Google's BigTable).

No comments: