

Evaluating the API of Cassandra (BigTable)

There has been plenty of talk lately about BigTable (Google), Cassandra (Facebook), and Hadoop (Yahoo). They all seem to expose a similar API. The Cassandra one is detailed on the wiki, but none of the descriptions I've read so far make sense. Here are my notes on Cassandra:

Firstly, there is the concept of a 'table name', but since there is no way to have two tables in one cluster, this provides nothing beyond a forwards compatibility hook for the FB guys.

The next concept is a 'Column Family': these have to be chosen when the cluster is started, and since there is no way to migrate data from one cluster to another, you'd better get them right the first time. Each column family has a name and is either a 'super column' or a 'normal column'.

The database provides a (very restrictive) set of operations. I'm using [ ] to mean 'list of' and ( , ) for tuple construction. Anyone exposed to the ML series of languages will be familiar with the notation:

For a Normal Column

When 'family' is defined to be a normal column in the server configuration:

insert (family,key1,key2,value,timestamp) → ()

get (family,key1,key2) → (value,timestamp)

get_range (family,key1) → [(value,timestamp)]
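To make the semantics concrete, here is a toy in-memory model of a normal column family in Python. This is my own sketch, not the actual Thrift client; all the names are mine.

```python
# Toy model of a normal column family (illustrative only; names and
# structure are my own, not Cassandra's actual client API).
family = {}  # key1 -> {key2 -> (value, timestamp)}

def insert(key1, key2, value, timestamp):
    # insert (family,key1,key2,value,timestamp) -> ()
    family.setdefault(key1, {})[key2] = (value, timestamp)

def get(key1, key2):
    # get (family,key1,key2) -> (value,timestamp)
    return family[key1][key2]

def get_range(key1):
    # get_range (family,key1) -> [(value,timestamp)] over all key2
    return [family[key1][k2] for k2 in sorted(family[key1])]
```

In other words, a normal column family is just a two-level map, and get_range enumerates the inner level for one outer key.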

For a Super Column

When 'family' is defined to be a super column in the server configuration:

insert (family,key1,key2,key3,value,timestamp) → ()

get (family,key1,key2) → [(key3,value,timestamp)]

get_range (family,key1) → [(key2,[(key3,value,timestamp)])]
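The super column case adds one more level of nesting. Again, a toy Python model of my own (not the real client), just to pin down what the operations return:

```python
# Toy model of a super column family (illustrative only).
# One extra level of nesting: key1 -> key2 -> key3 -> (value, timestamp).
family = {}

def insert(key1, key2, key3, value, timestamp):
    # insert (family,key1,key2,key3,value,timestamp) -> ()
    family.setdefault(key1, {}).setdefault(key2, {})[key3] = (value, timestamp)

def get(key1, key2):
    # get (family,key1,key2) -> [(key3,value,timestamp)]
    # Note: you get the whole sub-map back, not a single value.
    sub = family[key1][key2]
    return [(k3, v, ts) for k3, (v, ts) in sorted(sub.items())]

def get_range(key1):
    # get_range (family,key1) -> [(key2,[(key3,value,timestamp)])]
    return [(k2, get(key1, k2)) for k2 in sorted(family[key1])]
```

This makes the missing lookup below easy to see: get stops one level short of key3.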

Note that there isn't (at least in my understanding) an operation on super columns like

lookup: (family, key1, key2, key3) → (value, timestamp)

Note that you can only get back a list of (key3,value,timestamp) tuples from a super column. This makes super columns nearly the same as just storing a list of (key3,value) pairs in a normal column, except that new items can be added freely, and the timestamps are per value.
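To see why those two differences matter, compare adding one item under each scheme (again a toy model of my own, not real Cassandra calls):

```python
# Super column: each (key3, value) pair carries its own timestamp, and
# adding a new pair is a single blind write.
super_cf = {}  # key2 -> {key3 -> (value, timestamp)}

def super_add(key2, key3, value, ts):
    super_cf.setdefault(key2, {})[key3] = (value, ts)

# Normal column holding a serialized list: the whole list shares one
# timestamp, so appending means read, modify, and write back the entire list.
normal_cf = {}  # key2 -> ([(key3, value)], timestamp)

def list_add(key2, key3, value, ts):
    items, _ = normal_cf.get(key2, ([], 0))
    normal_cf[key2] = (items + [(key3, value)], ts)
```

With the super column, older entries keep their original timestamps; with the list-in-a-column, every append stamps the whole list anew.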


I don't understand remove operations yet:

# insert an item
# ./Cassandra-remote insert users fred edges:a:g fff 48

# check it got added
# ./Cassandra-remote get_slice_super users fred edges -1 -1
[ {'name': 'a', 'columns': [{'columnName': 'g', 'value': 'fff', 'timestamp': 48}]}]

# Delete it
# ./Cassandra-remote remove users fred edges:a

# Check it has gone
# ./Cassandra-remote get_slice_super users fred edges -1 -1

# Add it back
# ./Cassandra-remote insert users fred edges:a:g fff 50

# ... and it doesn't get added?
# ./Cassandra-remote get_slice_super users fred edges -1 -1

Update: Jonathan Ellis notes that some of this is getting old now, specifically that remove is fixed. When the first release is ready, I'll refresh this analysis. Personally I prefer Cassandra over Hadoop/HBase because Hadoop has a central single point of failure.