NoSQL Zone is brought to you in partnership with:

Tharindu holds a first class honors degree in computer science and engineering from the University of Moratuwa, Sri Lanka. He also received a professional postgraduate diploma in marketing from the CIM, UK, where he is an associate member. Tharindu currently works at WSO2. He is a Associate Tech Lead and a member of the data technologies management committee, focusing on big data, analytics, and business activity monitoring (BAM). Tharindu is a DZone MVB and is not an employee of DZone and has posted 15 posts at DZone. You can read more from them at their website. View Full User Profile

Doing Data Modeling in Cassandra

06.28.2012
| 3436 views |
  • submit to reddit

I wish I found a post similar to this, listing the practical problems of designing a data model using Cassandra (or other NoSQL DB) and with corresponding answers to overcome them. Don’t misinterpret what I’m saying, there are tons of good material out there, giving good introductory tutorials (WTF is a super column – An intro to the Cassandra data model). But I found very few articles actually listing the roadblocks you will run into when you start doing some real work with this beast. This blog is my effort to address this concern, hoping that others will be saved from that pain.

The most important point that applies to many people coming into using Cassandra from a relational DB background, which is like 99.9% of the people, is that

NoSQL means no SQL

Although this sounds retarded, and completely obvious, it is one of the most important points to understand. I say this because I see so many people not understanding this and diving into NoSQL hoping that it will be the answer to global warming.

Inserting data into NoSQL is pretty easy, IMO, and there are plenty of articles out there which help you do it. It’s the reading that’s hard. Why is that? That’s because there is no SQL. There is no language that you can use to query the data that you inserted. So if you are like me and started by creating random UUIDs as row keys and wonder how I’m going to query this data, you are now warned!

Let’s take an example, out of WSO2 BAM (This is the code base I’m most familiar with). The BAM server stores events received from other servers for monitoring. So a typical data structure would look like this.

Event CF (Column Family) = { “Timestamp + UUID” : { “activity_id” : “some_id” , “message_body” : “<xml>some xml body</xml>” ….. “other_event_attributes” : “corresponding_values” }

So, how do we make use of this data now? Let’s see how a typical RDBMS guy thinks. He’ll want something like ‘I want to get all messages that contains “WTF” in the message body’. So, how do we do this? Simple answer is ‘you can’t’. Now, that sucks! That’s just a basic thing you can do in a RDBMS with SQL. That’s true, if that’s what you wanted why did you switch to Cassandra a ‘NoSQL’ data storage (Refer this famous video on SQL vs NoSQL – http://www.youtube.com/watch?v=b2F-DItXtZs).

So that we now regret switching to Cassandra, let’s see how this can be solved. We do something that RDBMSs do in silence.  We make use of the property where each column can have an arbitrary number of rows. We create a column family (CF) named, Message_Search. Let’s use this CF to store an index of words in the message_body field. For example, now when we insert an event we received into Cassandra, this is how we go.

1. We parse and insert the event in the Event CF and get the reference to that particular event. Ex:

{ “10:43 PM 02-Nov-2011 + abcd-3hbs-2123-sfasf” : { “activity_id” : “asfd-1212-sdfs-2124″ , message_body : “WTF do you mean Cassandra sucks ? ” }

2. Now, we get the row key and keep it with us. Let foo = 10:43 PM 02-Nov-2011 + abcd-3hbs-2123-sfasf

3. Now we split the message_body based on white spaces and store each word as a row key in Message_Search. For example, “WTF” will be a row key and columns keys will be the event Id, i.e. Timestamp + UUID. (We will not use column values here, so will keep it blank)

Message_Search Column Family = { “WTF” : { “foo”, “some_other_event_uuid” } } , { “do” :  “foo”} } , { “you” :  “foo”} } , { “mean” :  “foo”} } , { “Cassandra” :  “foo”} } , { “sucks” :  “foof”} } , { “?” : { “foo”} }

OK! Most of you should figure out now how we can query for “WTF” in the message_body field. Here’s the boring explanation for completeness sake.

Now when we get the query,

1. We go to the Message_Search CF, with “WTF” as the row key and get the corresponding columns. We get, “foo” (Remember foo =  10:43 PM 02-Nov-2011 + abcd-3hbs-2123-sfasf) and “some_other_event_uuid” as results.

2. Now we go to the Event CF and query with the row keys “foo” and “some_other_event_uuid”, and we will get the events.

3. To display the message_body of the events, we just get the column value of the “message_body” column keys.

It’s so simple, isn’t it? But why do we need to go through all this trouble, when we can just go to an RDBMS. Well, you should if you could. But for the product I lead, i.e. WSO2 BAM, we deal with tremendous amounts of data mounting to 100s of TBs and we need the kind of big data storage scalability that Cassandra provides. Oh, and did I mention that it’s super fast. Of course, as you saw above, all that comes with a price.

END

Published at DZone with permission of Tharindu Mathew, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)