Recently, there's been growing support to change the terminology we use
to describe the data model of Cassandra. This has people somewhat
divided, and although I've gone on record
as supporting the decision, I too am a bit torn. I can appreciate
both perspectives, and there are both risks and rewards associated with
the change.
The two controversial terms are Keyspace and Column Family. The terms roughly
correlate to the more familiar relational equivalents: Schema and Table.
I think that it is a fairly easy transition to change from Keyspace to
Schema. Logically speaking, in relational databases, a schema is a
collection of tables. Likewise, in Cassandra, a Keyspace is a
collection of Column Families.
The sticking point is Column Family. Conceptually, everyone can visualize a table as an n×m
matrix of data. Although you can mentally map a Column Family into that same logical construct, buyer beware.
A data model for a column-oriented database is typically *much*
different from an analogous model designed for an RDBMS. To achieve the
same capabilities that a relational database provides on tables, you
need to model your data differently to support "standard" relational
queries. Assuming a column family has the same capabilities as a table
will lead you to all sorts of headaches. (Consider range queries and indexing, for example.)
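
To make the indexing point concrete, here is a rough sketch in plain Python. It uses dicts as stand-ins for column families rather than a real Cassandra client, and the names `users`, `users_by_email`, and `insert_user` are made up for illustration. It assumes the only cheap lookup is by row key, so a query on any other attribute needs a second, hand-maintained column family.

```python
# Toy stand-ins for two column families (plain dicts, not a Cassandra client).
users = {}            # row key: user id -> {column name: value}
users_by_email = {}   # row key: email   -> {user id: ""} (hand-maintained index)

def insert_user(user_id, name, email):
    """Write the user row and the matching index row together."""
    users[user_id] = {"name": name, "email": email}
    users_by_email.setdefault(email, {})[user_id] = ""

def find_by_email_scan(email):
    """Without an index: inspect every row, which is what a table mindset implies."""
    return sorted(uid for uid, cols in users.items() if cols.get("email") == email)

def find_by_email_index(email):
    """With the index column family: a single lookup by row key."""
    return sorted(users_by_email.get(email, {}))

insert_user("u1", "Ada", "ada@example.com")
insert_user("u2", "Grace", "grace@example.com")
```

Both functions return the same answer here, but only the second one avoids touching every row. The cost is that the application has to keep the two column families in sync on every write.
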
When data modeling, I don't relate column families to tables at all.
For me, it's easier to think of column families as a map of maps. Then
just remember that the top-level map can be distributed across a set of
machines. Using that mental model you are more likely to create a data
model that is compatible with a column-oriented database. Think of
column families as tables, and you may get yourself into trouble that
will require significant refactoring.
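
Here is a minimal sketch of that mental model in plain Python. It is not how Cassandra is implemented; the node names, the `node_for` helper, and the hash-modulo placement are simplifying assumptions, chosen only to show a top-level map whose rows are spread across machines.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for(row_key):
    """Pick a node from a hash of the row key (a simplified stand-in for partitioning)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# One column family, modeled as: node -> row key -> column name -> value
cluster = {node: defaultdict(dict) for node in NODES}

def put(row_key, column, value):
    """Store one column in the row, on whichever node owns the row key."""
    cluster[node_for(row_key)][row_key][column] = value

def get_row(row_key):
    """Fetch the inner map for a row key; an unknown key yields an empty map."""
    return dict(cluster[node_for(row_key)].get(row_key, {}))

put("alice", "email", "alice@example.com")
put("alice", "city", "Philadelphia")
put("bob", "email", "bob@example.com")
```

Note that the rows for "alice" and "bob" carry different columns, and that finding a row always starts from its key. Neither is something an n×m table would lead you to expect.
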
With a strong movement towards polyglot persistence architectures, and
tools that need to span the different persistence mechanisms, I can see a
strong motivation to align terminology. (Consider ETL tools (e.g.
Talend), design tools (e.g. Erwin), even SQL clients (e.g. good old
The popularity of Cassandra's CQL is further evidence that people want
to interact with NoSQL databases using tried-and-true SQL (ironically).
And maybe we should "give the people what they want", especially if it
simultaneously eases the transition for newcomers.
The Big Picture:
Theologically, and in an ideal world, I agree with Jonathan's point:
"The point is that thinking in terms of the storage engine is difficult
and unnecessary. You can represent that data relationally, which is
the Right Thing to do both because people are familiar with that world
and because it decouples model from representation, which lets us
change the latter if necessary"
Pragmatically, I've found that it is often necessary to consider the
storage engine, at least until that engine has all the features and
functions that allow me to ignore it.
Realistically, any terminology change is going to take a long time. The
client APIs (Hector, Astyanax, etc.) probably aren't changing anytime soon,
and the documentation still reflects the "legacy" terminology. It's only on
my radar because we decided to evolve the terminology in the RefCard
that we just released.
Only time will tell what will come of "The Great Cassandra Terminology
Debates of 2012", but guaranteed there will be people on both sides of
the fence -- as I find myself occasionally straddling it. =)