Dan is co-founder of Fiesta - https://fiesta.cc. Before Fiesta, he was a developer working on ShopWiki. Daniel is a DZone MVB and is not an employee of DZone and has posted 8 posts at DZone.

A Walkthrough of MongoDB Data Modeling

03.13.2012

Last week’s post about MongoDB Map/Reduce was pretty well received, so it seems like there is a need for more discussion of the details involved in real-world MongoDB deployments. I thought we’d do a couple more posts that walk through how we’re using MongoDB at Fiesta.

Flexibility

One of the most touted features of MongoDB is its flexibility. I personally have emphasized flexibility in countless talks introducing MongoDB to technical audiences. Flexibility, however, is a double-edged sword: more flexibility means more choices to face when deciding how to model data (this reminds me of the Zen of Python: “There should be one - and preferably only one - obvious way to do it”). Nevertheless, I like the flexibility that MongoDB provides; it’s just important to review some best practices before settling on a data model.

The Problem

In this post we’ll take a look at how we’ve modeled mailing lists and the people that belong to them. Here are the requirements:

  • Each person can have one or more email addresses.
  • Each person can belong to any number of mailing lists.
  • Every person who belongs to a mailing list can choose what name they want to use for the list.

These requirements have obviously been simplified somewhat, but they are enough to express the core mechanics that power Fiesta.

0-Embed

Let’s examine how our data model looks if we never embed anything - we’ll call this a 0-embed strategy.

We have People, who have a name and password:

{
  _id: PERSON_ID,
  name: "Mike Dirolf",
  pw: "Some Hashed Password"
}

We have a separate collection of Addresses, where each address maintains a reference to a single Person:

{
  _id: ADDRESS_ID,
  person: PERSON_ID,
  address: "mike@corp.fiesta.cc"
}

We have Groups, each of which is basically just an ID (IRL there is some more group-specific metadata that would be in here as well, but we’re going to ignore it to focus on the relationships):

{
  _id: GROUP_ID
}

Lastly, we have Memberships, which associate a Person with a Group. Each Membership includes the list name that the Person is using for the Group, and a reference to the Address that they want to receive mail at for that Group:

{
  _id: MEMBERSHIP_ID,
  person: PERSON_ID,
  group: GROUP_ID,
  address: ADDRESS_ID,
  group_name: "family"
}

This data model is easy to design, simple to reason about, and easy to maintain. We are basically modeling the data as we would in an RDBMS, though; we aren’t leveraging MongoDB’s document-oriented approach. For example, let’s walk through how we would get the other member addresses of a group, given a single incoming address and group name (this is a very common query for Fiesta):

  1. Query the Addresses collection to get the ID of the relevant Person.
  2. Query the Memberships collection with the Person ID from step 1 and the group name to get the Group ID.
  3. Query the Memberships collection again to get all of the Memberships with the Group ID from step 2.
  4. Query the Addresses collection to get the Address to use for each of the Memberships from step 3.

Things get a bit complicated :).
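
To make the four steps concrete, here is a minimal sketch in Python that runs them against plain in-memory lists standing in for the Addresses and Memberships collections. The sample IDs, the extra people, and the function name are made up for illustration; a real deployment would issue each step as a query against MongoDB rather than scanning lists.

```python
# In-memory stand-ins for two of the 0-embed collections (sample data is hypothetical).
addresses = [
    {"_id": "a1", "person": "p1", "address": "mike@corp.fiesta.cc"},
    {"_id": "a2", "person": "p2", "address": "dan@example.com"},
    {"_id": "a3", "person": "p3", "address": "sue@example.com"},
]
memberships = [
    {"_id": "m1", "person": "p1", "group": "g1", "address": "a1", "group_name": "family"},
    {"_id": "m2", "person": "p2", "group": "g1", "address": "a2", "group_name": "relatives"},
    {"_id": "m3", "person": "p3", "group": "g2", "address": "a3", "group_name": "family"},
]


def other_member_addresses(incoming_address, group_name):
    """Return the addresses of the other members of the sender's group."""
    # Step 1: look up the incoming address to get the Person ID.
    sender = next((a for a in addresses if a["address"] == incoming_address), None)
    if sender is None:
        return []
    person_id = sender["person"]

    # Step 2: find that Person's Membership with this group name to get the Group ID.
    own = next(
        (m for m in memberships
         if m["person"] == person_id and m["group_name"] == group_name),
        None,
    )
    if own is None:
        return []

    # Step 3: fetch every Membership that points at the same Group.
    group_members = [m for m in memberships if m["group"] == own["group"]]

    # Step 4: resolve each other Membership's Address ID to an actual address.
    by_id = {a["_id"]: a["address"] for a in addresses}
    return [by_id[m["address"]] for m in group_members if m["person"] != person_id]
```

Note that the group name only identifies a group in combination with a Person: in the sample data two different groups are both called "family" by one of their members, which is why step 2 cannot be skipped.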

Embed Everything

The strategy that a lot of newcomers use when modeling their data is what we’ll call the embed everything strategy. To use this strategy for Fiesta, we’d take all of a Group’s Memberships and embed them directly within the Group document. We’d also embed Addresses and Person metadata directly within each Membership:

{
  _id: GROUP_ID,
  memberships: [{
    address: "mike@corp.fiesta.cc",
    name: "Mike Dirolf",
    pw: "Some Hashed Password",
    person_addresses: ["mike@corp.fiesta.cc", "mike@dirolf.com", ...],
    group_name: "family"
  }, ...]
}

The theory behind the embed everything strategy is that by keeping all of the related data in one place we can make common queries a lot simpler. With this strategy, the query we performed above is trivial (remember, the query is “given an address and group name, what are the other member addresses of the group”):

  1. Query the Groups collection for a group containing a membership where the address is in person_addresses and the group_name matches.
  2. Iterate over the resulting document to get the other membership addresses.
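
As a rough sketch, the two steps can be written in Python over an in-memory list standing in for the Groups collection. The second member and the function name are invented for illustration. Both conditions in step 1 have to hold for the same embedded membership, which is the kind of match MongoDB's $elemMatch operator expresses.

```python
# In-memory stand-in for the Groups collection (sample data is hypothetical).
groups = [
    {
        "_id": "g1",
        "memberships": [
            {
                "address": "mike@corp.fiesta.cc",
                "name": "Mike Dirolf",
                "pw": "Some Hashed Password",
                "person_addresses": ["mike@corp.fiesta.cc", "mike@dirolf.com"],
                "group_name": "family",
            },
            {
                "address": "dan@example.com",
                "name": "Dan",
                "pw": "Another Hashed Password",
                "person_addresses": ["dan@example.com"],
                "group_name": "relatives",
            },
        ],
    },
]


def other_member_addresses(incoming_address, group_name):
    """Return the addresses of the other members of the sender's group."""
    # Step 1: find a group with one membership matching both the address and the name.
    group = next(
        (g for g in groups
         if any(incoming_address in m["person_addresses"]
                and m["group_name"] == group_name
                for m in g["memberships"])),
        None,
    )
    if group is None:
        return []

    # Step 2: walk the embedded memberships, skipping the sender's own.
    return [m["address"] for m in group["memberships"]
            if incoming_address not in m["person_addresses"]]
```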

That’s about as easy as it gets. But what if we wanted to change a Person’s name or password? We’d have to change it in every single embedded membership. Same goes for adding a new person_address or removing an existing one. This highlights the characteristics of the embed everything model: it can be great for doing a single specific query (because we’re basically pre-joining), but can be a nightmare for long-term maintainability. I’d highly recommend against this approach in general.
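
The maintenance cost is easy to see in a small sketch. Assuming the embed everything documents are held in a Python list (the data and the rename_person helper are hypothetical), renaming one Person means visiting every group they belong to and rewriting each embedded copy:

```python
# Two embed everything Group documents that both contain the same Person
# (sample data is hypothetical and trimmed to the fields used here).
groups = [
    {"_id": "g1", "memberships": [
        {"name": "Mike Dirolf", "person_addresses": ["mike@dirolf.com"], "group_name": "family"},
        {"name": "Dan", "person_addresses": ["dan@example.com"], "group_name": "relatives"},
    ]},
    {"_id": "g2", "memberships": [
        {"name": "Mike Dirolf", "person_addresses": ["mike@dirolf.com"], "group_name": "work"},
    ]},
]


def rename_person(groups, known_address, new_name):
    """Rewrite the name in every embedded copy; return how many copies were touched."""
    touched = 0
    for group in groups:
        for membership in group["memberships"]:
            # The Person has no document of their own, so we identify them by address.
            if known_address in membership["person_addresses"]:
                membership["name"] = new_name
                touched += 1
    return touched
```

The number of writes grows with the number of groups the Person belongs to, and any copy that is missed leaves the data inconsistent.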

Embed Trivial Cases

The approach we’ve taken at Fiesta, and the approach I most often recommend, is to start by thinking about the 0-embed model. Once you’ve got that model figured out, you can pick off easy cases where embedding just makes sense. A lot of the time these cases tend to be one-to-many relationships.

For example, our Addresses each belong to a single Person (and are also referenced by Memberships). Addresses are also not likely to change very often. Let’s embed them as an array to save some queries and keep our data model in sync with our mental model of a Person.

Memberships are each associated with a single Person and a single Group, so we could imagine embedding them in either the Person model or the Group model. In cases like this, it’s important to think about both data access patterns and the magnitude of relationships. We expect People to have at most 1000s of group Memberships, and Groups to have at most 1000s of Memberships as well, so the magnitude doesn’t tell us much. Our access pattern, however, does - when we display the Fiesta dashboard we need to have access to all of a Person’s Memberships. To make that query easy, let’s embed Memberships within the Person model. This also has the advantage of keeping a Person’s addresses all within the Person model (since they are referenced both at the top level and within Memberships). If an address needs to be removed or changed, we can do it all in one place.

Here’s how things look now (this is the Person model - the only other model is Group, which is identical to the 0-embed case):

{
  _id: PERSON_ID,
  name: "Mike Dirolf",
  pw: "Some Hashed Password",
  addresses: ["mike@corp.fiesta.cc", "mike@dirolf.com", ...],
  memberships: [{
    address: "mike@corp.fiesta.cc",
    group_name: "family",
    group: GROUP_ID
  }, ...]
}

The query we’ve been discussing now looks like this:

  1. Query for a Person with the matching address and an embedded Membership with the right group_name.
  2. Use the Group ID in the embedded Membership from step 1 to query for other People with Memberships in that Group - get the addresses directly from their embedded Memberships.
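
Here is a minimal Python sketch of those two steps, again using an in-memory list in place of the People collection. The extra people and the function name are assumptions made for illustration, not Fiesta's actual code.

```python
# In-memory stand-in for the People collection (sample data is hypothetical).
people = [
    {"_id": "p1", "name": "Mike Dirolf", "pw": "Some Hashed Password",
     "addresses": ["mike@corp.fiesta.cc", "mike@dirolf.com"],
     "memberships": [
         {"address": "mike@corp.fiesta.cc", "group_name": "family", "group": "g1"},
     ]},
    {"_id": "p2", "name": "Dan", "pw": "Another Hashed Password",
     "addresses": ["dan@example.com"],
     "memberships": [
         {"address": "dan@example.com", "group_name": "relatives", "group": "g1"},
     ]},
    {"_id": "p3", "name": "Sue", "pw": "Yet Another Hashed Password",
     "addresses": ["sue@example.com"],
     "memberships": [
         {"address": "sue@example.com", "group_name": "family", "group": "g2"},
     ]},
]


def other_member_addresses(incoming_address, group_name):
    """Return the addresses of the other members of the sender's group."""
    # Step 1: find the Person who owns the address and has a Membership with this name.
    sender, own = None, None
    for person in people:
        if incoming_address in person["addresses"]:
            own = next((m for m in person["memberships"]
                        if m["group_name"] == group_name), None)
            if own is not None:
                sender = person
                break
    if sender is None:
        return []

    # Step 2: find other People with a Membership in the same Group and read
    # the delivery address straight from that embedded Membership.
    return [m["address"]
            for other in people if other["_id"] != sender["_id"]
            for m in other["memberships"] if m["group"] == own["group"]]
```

Compared with the 0-embed sketch, the lookups drop from four to two, while each Person's name, password, and addresses still live in exactly one document.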

It’s still almost as simple as in the embed everything case, but our data model is a lot cleaner and easier to maintain. Hopefully this walkthrough has been helpful - if you have any questions let us know!

Published at DZone with permission of Daniel Gottlieb, author and DZone MVB. (source)
