NoSQL Zone is brought to you in partnership with:

Open-source software enthusiast, Node.js and PHP expert, Lean & Agile management fan, Leadership & motivation experimenter, CEO at marmelab Francois has posted 12 posts at DZone. You can read more from them at their website. View Full User Profile

Comparing MongoDB New Aggregation Framework and SQL

03.01.2013
| 7684 views |
  • submit to reddit

MongoDB 2.1 introduced the aggregation framework, a faster alternative to Map/Reduce for common aggregation operations. If you took a look at the documentation and examples, you may have found the feature intimidating. Once you tame it, this new feature reveals itself as a very powerful beast. So read on to discover its true power through a series of examples.MongoDB New Aggregation Framework And SQL side-By-Side

A Brief Introduction About The Pipeline Syntax

A MongoDB aggregation is a series of special operators applied to a collection. An operator is a JavaScript object with a single property, the operator name, which value is an option object:

{ $name: { /* options */ } }

Supported operator names are: $project, $match, $limit, $skip, $unwind, $group, and $sort, each with their own set of options. A series of operators is called a pipeline:

[{ $project: { /* options */ } }, { $match: { /* options */ } }, { $group: { /* options */ } }]

When executing a pipeline, MongoDB pipes operators into each other. “Pipe” here takes the Linux meaning: the output of an operator becomes the input of the following operator. The result of each operator is a new collection of documents. So Mongo executes the previous pipeline as follows:

collection | $project | $match | $group => result

You can add as many operators to a pipeline as you like, even twice the same one, at different positions:

collection | $match | $group | $match | $project | $group => result

That explains why a pipeline is not written as a simple JavaScript object, but rather as a collection of objects: in an object, the same operator couldn’t appear twice:

// The first appearance of $match and $group would be ignored with this syntax
{
  $match:   { /* options */ },
  $group:   { /* options */ },
  $match:   { /* options */ },
  $project: { /* options */ },
  $group:   { /* options */ }
}
// So MongoDB imposes a collection of JavaScript objects instead
[
  { $match:   { /* options */ } },
  { $group:   { /* options */ } },
  { $match:   { /* options */ } },
  { $project: { /* options */ } },
  { $group:   { /* options */ } }
]
// That's longer and cumbersome to read, but you'll get used to it

To execute a pipeline on a MongoDB collection, use the aggregate() function on that collection:

db.books.aggregate([{ $project: { title: 1 } }]);

Tip: If you're using Node.js, both the native adapter (since v0.9.9.2) and the ODM (since v3.1.0) support the new aggregation framework. For instance, to execute the previous pipeline on a Mongoose model, you just need to write:

Books.aggregate([{ $project: { title: 1 } }], function(err, results) {
  // do something with the result
});

The main benefit of the aggregation framework is that MongoDB executes it without the overhead of the JavaScript engine. It's implemented directly in C++, and therefore it's very fast. The main limitation - as compared to classical SQL aggregation - is that it’s limited to a single collection. In other terms, you can’t do a Mongo aggregation on several collections using a JOIN-like operation. Apart from that, it’s very powerful.

In this post, I’ll illustrate the power of pipeline operators by example, and compare them to their SQL counterpart. For a detailed reference, go to docs.mongodb.org.

Select, alias, compose

Use the $project operator to select or rename properties from a collection - similar to what you would do with the SQL SELECT clause.

/ sample data
> db.books.find();
[
  { _id: 147, title: "War and Peace", ISBN: 9780307266934 },
  { _id: 148, title: "Anna Karenina", ISBN: 9781593080273 },
  { _id: 149, title: "Pride and Prejudice", ISBN: 9783526419358 },
]

	

# sample data
> SELECT * FROM book;
+-----+-----------------------+---------------+
| id  | title                 | ISBN          |
+-----+-----------------------+---------------+
| 147 | 'War and Peace'       | 9780307266934 |
| 148 | 'Anna Karenina'       | 9781593080273 |
| 149 | 'Pride and Prejudice' | 9783526419358 |
+-----+-----------------------+---------------+

> db.books.aggregate([
  { $project: {
    title: 0,           // eliminate from the output
    reference: "$ISBN"  // use ISBN as source
  } }
]);
[
  { _id: 147, reference: 9780307266934 },
  { _id: 148, reference: 9781593080273 },
  { _id: 149, reference: 9783526419358 },
]

	

> SELECT id, ISBN AS reference FROM book;
+-----+---------------+
| id  | reference     |
+-----+---------------+
| 147 | 9780307266934 |
| 148 | 9781593080273 |
| 149 | 9783526419358 |
+-----+---------------+

The $project operator can also create composed fields and sub-documents using any of the supported expression operators ($and, $or, $gt, $lt, $eq, $add, $mod, $substr, $toLower, $toUpper, $dayOfWeek, $hour, $cond, $ifNull, to name a few).

Grouping documents

Group documents with, as you would have guessed, the $group operator.

// fastest way
> db.books.count();
3
// if you really want to use aggregation
> db.books.aggregate([
  { $group: {
    // _id is required, so give it a constant value 
    // to group all the collection into one result
    _id: null,
    // increment nbBooks for each document
    nbBooks: { $sum: 1 }
  } }
]);
[
  { _id: null, nbBooks: 3 }
]

	

> SELECT COUNT(*) FROM book;
+----------+
| COUNT(*) |
+----------+
| 3        |
+----------+

// sample data
> db.books.find()
[
  { _id: 147, title: "War and Peace", author_id: 72347 },
  { _id: 148, title: "Anna Karenina", author_id: 72347 },
  { _id: 149, title: "Pride and Prejudice", author_id: 42345 }
]

	

# sample data
> SELECT * FROM book
+-----+---------------------+-----------+
| id  | title               | author_id |
+-----+---------------------+-----------+
| 147 | War and Peace       | 72347     |
| 148 | Anna Karenina       | 72347     |
| 149 | Pride and Prejudice | 42345     |
+-----+---------------------+-----------+

> db.books.aggregate([
  { $group: {
    // group by author_id
    _id: "$author_id",
    // increment nbBooks for each document
    nbBooks: { $sum: 1 }
  } }
]);
[
  { _id: 72347, nbBooks: 2 },
  { _id: 42345, nbBooks: 1 }
]

	

> SELECT author_id, COUNT(*)
  FROM book
  GROUP BY author_id;
+-----------+----------+
| author_id | COUNT(*) |
+-----------+----------+
| 72347     | 2        |
| 42345     | 1        |
+-----------+----------+

Multi Operator Pipeline

A pipeline can feature more than one operator. Here is a combination of $group and $project:

> db.books.aggregate([
  { $group: {
    _id: "$author_id",
    nbBooks: { $sum: 1 }
  } },
  { $project: {
    _id: 0,
    authorId: "$_id",
    nbBooks: 1
  } }
]);
[
  { authorId: 72347, nbBooks: 2 },
  { authorId: 42345, nbBooks: 1 }
]

	

> SELECT author_id AS author, COUNT(*) AS nb_books
  FROM book
  GROUP BY author_id;
+--------+----------+
| author | nb_books |
+--------+----------+
| 72347  | 2        |
| 42345  | 1        |
+--------+----------+

More complex aggregations

$group supports a lot of aggregation functions: $first, $last, $min, $max, $avg, $sum, $push, and $addToSet. Check the MongoDB documentation for a complete reference.

// sample data
> db.reviews.find();
[
  { _id: "455", bookId: "974147",
    date: new Date("2012-07-10"), score: 1 },
  { _id: "456", bookId: "345335",
    date: new Date("2012-07-12"), score: 5 },
  { _id: "457", bookId: "345335",
    date: new Date("2012-07-13"), score: 2 },
  { _id: "458", bookId: "974147",
    date: new Date("2012-07-16"), score: 3 }
]

	

# sample data
> SELECT * FROM review;
+-----+---------+--------------+-------+
| id  | book_id | date         | score |
+-----+---------+--------------+-------+
| 455 | 974147  | "2012-07-10" | 1     |
| 456 | 345335  | "2012-07-12" | 5     |
| 457 | 345335  | "2012-07-13" | 2     |
| 458 | 974147  | "2012-07-16" | 3     |
+-----+---------+--------------+-------+

> db.reviews.aggregate([
  { $group: {
    _id: "$bookId",
    avgScore:  { $avg: "$score" },
    maxScore:  { $max: "$score" },
    nbReviews: { $sum: 1 }
  } }
]);
[
  { _id: 345335, avgScore: 3.5, maxScore: 5, nbReviews: 2 },
  { _id: 974147, avgScore: 3, maxScore: 3, nbReviews: 2 }
]

	

> SELECT book_id,
         AVG(score) as avg_score,
         MAX(score) as max_score,
         COUNT(*) as nb_reviews
  FROM review
  GROUP BY book_id ;
+---------+------------+----------+------------+
| book_id | avg_score | max_score | nb_reviews |
+---------+------------+----------+------------+
| 345335  | 3.5       | 5         | 2          |
| 974147  | 2         | 3         | 2          |
+---------+------------+----------+------------+

Conditions

You can restrict the collection to be processed using a query object, passed to the $match operator. Whether you place this operator before or after a $group operator, it becomes the equivalent of WHERE or HAVING in SQL.

> db.reviews.aggregate([
  { $match : {
    date: { $gte: new Date("2012-07-11") }
  } },
  { $group: {
    _id: "$bookId",
    avgScore: { $avg: "$score" }
  } }
]);
[
  { _id: 345335, avgScore: 3.5 },
  { _id: 974147, avgScore: 3 }
]

	

> SELECT book_id, AVG(score)
  FROM review
  WHERE review.date > "2012-07-11"
  GROUP BY review.book_id ;
+---------+------------+
| book_id | AVG(score) |
+---------+------------+
| 345335  | 3.5        |
| 974147  | 3          |
+---------+------------+

> db.reviews.aggregate([
  { $group: {
    _id: "$bookId",
    avgScore: { $avg: "$score" }
  } },
  { $match : {
    avgScore: { $gt: 3 }
  } }
]);
[
  { _id: 345335, avgScore: 3.5 }
]

	

> SELECT book_id, AVG(score) AS avg_score
  FROM review
  GROUP BY review.book_id
  HAVING avg_score > 3;
+---------+------------+
| book_id | AVG(score) |
+---------+------------+
| 345335  | 3.5        |
+---------+------------+

Develop Embedded Arrays

If documents inside a collection contain arrays, you can develop ("unwind") these arrays into several unique documents using the $unwind operator.

// sample data
> db.articles.find();
[
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: ["science", "space", "iss"]
  },
  {
    _id: 22956492,
    title: "Computer Solves Rubiks Cube",
    tags: ["computing", "science"]
  }
]

	

# sample data
> SELECT * FROM article;
+------------+---------------------------+
| id       | title                       |
+----------+-----------------------------+
| 12351254 | Space Is Getting Closer     |
| 22956492 | Computer Solves Rubiks Cube |
+------------+---------------------------+
> SELECT * FROM tag;
+-----+------------+-----------+
| id  | article_id | name      |
+-----+------------+-----------+
| 534 | 12351254   | science   |
| 535 | 12351254   | space     |
| 536 | 12351254   | iss       |
| 816 | 22956492   | computing |
| 817 | 22956492   | science   |
+-----+------------+-----------+

> db.articles.aggregate([
  { $unwind: "$tags" }
]);
[
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: "science"
  },
  {
    _id: 12351254,
    title: "Space Is Getting Closer",
    tags: "space"
  },
  {
    _id: 22956492,
    title: "Computer Solves Rubiks Cube",
    tags: "computing"
  },
  {
    _id: 22956492,
    title: "Computer Solves Rubiks Cube",
    tags: "science"
  }
]

	

> SELECT article.id, article.title, tag.name
  FROM article LEFT JOIN tag
  ON article.id = tag.article_id;
+------------+-----------------------------+-----------+
| article.id | article.title               | tag.name  |
+------------+-----------------------------+-----------+
| 12351254   | Space Is Getting Closer     | science   |
| 12351254   | Space Is Getting Closer     | space     |
| 22956492   | Computer Solves Rubiks Cube | computing |
| 22956492   | Computer Solves Rubiks Cube | science   |
+------------+-----------------------------+-----------+

Aggregate Developed Arrays

The true power of the aggregation framework reveals when you pipe $unwind to $group. This is similar to using LEFT JOIN ... GROUP BY in SQL.

> db.articles.aggregate([
  { $unwind: "$tags" },
  { $group: {
    _id: "$tags",
    nbArticles: { $sum: 1 }
  } }
]);
[
  { _id: "science", nbArticles: 2 },
  { _id: "space", nbArticles: 1 },
  { _id: "computing", nbArticles: 1 },
]

	

> SELECT tag.name, COUNT(article.id) AS nb_articles
  FROM article LEFT JOIN tag
  ON article.id = tag.article_id
  GROUP BY tag.name;
+-----------+-------------+
| tqg.name  | nb_articles |
+-----------+-------------+
| science   | 2           |
| space     | 1           |
| computing | 1           |
+-------------+-----------+

> db.articles.aggregate([
  { $unwind: "$tags" },
  { $group: {
    _id: "$tags",
    articles: { $addToSet: "$_id" }
  } }
]);
[
  { _id: "science", articles: [12351254, 22956492] },
  { _id: "space", articles: [12351254] },
  { _id: "computing", articles: [22956492] },
]

	

> SELECT tag.name, GROUP_CONCAT(article.id) AS articles
  FROM article LEFT JOIN tag
  ON article.id = tag.article_id
  GROUP BY tag.name;
+-----------+-------------------+
| tqg.name  | articles          |
+-----------+-------------------+
| science   | 12351254,22956492 |
| space     | 12351254          |
| computing | 22956492          |
+-------------+-----------------+

Conclusion

Imagine what you can do with this system... Pipe operators one after the other, group, sort, limit, etc. The ultimate example, taken from the MongoDB documentation itself, shows a pipeline with two successive $group operators. An SQL database can only do that with subqueries.

If your Map/Reduce functions are simple enough, refactor your Mongo code to the new aggregation framework. It will execute faster, and open to a new realm of possibility.



Published at DZone with permission of its author, Francois Zaninotto. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)