
I'm a Lead engineer at Terracotta, Software AG. I have 10+ years of experience with Java, but I also speak other languages: C, C++, JavaScript, CoffeeScript, PHP, Perl... See my blog to learn about my personal projects! Aurelien is a DZone MVB, not an employee of DZone, and has posted 17 posts at DZone.

Confused About Map/Reduce?

11.28.2012

 I was working on some Hadoop stuff recently, and as a total beginner, I found that the Map/Reduce concept was not easy to understand, despite the huge number of tutorials.
The Wordcount example is the ‘Hello World’ of Hadoop, but when I prepared a small presentation for my team, I realized it was not clear enough to explain Map/Reduce in 5 minutes.

As you may already know, the Map/Reduce pattern is very well suited to embarrassingly parallel algorithms.

Okayyyy but… What is an embarrassingly parallel algorithm?
Answer: It is an algorithm that can easily be split into many parts, each executed in parallel.

Ok then… what is well suited for parallel execution?
Answer: Any algorithm that works on data that can be isolated.

When writing an application, if you execute multiple instances of it at the same time and they need to access some common data, there will be clashes, and you will have to handle cases like one instance changing some data while another is reading it. You're doing concurrency.
But if each instance works on data that no other instance will need, then you're doing parallelism. Obviously you can scale much further, since you have no concurrency issues.

So let's take an example. Let's say you have a list of cities, and each one has two attributes: the state it belongs to, and its yearly average temperature. E.g.: San Francisco: {CA, 58}
Now you want to calculate the yearly average temperature BY STATE.
Since you can group cities by state, and calculate the average temperature of a state without caring about cities of other states, you have a great embarrassingly parallel algorithm candidate.

If you wanted to do it sequentially, you would start with an empty list of yearly state average temperatures. Then you would iterate through the list of cities, and for each city, look at the state, then update the relevant yearly state average temperature.
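In plain Java, the sequential version could look something like this. This is a minimal sketch over illustrative sample data (the `City` record and city values are mine, not the article's dataset, and this is not the Hadoop code linked at the end):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sequential sketch: one pass over the city list, keeping a running
// sum and count per state, then dividing to get each state's average.
class SequentialAverage {
    // Illustrative data shape: a city has a name, a state, and a yearly average temperature.
    record City(String name, String state, double avgTemp) {}

    static Map<String, Double> averageByState(List<City> cities) {
        Map<String, double[]> sumAndCount = new HashMap<>(); // state -> {sum, count}
        for (City c : cities) {
            double[] acc = sumAndCount.computeIfAbsent(c.state(), s -> new double[2]);
            acc[0] += c.avgTemp(); // running sum of city averages
            acc[1] += 1;           // number of cities seen for this state
        }
        Map<String, Double> result = new HashMap<>();
        sumAndCount.forEach((state, acc) -> result.put(state, acc[0] / acc[1]));
        return result;
    }

    public static void main(String[] args) {
        List<City> cities = List.of(
                new City("San Francisco", "CA", 58),
                new City("Los Angeles", "CA", 66),
                new City("Portland", "OR", 54));
        System.out.println(averageByState(cities)); // prints the per-state averages
    }
}
```

Note that the whole computation happens in one thread, touching every city in turn.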

Fortunately, it’s very easy to do it in parallel instead.

Let’s have a look at this map:

This is a map of India. There are several states: MP, CG, OR… and several cities, each one having {State, City average temperature} as its value.

Here we want to calculate the yearly average per state. In order to do that, we should group the city average temperatures by state, then calculate the average of each group.

We don't really care about the city names, so we will discard those and keep only the state names and city temperatures.

Now we have only the data we need, and we can group the temperature values by state. We're going to get a list of city average temperatures for each state.

At this point, the data is in good shape to actually do the maths: all we have to do is calculate the average temperature for each state.

That wasn’t hard.

We had some input data. We did a little regrouping, then we did the calculation. And all this could be executed in parallel (One parallel task for each state).
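The regrouping and calculation above can be sketched with Java parallel streams, where each state's group is averaged independently. The `String[]` row shape is illustrative; this is not the article's Hadoop code:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Parallel sketch: group city temperatures by state, then average each
// group. Each state's average depends only on its own cities, which is
// exactly what makes the problem embarrassingly parallel.
class ParallelAverage {
    // Each input row is {cityName, state, yearlyAvgTemp} (illustrative shape).
    static Map<String, Double> averageByState(List<String[]> cities) {
        return cities.parallelStream()
                .collect(Collectors.groupingBy(
                        c -> c[1],                                          // key: the state
                        Collectors.averagingDouble(c -> Double.parseDouble(c[2]))));
    }
}
```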

Well… That was Map/Reduce!

Let’s do it again

Map/Reduce has 3 stages: Map, Shuffle and Reduce.

The Shuffle part is done automatically by Hadoop; you just need to implement the Map and Reduce parts.

You get input data as <Key, Value> pairs for the Map part.

In this example, the Key is the city name, and the Value is the set of attributes: state and city yearly average temperature.

Since you want to group your temperatures by state, you're going to get rid of the city name; the State becomes the Key, and the Temperature becomes the Value.
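That Map step could be sketched in plain Java like this (the `String[]` row shape is illustrative, and this is not Hadoop's Mapper API):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// The Map step: each input record <city, {state, temp}> becomes an
// output pair <state, temp>; the city name is simply dropped.
class MapStep {
    // Each input row is {cityName, state, yearlyAvgTemp} (illustrative shape).
    static List<Map.Entry<String, Double>> map(List<String[]> rows) {
        return rows.stream()
                .map(r -> Map.entry(r[1], Double.parseDouble(r[2]))) // keep state and temp only
                .collect(Collectors.toList());
    }
}
```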

Now, the Shuffle task runs on the output of the Map task. It groups all the values by Key, so for each Key you get a List<Value>.

And this is what the Reduce task gets as input: the <Key, List<Value>> pairs from the Shuffle task.
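A plain-Java sketch of that Shuffle grouping (illustrative only; in a real job Hadoop performs this step for you):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// The Shuffle step groups the mapped <state, temp> pairs by key, giving
// <state, List<temp>>. This version only shows the data movement.
class ShuffleStep {
    static Map<String, List<Double>> shuffle(List<Map.Entry<String, Double>> mapped) {
        return mapped.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,                                           // group by state
                Collectors.mapping(Map.Entry::getValue, Collectors.toList()) // collect its temps
        ));
    }
}
```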

The Reduce task is the one that does the logic on the data; in our case, this is the calculation of the state yearly average temperature.
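A plain-Java sketch of such a Reduce (illustrative only, not Hadoop's Reducer API):

```java
import java.util.List;

// The Reduce step holds the actual logic: for one state, it takes the
// List of city temperatures produced by the Shuffle and returns their average.
class ReduceStep {
    static double reduce(List<Double> temps) {
        return temps.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN); // empty group: no meaningful average
    }
}
```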

And that's what we get as the final output.

This is how the data is shaped across Map/Reduce:

Mapper: <K1, V1> → <K2, V2>
Reducer: <K2, List<V2>> → <K3, V3>
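Those signatures can be sketched as one generic helper in plain Java. This is an illustration of the type flow, not Hadoop's actual API, and the helper name is mine:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

// Generic sketch of the type flow: the mapper turns <K1, V1> into
// <K2, V2>, the shuffle groups values into <K2, List<V2>>, and the
// reducer turns each group into <K2, V3>.
class MiniMapReduce {
    static <K1, V1, K2, V2, V3> Map<K2, V3> run(
            Map<K1, V1> input,
            BiFunction<K1, V1, Map.Entry<K2, V2>> mapper,
            Function<List<V2>, V3> reducer) {
        // Map phase: emit one <K2, V2> pair per input record
        List<Map.Entry<K2, V2>> mapped = input.entrySet().stream()
                .map(e -> mapper.apply(e.getKey(), e.getValue()))
                .collect(Collectors.toList());
        // Shuffle phase: group values by key -> <K2, List<V2>>
        Map<K2, List<V2>> grouped = mapped.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        // Reduce phase: one output value per key -> <K2, V3>
        Map<K2, V3> out = new HashMap<>();
        grouped.forEach((k, vs) -> out.put(k, reducer.apply(vs)));
        return out;
    }
}
```

For the temperature example, the mapper drops the city name and the reducer averages each state's list of temperatures.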

I hope this helped make things a bit clearer about Map/Reduce. If you're interested in explanations about Map/Reduce v2 (YARN), just leave a comment and I'll post another entry.

PS: The java code for this example can be found here:

https://github.com/jsoftbiz/mapreduce_1


Published at DZone with permission of Aurelien Broszniowski, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Paul Kofon replied on Wed, 2012/11/28 - 1:47pm

Aurelien, this is a very clearly-written and useful intro to the concept of Map/Reduce; especially for those of us new to the subject of "Big data", like myself. I look forward to your next post and thanks for this one.

Mason Mann replied on Wed, 2012/11/28 - 3:34pm

That was a mighty complicated way of saying "distributed hierarchy of workers", which map/reduce really is.

John Burgess replied on Wed, 2012/12/05 - 4:45am

 Good stuff.  Thanks for making this a lot clearer.  I'd like to see your take on Map Reduce 2/YARN as well.


Thanks again!

Wolfgang Killinger replied on Wed, 2012/12/05 - 5:31am

Makes the basic concept and procedure much clearer. Please follow up with additional articles in this style.

Daniel Sitnik replied on Wed, 2012/12/05 - 6:55am

Great example!

Very easy to follow and understand.

Thanks for sharing the code too.

Sandeep Purbiya replied on Wed, 2012/12/05 - 8:14am


Wow!! you made it quite easy to understand it. Thanks and look forward to your next post.

Ravinder Pasula replied on Wed, 2012/12/05 - 2:33pm

Really great explanation of Map/Reduce. I always wondered how it works - the concept is now well understood - Thanks a lot Aurelien

Tom Clancy replied on Fri, 2012/12/07 - 6:14am

 Well done! The best article explaining MapReduce I've read lately!!

Guilherme Russi replied on Thu, 2014/02/13 - 2:51pm

Hello, I'm starting with Hadoop 2.2.0 and I'd like to know how can I run your program, I mean, how do I call it with Hadoop? Thank you.

Palanisamy Rama... replied on Mon, 2014/08/11 - 1:21am

I can't thank you enough for clarifying the MapReduce concept with a very simple example. Awesome!!! Thank you so much.
