I am a software architect working in service hosting area. I am interested and specialized in SaaS, Cloud computing and Parallel processing. Ricky is a DZone MVB and is not an employee of DZone and has posted 84 posts at DZone. You can read more from them at their website. View Full User Profile

Parallelism with Map/Reduce

  • submit to reddit
In this article, we will explore the Map/Reduce approach to turn a sequential algorithm into parallel

Overview of Map/Reduce

Since the "reduce" operation needs to accumulate results for the whole job, as well as having a communication overhead in sending and collecting data, the Map/Reduce model is more suitable for long running, batch-oriented jobs.

In the Map/Reduce model, "parallelism" is achieved via a "split/sort/merge/join" process and is described as follows.
  • A MapReduce Job starts from a predefined set of Input data (usually sitting in some directory of a distributed file system). A master daemon (which is a central co-ordinator) is started and gets the job configuration.
  • According to the job config, the master daemon will start multiple Mapper daemons as well as Reducer daemons in different machines. And then it starts the input reader to read data from some DFS directory. The input reader will chunk the read data accordingly and send them to a "randomly" chosen Mapper. This is the "split" phase and begins the parallelism.
  • After getting the data chunks, the mapper daemon will run a "user-supplied map function" and produce a collection of (key, value) pairs. Each item within this collection will be sorted according to the key and then sent to the corresponding Reducer daemon. This is the "sort" phase.
  • All items with the same key will come to the same Reducer daemon, which collects all the items of that key and invokes a "user-supplied reduce function" and produce a single entry (key, aggregatedValue) as a result. This is the "merge" phase.
  • The output of reducer daemon will be collected by the Output writer, which is effective the "join" phase and ends the parallelism.
Here is an simple word-counting example ...

Published at DZone with permission of Ricky Ho, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)


Dmitriy Setrakyan replied on Wed, 2008/05/21 - 2:13am

I don't understand why MapReduce is often pigeon-holed for offline processing of file data.

In GridGain, we have a different concept of MapReduce:

  1. Task can be any piece of Java code which can be executed on the Grid in real time (not limited to file data).
  2. During Map phase, a task gets split into multiple jobs and every jobs gets mapped to a specific grid node. Basically our key-value pairs are associations of jobs to grid nodes.
  3. As results from remote jobs start coming in, a result callback is invoked during which the task will decide whether to wait for more results, fail-over this result to another node (if the result is unsatisfactory), or start reducing.
  4. During Reduce phase, all results are aggregated into one compound result and returned to user (result aggregation).

The above MapReduce design is encapsulated within GridTask interface, which is our main abstraction. It can be invoked directly from Java via API, or by attaching @Gridify annotation to your Java methods when using our AOP-based grid-enabling.

Take a look at Grid Application In 15 Minutes screencast for a demonstration on how easily a simple Java "HelloWorld" application can be grid-enabled.

Dmitriy Setrakyan
GridGain - Grid Computing Made Simple

Ricky Ho replied on Wed, 2008/05/21 - 10:21pm

I agree that the Map/Reduce model, from an abstract model perspective is not restricted to batch oriented task.  But if you look at the current popular implementation, there are quite some overhead in network I/O, File I/O when splitting data, storing intermediate result ... etc.  While such overhead is insignificant for long-running batch tasks, they are typically not acceptable for online processing where response time is critical. 

Looking at its history, Map/Reduce model evolves from the solution Google use to do web crawling, indexing ... which are basically batch, offline task.

A common problem I frequently run into ...  a seqeuntial algorithm is redesigned for parallel execution and typically break down into multiple phases of Map/Reduce.  But a Reduce Task cannot be the Map task at the same method.  The reduce task has to store its result into the distributed file system, which will go through the schedule and then picked up by another Map task.  These extra I/O overhead wipe off most of the gain of parallelizing the algorithm.

Anyone has encountered similar issue ?  I would love to hear how other people deal with this.


Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.