Design for parallelism
There has been a lot of interests around parallel computing recently. One of the main reasons is that we all know the Moore's law (which promise to double the CPU power on a single chip every 18 months) has reached its limit. We cannot expect the speed of a single CPU to go much further. Instead of attempting to advance the clock rate of a CPU, many of the chip manufacturer has shifted their development focus to multi-core machines.
On the other hand, highly scalable system based on large pool of inexpensive commodity hardware has demonstrated significant success. Google has published the Map/Reduce model which is their underlying computing infrastructure and there are open source clone like Apache Hadoop. All these provides a very rich framework for implementing massively parallel system.
However, most software algorithms that we are using today are sequential in nature. We need to refactor them in order to fit into the parallel computing architecture.
How do we go about doing that ?
There are two different approaches to restructure a sequential algorithm into parallel, “functional decomposition” is typically used to deal with complex logic flow; and “map reduce” is used to deal with algorithm with large volume of input data with simple logic flow.
Functional Decomposition
This model attempts to break down the sequential algorithm into multiple “work units” from a functionality perspective and see if different work units can be executed in parallel. The whole analysis and design will typically go through the following steps.
Decomposition
The purpose of this step is to identify the function boundary of each work unit, which is the basic unit of execution that occurs in a specific machine sequentially
- Analyze the processing steps from a functionality boundary perspective. Break down the whole processing into a sequence of work units where each work unit represents a focused function.
- At this stage, we typically breakdown to the finest level of granularity so that we have more flexibility in the design stage to maximize the degree of parallelism.
Dependency analysis
After we break down the whole process into the finest grain of work units, we analyze the sequential dependency between different work units. Lets say workUnitB is following workUnitA in the sequential version of algorithm, and R(B) and W(B) represents the read set and write set of work unit B. Then workUnitB is directly dependent on workUnitA if any of the following conditions is true
- W(B) and W(A) overlaps
- R(B) and W(A) overlaps
- W(B) and R(A) overlaps
If we represent each work unit as a node and each “directly dependent” relationship as an arc, we will end up having a DAG (directed acyclic graph). The DAG gives us a good picture about what is the maximum parallelism that we can obtain. The critical path of the DAG provides the lower bound of the total execution time.
Analyzing communication overhead
However, as data need to be fed from an upstream work unit to its downstream work units, communication is not free as it consumes bandwidth and latency. In fact, parallelism introduces communication and coordination overhead. This purpose of this step is to understand the associated communication cost when data flow between work units.
Depends on the chosen framework technology, the communication mechanism can be one of the following …
- TCP Point to point: Persistent TCP connections are maintained between different machines and will be used to pass data between its residing work units.
- Multicast pub/sub: Downstream work units subscribe their interests to upstream work units and use a multicast mechanism to deliver data. The implementation of multicast can be based on IP multicast or epidemic message spreading over an overlay network.
- Queue: Upstream work unit put their result into a queue, which is polled by its downstream work units. FIFO semantics is provided.
- DFS: Upstream work unit put their results into a distributed file system, which is consumed by downstream work units. Unlike a queue, the communicating work units need to synchronize their access to the DFS themselves.
Aggregating work units
The purpose of this step is to regroup the work unit into coarser granularity to reduce communication overhead. For example, if workUnitA is feeding large amount of data into workUnitB, then both work units should be put into the same machine to reduce the network bandwidth consumption.
When there are multiple work units residing in the same machine, then they can be further aggregated into a larger unit. This aggregation can reduce the number of nodes in the dependency graph and hence make the scheduling more straightforward.
Another DAG is produced at the end of this step where each node represents the work aggregate.
Schedule execution
The work aggregates eventually need to be executed in some machines in the network. It is the responsibility of the scheduler to ship the job to available processors, and synchronize their execution. A node (in the DAG) is ready for execution when all the preceding nodes are completed. There is also a pool of idle processors.
A simple-mind scheduler will schedule a ready-to-execute node to a randomly picked processor from the idle pool. After the processor finishes executing a node, it will report back to the scheduler which will update the DAG and the idle processor pool. The cycle repeats.
A more sophisticated scheduler will consider more factors such as the network bandwidth between processors, estimated execution time of each node … etc. in order to provide an optimal scheduling where network bandwidth consumption is minimized.
Map Reduce
For data intensive application, large amount of data need to be processed within a single work unit although the DAG itself is simple. In this model, just running different work unit in parallel is not sufficient, the execution within a work unit also need to be parallelized and run across multiple machines.
The design methodology is different here. Instead of focusing in the flow between work units, we need to focus the input data pattern of a single work unit. Map/Reduce model is a common choice to handle this scenario. The analysis and design will typically go through the following steps.
- Identify the repetition of input data, determine the basic unit of input record. ie: input
- Identify the selection criteria of each input record. ie: select() function
- For each input record, determine how many entries to be emitted and how the emit entries should be grouped and process together. ie: handle_map(), key(), value() function
- Determine the aggregation logic of grouped entries. ie: handle_reduce() function
- Identify the selection criteria of each aggregated result. ie: having() function
If we use the Map/Reduce framework such as Hadoop, we can structure the map() and reduce() function as follows:
Conclusion
By following a systematic methodology to transform a sequential application into parallel one, we can take advantage of the parallelism to make the application more scalable.
I am an architect of Adobe working in service hosting area. I am interested and specialized in SaaS, Cloud computing and Parallel processing using Map/Reduce and Hadoop stack. Ricky is a DZone MVB and is not an employee of DZone and has posted 25 posts at DZone. You can read more from them at their website.
- Login or register to post comments
- 730 reads
- Printer-friendly version
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)













Comments
peterhuber replied on Sun, 2008/11/23 - 10:36am
Thank you Ricky for this short introduction.
I'm quite sure that parallelism is one of the next big hypes - at first it's "new" and as that interesting for the it community which always looks for the next big thing and secondly it's driven by the growing number of cores on each piece of silicium in our personal computers. Think of the video cards of nvidia for instance...
But stop "new" - I remember hearing about ordering task based on their input- and output-sets during a course at university. It was a lecture about operating systems and scheduling algorithms. I guess it was around mid 90'ies and I guess the idea wasn't new even back then.
I can remember a second course I took which was about your topic here "parallelism". What struck me most was to realize that there are algorithms that are "made" for parallelism and others are clearly not - not with an naive standard approach. An example for the later was the computation of the transitive closure. That algortihm would be useful for computing "communication-paths and cost" in what you call a more sophisticated scheduler ;-)
So I think it's nice that you present some basic ideas here about parallelising algorithms, but what I miss is a short "Please Note/Warning/Attention"-Section which covers at least some of the (d)effects you can run into and if it's only "Don't expect to have everything twice as fast if you double the number of cores"...
And furthermore I miss some sort of "literature/links/further readings"-section for becoming parallel-enthusiasts. Maybe you could hand that in later?
Peter
riho replied on Sun, 2008/11/23 - 6:59pm
Thanks Peter for your feedback.
Yes, parallelising a problem that is inherantly parallel is trivial. These kind of problem is called "embrassingly parallel" and guess what, many of the real world problem is like that. Think about Google's web crawler, indexing ... etc.
But there are other problem which is less trivial. Usually they have more steps that are sequentially dependent on each other. These kind of problem needs more analysis but usually some degree of parallelismcan be achieved.
Part of my day job is to explore potential restructuring of sequential algorithm so that it can be execute faster. I am quite surpirse that many algorithms that seems to be sequential at the beginning can in fact be parallelized. Even for calculating a transitive closure of a graph, I can imagine that we can start multiple threads to walk different parts of the graph at the branch point.
I will definitely share more findings as I learn more.