
Sasha Goldshtein is a Senior Consultant for Sela Group, an Israeli company specializing in training, consulting, and outsourcing for local and international customers. His work spans three primary disciplines: he consults for clients on architecture, development, debugging, and performance issues; he develops code using the latest Microsoft technologies; and he teaches training classes on a variety of topics, from Windows Internals to .NET Performance. Sasha writes from Herzliya, Israel; you can read more about his work and latest ventures on his blog. Sasha is a DZone MVB (not a DZone employee) and has posted 204 posts at DZone.

Uneven Work Distribution and Oversubscription


A few days ago I was teaching our Win32 Concurrent Programming course and showed students an experiment with the std::thread class introduced in C++11. The experiment is designed to demonstrate how to partition work across multiple threads and coordinate their execution; the work to partition is simply counting the number of primes in a certain interval.

You can find the whole benchmark here. The heart of the code is the parallelize_count function, below:

void parallelize_count(unsigned nthreads, unsigned begin, unsigned end) {
    std::vector<std::thread> threads;
    unsigned chunk_size = (end - begin) / nthreads;
    for (unsigned i = 0; i < nthreads; ++i) {
        // Offset from begin so the function also works for ranges
        // that don't start at zero.
        unsigned chunk_begin = begin + chunk_size * i;
        unsigned chunk_end = chunk_begin + chunk_size;
        threads.emplace_back([=]() {
            count_primes(chunk_begin, chunk_end);
        });
    }
    std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
}

Many people expect this to scale linearly with the number of threads, up to the hardware limit – the number of processors actually installed on the system. The reality is slightly more complex. Below are the results from one of the trial runs on my system, with an i7-3720QM processor (4 physical cores, 8 logical cores):


The orange and blue lines are results from gcc 4.8.2 on OS X 10.9 and from Visual C++ 2013 on Windows 8. VC’s spike around 50 threads is not reproducible, and can be attributed to measurement perturbations. The interesting results, which are fairly similar across both compilers, are the following:

  1. The speedup is not quite linear. For example, going from 1 to 2 threads cuts the runtime by only 27% (linear scaling would cut it by 50%).
  2. Even though I only have 4 physical cores on this machine, there is still a considerable 37% speed bump from 4 to 8 threads.
  3. Even after the hardware limits are exceeded, we can still see speedups from introducing more threads to the mix. For example, going from 8 to 16 threads produces a 27% speedup.

The explanation for this is the uneven workload distribution between the threads. Counting primes in an interval of larger numbers takes much longer than in an interval of smaller numbers. The Visual Studio Concurrency Visualizer can show you this graphically – and even embed markers from your code into the generated report. For example, here is a sample showing the three threads working on the range [0, 100000):


And here are the eight threads working on the same range:


This uneven workload distribution means that adding more threads – i.e., forcefully oversubscribing the scheduler with more threads than processors – is actually beneficial here: the extra threads break the expensive tail of the range into pieces that otherwise-idle cores can pick up.

The practical lesson here is that distributing work across threads is not easy. One good way to get rid of these (and other) problems is to use smaller chunks of work and queue them to the thread pool. There are of course even higher-level abstractions, such as parallel_for, designed to handle the work item partitioning and distribution for you. But that’s a story for a different post.

Published at DZone with permission of Sasha Goldshtein, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)