
Nikita Ivanov is a founder and CEO of GridGain Systems, developer of one of the most innovative real-time big data platforms in the world. He has almost 20 years of experience in software development, a vision and pragmatic view of where development technology is going, and high quality standards in software engineering and entrepreneurship.

Berkeley Researchers Highlight Emergence of In-Memory Processing

09.05.2012

Researchers at the University of California, Berkeley recently released an excellent paper analyzing data from the Hadoop installation at Facebook, one of the largest in the world. The team examined various metrics for Hadoop jobs running at a Facebook datacenter that has over 3,000 computers dedicated to Hadoop-based processing.

I advise everyone to read it firsthand, but I will list some of the most interesting bits.

The traditional quest for disk locality (i.e., affinity between a Hadoop task and the disk that contains that task's input data) was based on two key assumptions (a minimal sketch of how this locality is exposed to the scheduler follows the list):

  • Local disk access is significantly faster than network access to a remote disk
  • Hadoop tasks spend a significant amount of their processing time in disk IO, reading input data
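
To make the first assumption concrete, here is a minimal sketch of how HDFS exposes block locations, using the standard Hadoop FileSystem API (the class name LocalityProbe is mine, not something from the paper):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints, for each HDFS block of a file, the hosts that store a replica.
public class LocalityProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);              // e.g. an HDFS file path
        FileStatus status = fs.getFileStatus(input);
        for (BlockLocation block :
                fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```

The MapReduce scheduler consumes the same host lists, via each job's InputSplits, when it tries to place a task on a machine holding a replica of its input block.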

Through careful analysis of the Hadoop system at Facebook (as their prime testbed), the authors claim that both of these assumptions are rapidly losing hold:

  • With the new full-bisection topologies in modern data centers, local disk access is almost identical in performance to network access, even across racks (the performance difference between the two today is less than 10%)
  • Greater parallelization and data compression lead to lower disk IO demand on individual tasks; in fact, Hadoop jobs at Facebook deal mostly with text-based data that compresses dramatically (see the configuration sketch after this list)
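
As a concrete illustration of the compression point, here is a minimal sketch of enabling intermediate (map output) compression for a job; the property names are the Hadoop 2.x ones (older releases use mapred.* equivalents), and the class name CompressedJobSetup is mine:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedJobSetup {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output: trades some CPU for a large
        // reduction in the disk and network IO each task performs.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        return Job.getInstance(conf, "compressed-job");
    }
}
```

Compressing map output is one simple lever behind the lower per-task IO demand the authors describe.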

The authors then argue that memory locality (i.e. keeping input data in memory and maintaining affinity between a Hadoop task and its in-memory input data) produces greater performance advantages because:

  • RAM access is up to three orders of magnitude faster than local disk access (a back-of-envelope sketch follows the list)
  • Even though memory capacity is significantly smaller than disk capacity, it is large enough for most cases
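
To put rough numbers on the first point, here is a back-of-envelope sketch; the throughput figures are illustrative assumptions of mine, not measurements from the paper:

```java
// Napkin math: time to scan one 256 MB HDFS block from disk vs. from RAM.
public class NapkinMath {
    public static void main(String[] args) {
        long blockBytes = 256L << 20;        // 256 MB input block
        double diskBytesPerSec = 100e6;      // ~100 MB/s sequential disk read
        double ramBytesPerSec  = 10e9;       // ~10 GB/s RAM bandwidth
        System.out.printf("disk scan: %.2f s%n", blockBytes / diskBytesPerSec);
        System.out.printf("ram scan:  %.4f s%n", blockBytes / ramBytesPerSec);
        // ~2.7 s vs ~0.027 s: two orders of magnitude on bandwidth alone;
        // on random access (millisecond seeks vs. ~100 ns RAM latency)
        // the gap widens even further.
    }
}
```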

Consider this: despite the fact that 75% of all HDFS blocks are accessed only once, 64% of Hadoop jobs at Facebook achieve full memory locality for all of their tasks. In the case of Hadoop, full locality means that there is no outlier task that has to access the disk and delay the entire job. And all of this is achieved using a rather primitive LFU caching policy and basic pre-fetching of input data.
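
For reference, here is a minimal sketch of an LFU (least-frequently-used) cache of the kind just mentioned; this is a simple O(n)-eviction illustration of the policy, not the actual implementation studied in the paper:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal LFU cache: evicts the entry with the lowest access count.
public class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Integer> counts = new HashMap<>();

    public LfuCache(int capacity) { this.capacity = capacity; }

    public V get(K key) {
        if (!values.containsKey(key)) return null;
        counts.merge(key, 1, Integer::sum);   // bump access frequency
        return values.get(key);
    }

    public void put(K key, V value) {
        if (capacity == 0) return;
        if (!values.containsKey(key) && values.size() >= capacity) {
            // Evict the least-frequently-accessed entry (linear scan).
            K victim = Collections.min(counts.entrySet(),
                    Map.Entry.comparingByValue()).getKey();
            values.remove(victim);
            counts.remove(victim);
        }
        values.put(key, value);
        counts.merge(key, 1, Integer::sum);
    }
}
```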

Given these facts, the authors conclude that disk locality is no longer worth optimizing for, and that in-memory co-location is the way forward for high-performance big data processing, as it yields far greater returns.

Published at DZone with permission of Nikita Ivanov, author and DZone MVB. (source)
