
Coming from a background in aerospace engineering, John soon discovered that his true interest lay at the intersection of information technology and entrepreneurship (and, when applicable, math). In early 2011, John stepped away from his day job to take up software consulting. He eventually found permanent employment at Opensource Connections, where he currently consults with large enterprises about full-text search and Big Data applications. Highlights to this point have included prototyping the future of search with the US Patent and Trademark Office, implementing the search syntax used by patent examiners, and building a Solr search relevancy tuning framework called SolrPanl.

Getting Started Quickly with Hadoop and MapReduce


So here’s the problem: You’ve finally found a block of time to sit down and get your head around Hadoop and MapReduce. You do a quick Google search for a tutorial to get you started, and immediately your problems are two-fold:

  1. You are a 23-step process and a cloud deployment away from having your first Hadoop cluster spun up.
  2. The most interesting thing you will be able to do once you get your cluster up and running is to count all the words in the complete works of Shakespeare. Ho…hum.

Well, if this is your situation, you’ll be pleased to find that the first problem goes away immediately upon downloading Hadoop. Doug Cutting, in his infinite wisdom, understood that it was intimidating to spin up an entire cluster just to start learning the platform, so Hadoop ships with a little feature, local standalone mode, that lets you get started immediately. As an example, let’s say you have a giant 137-core cluster in the cloud and you’ve stored the complete and unabridged works of all the classic authors on HDFS in the books directory. You can run your WordCount MapReduce job on that corpus and send the results to the words directory with the following command:

${HADOOP_HOME}/bin/hadoop jar WordCount.jar org.myorg.WordCount books words

On the other hand, if you have no such cluster, but you have Macbeth and Romeo and Juliet stored in the books directory on your local machine, then you can still run your WordCount MapReduce job on your measly, wimpy corpus and send the results to the words directory (again, on your local machine) by issuing the exact same command.

${HADOOP_HOME}/bin/hadoop jar WordCount.jar org.myorg.WordCount books words

Pretty easy way to get started, eh?

Issue number 2 is a bit more nefarious. Why? Because word counting is easy to understand, and it really is probably the most straightforward application of MapReduce.
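To see just how straightforward, here is a minimal plain-Java sketch of the same map/shuffle/reduce dataflow that WordCount performs. It has no Hadoop dependency, and the class and method names are mine, chosen purely for illustration: the "map" phase emits a (word, 1) pair per token, and the shuffle/reduce phase groups those pairs by word and sums the counts.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "Map" phase: tokenize one line of input and emit a (word, 1) pair per token.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + "reduce" phase: group the emitted pairs by word and sum the 1s,
    // the same grouping Hadoop performs between its Mapper and Reducer.
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                    .flatMap(WordCountSketch::map)
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                             Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("to be or not to be",
                                         "that is the question")));
    }
}
```

Hadoop's real value, of course, is that it runs this same grouped sum across machines and terabytes rather than over a ten-word list in memory.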

However, I got bored of the old WordCount Hello World, and being a fairly mathy person, I decided to make my own Hello World with a mathematical twist. Take a look!

Published at DZone with permission of John Berryman, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)