
I am an Enterprise Architect at HCL Technologies, a $7 billion IT services organization. My role is to work as a Technology Partner for large enterprise customers, providing them with low-cost open source solutions around Java, Spring and the vFabric stack. I also work on various projects involving cloud-based solutions, mobile applications and business analytics in the Spring and vFabric space. Over 23 years, I have built a repository of technologies and tools that I like and use extensively in my day-to-day work. In this blog, I am collecting these best practices and tools to help the people who visit my website. Krishna is a DZone MVB and is not an employee of DZone and has posted 64 posts at DZone. You can read more from them at their website.

SpringData-Hadoop: Jumpstart Hadoop with Spring

These days there is a lot of hype around terms like Hadoop, HBase, Hive, Pig and BigData. I was itching to learn what these terms mean and how I could see them at work in the real world. I set two goals for myself:

  1. Create a Hadoop single-node instance
  2. Figure out how it integrates with Spring/Spring Batch

As usual, I googled how to quickly set up and learn these tools. The journey was not smooth. For a Windows user, these are the options I tried for setting up a Hadoop single-node cluster on my machine:

  1. Cygwin: This approach is not easy to set up. I struggled with it for a few days on my Windows 7 machine without much to show for it.
  2. Open source and commercial VMs: EMC-GreenPlum (commercial) and Cloudera / Yahoo (open source) have created VMware instances with Hadoop and Hive bundled into the VM, and they claim these work out of the box. The Yahoo VM partially worked on my machine, but it is outdated and does not integrate with Spring. The Cloudera VM did not work on my machine because of some 64-bit conflicts.
  3. I got another VM instance from Cloudera, this time for 32-bit, and it worked. This is an Ubuntu VM instance with all the above tools installed and preconfigured.

I went with Option 3. You can start the VM and do some quick tests as described in the tutorial. If you are in a real hurry, you can open a terminal and run these commands:

cd /usr/lib/hadoop
hadoop jar hadoop-examples.jar pi 10 1000000

Congratulations, you have just run your first Hadoop job.
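
To get a feel for what the two arguments mean, here is a minimal local sketch. In the Hadoop pi example the first argument is the number of map tasks and the second is the number of samples per map. The function name estimate_pi is hypothetical, and the sketch assumes plain pseudo-random sampling through awk, which is a simplification and not the sampler the Hadoop example itself uses.

```shell
# Rough local analogue of `pi <maps> <samples>`: each simulated "map" throws
# random points at the unit square and counts those inside the quarter circle.
# Assumption: plain pseudo-random sampling via awk, not Hadoop's own sampler.
estimate_pi() {
  maps=$1
  samples=$2
  awk -v maps="$maps" -v samples="$samples" 'BEGIN {
    srand(42)                       # fixed seed so repeated runs agree
    inside = 0
    for (m = 0; m < maps; m++) {    # one pass per simulated map task
      for (s = 0; s < samples; s++) {
        x = rand(); y = rand()
        if (x * x + y * y <= 1) inside++
      }
    }
    # the quarter circle covers pi/4 of the square, so scale the ratio by 4
    printf "%.4f\n", 4 * inside / (maps * samples)
  }'
}

estimate_pi 10 10000
```

More maps or more samples per map both raise the total number of points, which is why larger arguments give a closer estimate at the cost of a longer run.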

Now, in the same VM, download Gradle and the SpringData-Hadoop distribution. Unzip both of them in your Cloudera home directory. Then open your .profile file and add the line below at the end:

export PATH=$PATH:$HOME/gradle-1.0-rc-3/bin

Note that your Gradle version may be different, so change the directory name accordingly.
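
If you want to check the PATH change without logging out and back in, something like the following can help. It is only a sketch: add_to_path is a hypothetical helper, and the Gradle directory is an assumed location that you should adjust to wherever you unzipped it.

```shell
# Hypothetical helper: append a directory to PATH only if it is not there yet,
# so sourcing the profile more than once does not pile up duplicates.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
  export PATH
}

# Assumed location; adjust to the Gradle version you actually unzipped.
add_to_path "$HOME/gradle-1.0-rc-3/bin"

# Report whether the shell can now find the gradle launcher.
if command -v gradle >/dev/null 2>&1; then
  echo "gradle found at $(command -v gradle)"
else
  echo "gradle not found on PATH"
fi
```

If the last message says gradle was not found, the directory name in the PATH entry most likely does not match the unzipped folder.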

Now go to <SpringData-Hadoop Home>/samples/batch-wordcount, open the build.gradle file, remove the existing repositories entries and add the following lines:

repositories {
    // Public Spring artefacts
    maven { url "" }
    maven { url "" }
    maven { url "" }
    maven { url "" }
    maven { url "" }
    maven { url "" }
    maven { url "" }
}

Open <SpringData-Hadoop Home>/samples/batch-wordcount/ and modify the Hadoop version:

hadoopVersion = 0.20.2-cdh3u3

Open <SpringData-Hadoop Home>/samples/batch-wordcount/src/main/resources/ and edit the below lines


Now go to the command prompt and run gradle test; the test should complete successfully. The Spring Hadoop documentation/tutorial covers the integration in more detail.
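
If you are wondering what a word count job boils down to, the map, shuffle and reduce steps can be imitated locally with standard Unix tools. This is only an analogy under my own assumptions: the word_count function is hypothetical, and it is not the code of the batch-wordcount sample.

```shell
# Local analogy of a word count job, using only standard Unix tools.
word_count() {
  tr -s '[:space:]' '\n' |      # "map": emit one word per line
    grep -v '^$' |              # drop empty records
    sort |                      # "shuffle": bring identical words together
    uniq -c |                   # "reduce": count each group of equal words
    awk '{ print $2 "\t" $1 }'  # print as word<TAB>count
}

printf 'hadoop spring hadoop\nhive spring hadoop\n' | word_count
```

For the two sample lines this prints hadoop with a count of 3, hive with 1 and spring with 2. Hadoop performs the same three steps, but spreads them across the nodes of a cluster.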

If you want to learn more about Hadoop, there are good tutorials from Cloudera and YDN; please go through them.

Published at DZone with permission of Krishna Prasad, author and DZone MVB. (source)
