Hadoop Hangover: Launch a Hadoop Cluster CDH4 Using Apache Whirr
This post is about how to launch a CDH4 MRv1 or CDH4 YARN cluster on EC2 instances. It's said that you can launch a cluster with the help of Whirr in a matter of 5 minutes! This is very true if and only if everything works out well! ;)
Hopefully, this article helps you in that regard.
So, let's row the boat...
- Download the stable version of Apache Whirr, i.e. whirr-0.8.1.tar.gz, from the following link: whirr-0.8.1.tar.gz
- Extract the tarball
$ tar -xzvf whirr-0.8.1.tar.gz
$ cd whirr-0.8.1
- Generate the key
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
- Make a properties file to launch the cluster with that configuration.

    # Cluster name goes here
    whirr.cluster-name=testcluster

    # Change the number of machines in the cluster here
    # Using 3 DN and TT and 1 JT and NN
    # Ganglia is configured
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+ganglia-monitor+ganglia-metad,3 hadoop-datanode+hadoop-tasktracker+ganglia-monitor

    # Install JAVA
    whirr.java.install-function=install_openjdk
    whirr.java.install-function=install_oab_java

    ## Install CDH4 MRv1
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.env.REPO=cdh4

    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge

    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub
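If you would rather not paste your keys into the properties file, Whirr's bundled recipes read them from the environment using the `${env:...}` syntax. Here is a minimal sketch of that idea. The function name `append_env_credentials` is made up for illustration, and it assumes you have removed the literal `whirr.identity` and `whirr.credential` lines from the file first.

```shell
# Hypothetical helper (not part of Whirr): append credential lines that
# make Whirr read the AWS keys from environment variables.
# The quoted 'EOF' stops the shell from trying to expand ${env:...} itself.
append_env_credentials() {
  cat >> "$1" <<'EOF'
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
EOF
}

# Usage, after exporting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
#   append_env_credentials whirr_cdh.properties
```

This keeps the secrets out of any file you might share or commit by accident.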
- Now let me tell you how to avoid getting headaches!
- Cluster name: keep your cluster name simple. Avoid names like testCluster or testCluster1, i.e. no capital letters or numerals.
- Decide judiciously on the number of datanodes you want.
- Your launch may not be successful if Java is not installed. Make sure the image has Java. However, this properties file takes care of that.
- It will be good to go ahead with MRv1 for now and later switch to MRv2, when we get a production stable release.
- This is the minimal set of configurations for launching a Hadoop cluster, but you can do a lot of performance tuning on top of it.
- I launched this cluster from an EC2 instance. Initially I faced errors regarding the user. Setting the configuration below solved the problem.
- Set proper permissions for ~/.ssh and whirr-0.8.1 folder before launching.
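A couple of the tips above can be checked before launching. The sketch below is only my interpretation of them: both function names are hypothetical, "simple" is taken to mean lowercase letters only, and "proper permissions" is assumed to mean 700 on the SSH directory and 600 on the private key (the usual OpenSSH expectation). It does not touch the whirr-0.8.1 folder.

```shell
# Hypothetical helper: succeed only for names made of lowercase letters.
is_simple_cluster_name() {
  case "$1" in
    ''|*[![:lower:]]*) return 1 ;;
  esac
  return 0
}

# Hypothetical helper: restrict an SSH directory ($1) and the private
# key file inside it ($2) to the current user.
lock_down_ssh_dir() {
  chmod 700 "$1" && chmod 600 "$1/$2"
}

# Usage:
#   is_simple_cluster_name testcluster || echo "pick a simpler name"
#   lock_down_ssh_dir "$HOME/.ssh" id_rsa_whirr
```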
- Well, we are ready to launch the cluster. Name the properties file "whirr_cdh.properties".
$ cd whirr-0.8.1
$ bin/whirr launch-cluster --config whirr_cdh.properties
In the console you can see links to the NameNode and JobTracker web UIs. At the end, it also prints how to ssh to the instances.
- Now you should have the generated files. You will be able to see these files in ~/.whirr/testcluster: instances, hadoop-proxy.sh and hadoop-site.xml
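The instances file is handy when you want to find a particular node without scrolling back through the console. The sketch below assumes each line of that file is tab-separated, with the comma-separated roles in the second field and the public address in the third. That layout is an assumption on my part, so check your own file first. The function name `hosts_with_role` is hypothetical.

```shell
# Hypothetical helper: print the public address (assumed 3rd field) of every
# instance whose role list (assumed 2nd field, comma-separated) has the role.
#   $1 = path to the instances file, $2 = role name
hosts_with_role() {
  awk -F '\t' -v role="$2" '
    {
      n = split($2, roles, ",")
      for (i = 1; i <= n; i++) {
        if (roles[i] == role) { print $3; break }
      }
    }
  ' "$1"
}

# Usage:
#   hosts_with_role ~/.whirr/testcluster/instances hadoop-namenode
```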
- Starting the proxy
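$ sh ~/.whirr/testcluster/hadoop-proxy.sh
Keep the proxy running and, in another terminal, point Hadoop at the generated configuration directory (the directory, not the hadoop-site.xml file itself):
$ export HADOOP_CONF_DIR=~/.whirr/testcluster
$ hadoop fs -ls /
Alternatively, pass the configuration directory explicitly:
$ bin/hadoop --config ~/.whirr/testcluster fs -ls /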
- Now, launch Firefox (v3.0+)
- Download and install the FoxyProxy extension.
- Steps to configure and access the UI
- Select Tools > FoxyProxy > Options
- Click the “Add New Proxy” button.
- Select “Manual Proxy Configuration”
- Enter “localhost” for the “Host or IP Address” field.
- Enter “6666” for the “Port” field.
- Click on the “General” tab at the top of the dialog box.
- Enter “EC2” for the “Proxy Name” field.
- Click on the “URL Patterns” tab at the top of the dialog box.
- Click the “Add New Pattern” button.
- Enter “EC2” for the “Pattern Name” field.
- Enter “*compute-1.amazonaws.com*, *.ec2.internal*, *.compute-1.internal*” for the “URL pattern” field (not case sensitive)
- Select the “Whitelist” and “Wildcards” radio buttons.
- Click the “OK” button to dismiss the new URL pattern dialog box.
- Click the “OK” button to dismiss the new proxy dialog box.
- Completely disable FoxyProxy for now.
- You should be able to see 2 proxy names after closing, default and EC2.
- Click on “Use proxy EC2 for all URLs” from the pop-up menu of FoxyProxy.
- Copy the URL of the JobTracker (it can be seen while running the proxy, ec2-***-**-***-**.********.amazonaws.com) and paste it into the browser.
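If a page refuses to load, it is worth checking whether the host name matches the wildcards you entered. The sketch below only approximates FoxyProxy's matching with shell globs, and the function name `uses_ec2_proxy` is made up for illustration. Note that these patterns cover us-east-1 style names (compute-1.amazonaws.com). Hosts in other regions use a different suffix and would need their own pattern.

```shell
# Hypothetical helper: succeed if a host name matches one of the wildcard
# patterns from the FoxyProxy steps. The name is lowercased first because
# the patterns are not case sensitive.
uses_ec2_proxy() {
  host=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$host" in
    *compute-1.amazonaws.com*|*.ec2.internal*|*.compute-1.internal*) return 0 ;;
  esac
  return 1
}

# Usage (substitute the JobTracker host name printed in your console):
#   uses_ec2_proxy "$JOBTRACKER_HOST" && echo "goes through the proxy"
```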
So, we are good to go!
- If you want to launch MRv2, use this.

    ## Cluster name goes here.
    whirr.cluster-name=yarncluster

    # Change the number of machines in the cluster here
    whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,2 hadoop-datanode+yarn-nodemanager

    # Install JAVA
    whirr.java.install-function=install_openjdk
    whirr.java.install-function=install_oab_java

    ## Install CDH4 Yarn
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    whirr.yarn.configure-function=configure_cdh_yarn
    whirr.yarn.start-function=start_cdh_yarn
    whirr.mr_jobhistory.start-function=start_cdh_mr_jobhistory
    whirr.env.REPO=cdh4
    whirr.env.MAPREDUCE_VERSION=2

    # For EC2 set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
    whirr.provider=aws-ec2
    whirr.hardware-id=c1.xlarge

    # Credentials should go here
    whirr.identity=XXXXXXXXXXXXXXXXX
    whirr.credential=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    whirr.cluster-user=whirr
    whirr.private-key-file=/home/ubuntu/.ssh/yourKey
    whirr.public-key-file=/home/ubuntu/.ssh/yourKey.pub

Then follow the same process!