How to Tweak Ganglia Using Hadoop

08.08.2012

It's very important to monitor every machine in the cluster for OS health, bottlenecks, performance hits and so on. There are numerous tools available that spit out a huge number of graphs and statistics, but what the administrator of the cluster really needs is the handful of prominent stats that seriously affect the performance or health of the cluster. Ganglia fits the bill.


Introduction to Ganglia:
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency.

The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on over 500 clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Now that we know what Ganglia is, a few terms are worth defining before we start.
1. Node: a single machine, typically a commodity box with a (1-4) core processor, dedicated to one task.
2. Cluster: a group of nodes.
3. Grid: a group of clusters.

Heading towards Ganglia...
The Ganglia system comprises two daemons, a PHP-based web frontend and a few other small utility programs.


- Ganglia Monitoring Daemon (gmond)
Gmond is a multi-threaded daemon which runs on each cluster node you want to monitor. It is responsible for monitoring changes in host state, announcing relevant changes, listening to the state of all other Ganglia nodes via a unicast or multicast channel, and answering requests for an XML description of the cluster state. Each gmond transmits information in two ways: unicasting/multicasting host state in External Data Representation (XDR) format using UDP messages, or sending XML over a TCP connection (see the quick check sketched after this list).
- Ganglia Meta Daemon (gmetad)
The Ganglia Meta Daemon ("gmetad") periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases and exports the aggregated XML over TCP sockets to clients. Data sources may be either "gmond" daemons, representing specific clusters, or other "gmetad" daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters, since each "gmond" daemon contains the entire state of its cluster.
- Ganglia PHP Web Frontend
The Ganglia web frontend presents the gathered information to system administrators via real-time dynamic web pages.
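Once the daemons are running (covered in the steps below), a quick way to see this in action is to pull the XML state directly over TCP. A minimal sketch, assuming the stock default ports (8649 for gmond's tcp_accept_channel, 8651 for gmetad's xml_port) and that netcat (nc) is installed:
$ nc Master 8649 | head -n 20
$ nc Master 8651 | head -n 20
The first command returns the cluster state XML from gmond, the second the aggregated grid XML from gmetad.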

Setting up Ganglia:
Assume the following:
- Cluster : "HadoopCluster"
- Nodes : "Master", "Slave1", "Slave2". {Considered only 3 nodes for examples. Similarly many nodes/slaves can be configured.}
- Grid : "Grid" consists of "HadoopCluster" for now.

gmond must be installed on all the nodes, i.e. "Master", "Slave1" and "Slave2", while gmetad and the web frontend go on Master. On the Master node we can then see all the statistics in the web frontend. (A dedicated server could also be used for the web frontend instead.)
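For the hostnames above to work, every node must be able to resolve them. A minimal /etc/hosts sketch; the IP addresses are placeholders for illustration, so substitute your own:
    192.168.1.10   Master
    192.168.1.11   Slave1
    192.168.1.12   Slave2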

Step 1: Install gmond on "Master", "Slave1" and "Slave2"
Ganglia can be installed by downloading the respective tar.gz, extracting it, then running configure, make and make install. But why reinvent the wheel? Let's install it from the repository instead.
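For reference, the source build looks roughly like the sketch below; the download URL is a placeholder, and the --with-gmetad option is only needed on the node that will run gmetad:
$ wget <ganglia-download-url>/ganglia-3.1.7.tar.gz
$ tar xzf ganglia-3.1.7.tar.gz
$ cd ganglia-3.1.7
$ ./configure --with-gmetad
$ make
$ sudo make install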

OS: Ubuntu 11.04
Ganglia version 3.1.7
Hadoop version CDH3 hadoop-0.20.2

Update your repository packages.
$ sudo apt-get update
Installing dependent packages.
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev
Installing gmond.
$  sudo apt-get install ganglia-monitor
Making changes in /etc/ganglia/gmond.conf
$  sudo vi /etc/ganglia/gmond.conf
/* This configuration is as close to 2.5.x default behavior as possible
   The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {                   
  daemonize = yes             
  setuid = no
  user = ganglia             
  debug_level = 0              
  max_udp_msg_len = 1472       
  mute = no            
  deaf = no            
  host_dmax = 0 /*secs */
  cleanup_threshold = 300 /*secs */
  gexec = no            
  send_metadata_interval = 0    
}

/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
 * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
 * NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
  name = "HadoopCluster"
  owner = "Master"
  latlong = "unspecified"
  url = "unspecified"
}

/* The host section describes attributes of the host, like the location */
host {
  location = "unspecified"
}

/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = Master
  port = 8650
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
#  mcast_join = 239.2.11.71
  port = 8650
#  bind = 239.2.11.71
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
} ...
About the configuration changes:

Check the globals once. In cluster {}, change the name of the cluster from unspecified to "HadoopCluster" (as assumed above) and the owner to "Master" (this can be your organisation or admin name); latlong and url can be set according to your location, though there is no harm in keeping them unspecified.

As said, gmond communicates either by sending UDP messages or by sending XML over a TCP connection. So let's get this clear:
udp_send_channel {
      host = Master
      port = 8650
      ttl = 1
    }  ...
means that the host receiving at the other end will be "Master" (where Master is a hostname with an associated IP address; add all the hostnames to /etc/hosts with their respective IPs, as shown earlier), and that it accepts the data on port 8650.
Since gmond is now configured on "Master" as the receiver, the UDP receive channel is also port 8650.
udp_recv_channel {
    #  mcast_join = 239.2.11.71
      port = 8650
    #  bind = 239.2.11.71
    } 
The XML description of the cluster state, which can include Hadoop metrics, system metrics, etc., is served on port 8649:
tcp_accept_channel {
      port = 8649
    } ..
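Once the slaves are configured and sending, an optional way to verify that their UDP packets actually reach Master is a quick packet capture. A sketch assuming tcpdump is installed and the network interface is eth0 (adjust to your NIC):
$ sudo tcpdump -i eth0 -c 5 udp port 8650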
Starting Ganglia:
$ sudo /etc/init.d/ganglia-monitor start
$ telnet Master 8649
The output should be in XML format.

This means gmond is up. To double-check the process:
$ ps aux | grep gmond
should show gmond running.

Installing gmond on the slave machines is the same, with gmond.conf being:
    /* This configuration is as close to 2.5.x default behavior as possible
       The values closely match ./gmond/metric.h definitions in 2.5.x */
    globals {
      daemonize = yes
      setuid = no
      user = ganglia
      debug_level = 0
      max_udp_msg_len = 1472
      mute = no
      deaf = no
      host_dmax = 0 /*secs */
      cleanup_threshold = 300 /*secs */
      gexec = no
      send_metadata_interval = 0
    }

    /* If a cluster attribute is specified, then all gmond hosts are wrapped inside
     * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
     * NOT be wrapped inside of a <CLUSTER> tag. */
    cluster {
      name = "HadoopCluster"
      owner = "Master"
      latlong = "unspecified"
      url = "unspecified"
    }

    /* The host section describes attributes of the host, like the location */
    host {
      location = "unspecified"
    }

    /* Feel free to specify as many udp_send_channels as you like.  Gmond
       used to only support having a single channel */
    udp_send_channel {
      host = Master
      port = 8650
      ttl = 1
    }

    /* You can specify as many udp_recv_channels as you like as well. */
    udp_recv_channel {

    #  mcast_join = 239.2.11.71
      port = 8650
     # bind = 239.2.11.71
    }

    /* You can specify as many tcp_accept_channels as you like to share
       an xml description of the state of the cluster */
    tcp_accept_channel {
      port = 8649
    }....
Starting Ganglia:
$ sudo /etc/init.d/ganglia-monitor start
$ telnet Slave1 8649
The output should be in XML format.

This means gmond is up on the slave. To double-check the process:
$ ps aux | grep gmond
should show gmond running.

Step 2: Installing gmetad on Master
$ sudo apt-get install ganglia-webfrontend
Installing the dependencies
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-dev
Making the required changes in /etc/ganglia/gmetad.conf:
data_source "HadoopCluster"  Master
gridname "Grid"
setuid_username "ganglia"

data_source specifies the cluster name "HadoopCluster" and Master as the sole point for consolidating all the metrics and statistics.
gridname is set to "Grid", as assumed initially.
setuid_username is ganglia, the user gmetad runs as.
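Since each gmond holds the complete state of its cluster, a data_source line can also list more than one host for failover, and additional clusters simply become additional data_source lines. A sketch in which the second cluster name and address are hypothetical:
data_source "HadoopCluster" Master Slave1
data_source "AnotherCluster" 10.0.0.21:8649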

Check for the /var/lib/ganglia directory. If it does not exist, create it:
mkdir /var/lib/ganglia
mkdir /var/lib/ganglia/rrds/
and then
$ sudo chown -R ganglia:ganglia /var/lib/ganglia/
Running gmetad:
$ sudo /etc/init.d/gmetad start
You can stop it with
$ sudo /etc/init.d/gmetad stop
and run it once in debug mode to check for errors:
$ gmetad -d 1
Now restart the daemon:
$ sudo /etc/init.d/gmetad restart
Step 3 : Installing PHP Web-frontend dependent packages at "Master"
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-dev
Check for the /var/www/ganglia directory and restart apache2:
$ sudo /etc/init.d/apache2 restart
Time to hit the web URL:
http://Master/ganglia
or, in general, http://<hostname>/ganglia/

You should be able to see some graphs.
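If the page does not come up, a couple of quick checks from the Master's shell help narrow things down. A sketch assuming curl is installed and Apache serves the frontend from /var/www/ganglia:
$ ls /var/www/ganglia
$ curl -sI http://Master/ganglia/ | head -n 1
The first should list the PHP frontend files; the second should return an HTTP 200 status line.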


Step 4 : Configuring Hadoop-metrics with Ganglia.
On Master (NameNode, JobTracker):
$ sudo vi /etc/hadoop-0.20/conf/hadoop-metrics.properties

    # Configuration of the "dfs" context for null
    #dfs.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "dfs" context for file
    #dfs.class=org.apache.hadoop.metrics.file.FileContext
    #dfs.period=10
    #dfs.fileName=/tmp/dfsmetrics.log

    # Configuration of the "dfs" context for ganglia
     dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     dfs.period=10
     dfs.servers=Master:8650

    # Configuration of the "dfs" context for /metrics
    #dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext


    # Configuration of the "mapred" context for null
    #mapred.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "mapred" context for /metrics
    mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

    # Configuration of the "mapred" context for file
    #mapred.class=org.apache.hadoop.metrics.file.FileContext
    #mapred.period=10
    #mapred.fileName=/tmp/mrmetrics.log

    # Configuration of the "mapred" context for ganglia
     mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     mapred.period=10
     mapred.servers=Master:8650


    # Configuration of the "jvm" context for null
    #jvm.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "jvm" context for /metrics
    jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

    # Configuration of the "jvm" context for file
    #jvm.class=org.apache.hadoop.metrics.file.FileContext
    #jvm.period=10
    #jvm.fileName=/tmp/jvmmetrics.log

    # Configuration of the "jvm" context for ganglia
     jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     jvm.period=10
     jvm.servers=Master:8650

    # Configuration of the "rpc" context for null
    #rpc.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "rpc" context for /metrics
    rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

    # Configuration of the "rpc" context for file
    #rpc.class=org.apache.hadoop.metrics.file.FileContext
    #rpc.period=10
    #rpc.fileName=/tmp/rpcmetrics.log

    # Configuration of the "rpc" context for ganglia
     rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     rpc.period=10
     rpc.servers=Master:8650
Hadoop exposes all of these metrics to Ganglia through the GangliaContext31 class.
Restart the Hadoop services on Master, then restart gmond and gmetad (see the sketch below for the commands).
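A sketch of the restarts on Master, assuming the CDH3 service names hadoop-0.20-namenode and hadoop-0.20-jobtracker (adjust to whichever Hadoop daemons your Master actually runs):
$ sudo service hadoop-0.20-namenode restart
$ sudo service hadoop-0.20-jobtracker restart
$ sudo /etc/init.d/ganglia-monitor restart
$ sudo /etc/init.d/gmetad restart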
$ telnet Master 8649

will now spit out the Hadoop metrics as part of the XML.
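To confirm without scrolling through the whole dump, you can filter the XML for metric elements. A sketch assuming netcat (nc) is available; with GangliaContext31 the metric names are typically prefixed by their context, e.g. dfs.*, mapred.*, jvm.*, rpc.*:
$ nc Master 8649 | grep -c '<METRIC NAME='
$ nc Master 8649 | grep 'METRIC NAME="dfs' | head -n 5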

On Slave1 (Secondary Namenode, Datanode, TaskTracker)
$ sudo gedit /etc/hadoop-0.20/conf/hadoop-metrics.properties

    # Configuration of the "dfs" context for file
    #dfs.class=org.apache.hadoop.metrics.file.FileContext
    #dfs.period=10
    #dfs.fileName=/tmp/dfsmetrics.log

    # Configuration of the "dfs" context for ganglia
     dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     dfs.period=10
     dfs.servers=Master:8650

    # Configuration of the "dfs" context for /metrics
    #dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext


    # Configuration of the "mapred" context for null
    #mapred.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "mapred" context for /metrics
     mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

    # Configuration of the "mapred" context for file
    #mapred.class=org.apache.hadoop.metrics.file.FileContext
    #mapred.period=10
    #mapred.fileName=/tmp/mrmetrics.log

    # Configuration of the "mapred" context for ganglia
     mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     mapred.period=10
     mapred.servers=Master:8650

    # Configuration of the "jvm" context for null
    #jvm.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "jvm" context for /metrics
    jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

    # Configuration of the "jvm" context for file
    #jvm.class=org.apache.hadoop.metrics.file.FileContext
    #jvm.period=10
    #jvm.fileName=/tmp/jvmmetrics.log

    # Configuration of the "jvm" context for ganglia
     jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     jvm.period=10
     jvm.servers=Master:8650

    # Configuration of the "rpc" context for null
    #rpc.class=org.apache.hadoop.metrics.spi.NullContext

    # Configuration of the "rpc" context for /metrics
    rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext

    # Configuration of the "rpc" context for file
    #rpc.class=org.apache.hadoop.metrics.file.FileContext
    #rpc.period=10
    #rpc.fileName=/tmp/rpcmetrics.log

    # Configuration of the "rpc" context for ganglia
     rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
     rpc.period=10
     rpc.servers=Master:8650
Restart the TaskTracker and the DataNode with the commands:
$ sudo service hadoop-0.20-tasktracker restart
$ sudo service hadoop-0.20-datanode restart
Restart gmond on the slave:
$ sudo /etc/init.d/ganglia-monitor restart
Then
$ telnet Master 8649
will spit out the XML, now including the Hadoop metrics for the host Slave1.

The same procedure followed for Slave1 must be repeated on Slave2, restarting the Hadoop services and gmond.

On the Master,
Restart gmond and gmetad with
$ sudo /etc/init.d/ganglia-monitor restart
$ sudo /etc/init.d/gmetad restart
Hit the web URL
http://Master/ganglia
Check for the metrics, the Grid, the Cluster and all the nodes you configured.
You can also see which hosts are up, which hosts are down, the total CPUs and lots more!
 
Enjoy monitoring your cluster! :)
     