How to Tweak Ganglia Using Hadoop
It's
very important to monitor all the machines in the cluster in terms of
OS health, bottlenecks, performance hits and so on. There are numerous
tools available that spit out huge number of graphs and statistics.
But to the administrator of the cluster, only the prominent stats that
seriously affect the performance or the health of the cluster should
be portrayed.
Ganglia fits the bill.
Introduction to Ganglia:
Ganglia is a scalable distributed monitoring system for
high-performance computing systems such as clusters and Grids. It is
based on a hierarchical design targeted at federations of clusters. It
relies on a multicast-based listen/announce protocol to monitor state
within clusters and uses a tree of point-to-point connections amongst
representative cluster nodes to federate clusters and aggregate their
state. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool
for data storage and visualization. It uses carefully engineered data
structures and algorithms to achieve very low per-node overheads and
high concurrency.
The implementation is robust, has been ported
to an extensive set of operating systems and processor architectures,
and is currently in use on over 500 clusters around the world. It has
been used to link clusters across university campuses and around the
world and can scale to handle clusters with 2000 nodes.
Having known bits of the fact what Ganglia is, there are some pure terminologies to be known before the kick start.
1. Node : Generally, it's a machine targeted to perform a single task with its (1-4)core processor.
2. Cluster : Cluster consists of a group of nodes.
3. Grid : Grid consists of group of clusters.
Heading towards Ganglia...
The ganglia system is comprised of two unique daemons, a PHP-based web frontend and a few other small utility programs.
- Ganglia Monitoring Daemon(gmond)
Gmond is a multi-threaded daemon which runs on each cluster node you want to monitor.Its responsible for monitoring changes in host state, announcing relevant changes, listening to the state of all other ganglia nodes via a unicast or multicast channel and answer requests for an XML description of the cluster state. Each gmond transmits an information in two different ways unicasting/multicasting host state in external data representation (XDR) format using UDP messages or sending XML over a TCP connection.
- Ganglia Meta Daemon(gmetad)
Ganglia Meta Daemon ("gmetad") periodically polls a collection of child data sources, parses the collected XML, saves all numeric, volatile metrics to round-robin databases and exports the aggregated XML over TCP sockets to clients. Data sources may be either "gmond" daemons, representing specific clusters, or other "gmetad" daemons, representing sets of clusters. Data sources use source IP addresses for access control and can be specified using multiple IP addresses for failover. The latter capability is natural for aggregating data from clusters since each "gmond" daemon contains the entire state of its cluster.
- Ganglia PHP Web Frontend
The Ganglia web frontend provides a view of the gathered information via real-time dynamic web pages to the system administrators.
Setting up Ganglia:
Assume the following:
- Cluster : "HadoopCluster"
- Nodes : "Master", "Slave1", "Slave2". {Considered only 3 nodes for examples. Similarly many nodes/slaves can be configured.}
- Grid : "Grid" consists of "HadoopCluster" for now.
gmond is supposed to be installed on all the nodes ie. "Master", "Slave1", "Slave2". gmetad and web-frontend will be on Master. On, the Master node, we can see all the statistics in the web-frontend. However, we can have a dedicated server for web-frontend too.
Step 1: Install gmond on "Master", "Slave1" and "Slave2"
Installing Ganglia can be done by downloading the respective tar.gz, extracting, configure, make and install. But, why to reinvent the wheel. Let's go by installing the same with repository.
OS: Ubuntu 11.04
Ganglia version 3.1.7
Hadoop version CDH3 hadoop-0.20.2
Update your repository packages.
$ sudo apt-get updateInstalling depending packages.
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-devInstalling gmond.
$ sudo apt-get install ganglia-monitorMaking changes in /etc/ganglia/gmond.conf
$ sudo vi /etc/ganglia/gmond.conf
/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
daemonize = yes
setuid = no
user = ganglia
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 0 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
send_metadata_interval = 0
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "HadoopCluster"
owner = "Master"
latlong = "unspecified"
url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
location = "unspecified"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
host = Master
port = 8650
ttl = 1
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8650
# bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
} ...About the configuration changes:Check the globals once. In the cluster {}, change the name of the cluster as assumed to "HadoopCluster" from unspecified, owner to "Master"( can be your organisation/admin name), latlong,url can be specified according to your location. No harm in keeping them unspecified.
As said, gmond communicates using UDP messages or sending XML over a TCP connection. So, lets get this clear!
udp_send_channel {
host = Master
port = 8650
ttl = 1
} ...means that host recieving at the end
point will be "Master" (where Master is the hostname with associated IP
address. Add all the hostnames in /etc/hosts with the respective IPs).
The port at which it accepts is 8650.Since, gmond is configured now at "Master", the UDP recieve channel is 8650.
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8650
# bind = 239.2.11.71
}
All the XML description which could be hadoop metrics, system metrics etc. is accepted at port:8649tcp_accept_channel {
port = 8649
} ..Starting Ganglia:$ sudo /etc/init.d/ganglia-monitor start $ telnet Master 8649Output should contain XML format.
This means gmond is up.
$ ps aux | grep gmondshows gmond
Installing gmond on Slave machines is same with gmond.conf being,
/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
daemonize = yes
setuid = no
user = ganglia
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 0 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
send_metadata_interval = 0
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "HadoopCluster"
owner = "Master"
latlong = "unspecified"
url = "unspecified"
}
/* The host section describes attributes of the host, like the location */
host {
location = "unspecified"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
host = Master
port = 8650
ttl = 1
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8650
# bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}....Starting Ganglia:$ sudo /etc/init.d/ganglia-monitor start $ telnet Slave1 8649Output should contain XML format
This means gmond is up.
$ ps aux | grep gmondshows gmond Step 2 : Installing gmetad on Master.
$ sudo apt-get install ganglia-webfrontendInstalling the dependencies
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-devMaking the required changes in gmetad.conf
data_source "HadoopCluster" Master
gridname "Grid"
setuid_username "ganglia"
datasource specifies the cluster name as "HadoopCluster" and Master as the sole point of consolidating all the metrics and statistics.
gridname is assumed as "Grid" initially.
username is ganglia.
Check for /var/lib/ganglia directory. If not existing then,
mkdir /var/lib/ganglia
mkdir /var/lib/ganglia/rrds/
and then
$ sudo chown -R ganglia:ganglia /var/lib/ganglia/Running the gmetad
$ sudo /etc/init.d/gmetad startYou can stop with
$ sudo /etc/init.d/gmetad stopand run in debugging mode once.
$ gmetad -d 1Now, restart the daemon
$ sudo /etc/init.d/gmetad restartStep 3 : Installing PHP Web-frontend dependent packages at "Master"
$ sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev librrd2-devCheck for /var/www/ganglia directory and restart the apache2
$ sudo /etc/init.d/apache2 restartTime to hit Web URL
http://Master/ganglia
in general, http://<hostname>/ganglia/
You must be able to see some graphs.
Step 4 : Configuring Hadoop-metrics with Ganglia.
On Master( Namenode, JobTracker)
$ sudo vi /etc/hadoop-0.20/conf/hadoop-metrics.properties
# Configuration of the "dfs" context for null
#dfs.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log
# Configuration of the "dfs" context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=Master:8650
# Configuration of the "dfs" context for /metrics
#dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "mapred" context for /metrics
mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log
# Configuration of the "mapred" context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=Master:8650
# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "jvm" context for /metrics
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log
# Configuration of the "jvm" context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=Master:8650
# Configuration of the "rpc" context for null
#rpc.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "rpc" context for /metrics
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log
# Configuration of the "rpc" context for ganglia
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=Master:8650Hadoop provides a way to access all the metrics with GangliaContext31 class.Restart hadoop services at Master.
Restart gmond and gmetad.
$ telnet Master 8649
will spit XML metrics of Hadoop
On Slave1 (Secondary Namenode, Datanode, TaskTracker)
$ sudo gedit /etc/hadoop-0.20.2/conf/hadoop-metrics.properties
# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log
# Configuration of the "dfs" context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=Master:8650
# Configuration of the "dfs" context for /metrics
#dfs.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "mapred" context for null
#mapred.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "mapred" context for /metrics
mapred.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log
# Configuration of the "mapred" context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=Master:8650
# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "jvm" context for /metrics
jvm.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log
# Configuration of the "jvm" context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=Master:8650
# Configuration of the "rpc" context for null
#rpc.class=org.apache.hadoop.metrics.spi.NullContext
# Configuration of the "rpc" context for /metrics
rpc.class=org.apache.hadoop.metrics.spi.NoEmitMetricsContext
# Configuration of the "rpc" context for file
#rpc.class=org.apache.hadoop.metrics.file.FileContext
#rpc.period=10
#rpc.fileName=/tmp/rpcmetrics.log
# Configuration of the "rpc" context for ganglia
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=Master:8650
Restart the datanode and the tasktracker with the command$ sudo service hadoop-0.20-tasktracker restart
$ sudo service hadoop-0.20-datanode restartRestart gmond
$ telnet Master 8649will spit the XML format of Hadoop metrics for the host Slave1
Similar procedure done to Slave1 must be followed for Slave2, restarting services of hadoop and gmond.
On the Master,
Restart gmond and gmetad with
$ sudo /etc/init.d/ganglia-monitor restart $ sudo /etc/init.d/gmetad restartHit the web URL
http://Master/ganglia
Check for Metrics, Grid, Cluster and all the nodes you configured for.
You can also witness the hosts up, hosts down, total CPUs. Lots more in store!
Enjoy monitoring your cluster! :)
(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





