
Geoff Papilion has made a living running infrastructure for the past 15 years. He is currently employed at Wikia.com, scaling the infrastructure to 1.5 billion requests per day. Geoffrey is a DZone MVB, is not an employee of DZone, and has posted 26 posts at DZone. You can read more from them at their website.

Configuration only Deployments: Reduce outage windows by 90%

09.14.2012

Monitoring configuration is complicated, and the depth to which you can configure alerts and tests seems endless. Investing in some options may seem like a waste of time, but others can really help you eliminate failure states that send hundreds of alerts. The end goal of your configuration is that every alert sent to the pager is immediately actionable and all other issues are ignored. Certain failure states, such as failed switches or routers, can cause a flood of alerts because they take down the network infrastructure, and that flood obscures the true cause of an outage.

Defining the Right Config

The first step you can take to prevent a flood of pages is to define all your routers, switches, and other network equipment in your Nagios config. Once those are defined, you simply set the parents directive on each host object that depends on them.
For example:

    # Primary switch in VRRP group
    define host {
        use         switch
        address     10.0.0.2
        host_name   switch-1
        hostgroups  switches
    }

    # Secondary switch in VRRP group
    define host {
        use         switch
        address     10.0.0.3
        host_name   switch-2
        hostgroups  switches
    }

    # Web server that is reachable through either switch
    define host {
        use         server
        address     10.0.0.100
        host_name   apache-server-1
        hostgroups  servers,www
        parents     switch-1,switch-2
    }

This configures the host apache-server-1 so that if both switch-1 and switch-2 fail, Nagios treats the server as unreachable rather than down and silences its alerts. The alerts remain off until either switch-1 or switch-2 becomes available again.

A Few Things to Keep in Mind

Nagios is pretty smart and can handle multiple parents, so alerts will only be silenced if all of a host's parents become unavailable.
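
How quiet an unreachable host really is depends on its notification settings. As a rough sketch, assuming the server template used above is defined locally (its real contents are not shown in this article), the relevant directive is notification_options: leaving out the u (unreachable) flag means a host whose parents are all down will not page anyone.

    # Hypothetical contents of the "server" template; only the directives
    # relevant to parent handling are shown, a real template also needs
    # contacts, check intervals, and so on
    define host {
        name                  server
        register              0
        # d = notify on DOWN, r = notify on recovery; "u" (UNREACHABLE)
        # is deliberately left out
        notification_options  d,r
    }

If you do want a page when a whole segment disappears, keep u enabled only on the parent devices themselves.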

The availability of parent hosts is determined by the host health check, most commonly ping. If you need some other test of availability, make sure to define it in the host object with the check_command directive.
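
As an illustration, a parent device that drops ICMP could be checked over TCP instead. This is only a sketch: switch-3 and its address are made up, and it assumes a check_tcp command taking the port as its first argument is defined in your commands config (the sample config shipped with Nagios Core includes one, but yours may differ).

    # Hypothetical switch that does not answer ping
    # Assumes a "check_tcp" command exists; port 22 (SSH management) is
    # used as the sign of life
    define host {
        use            switch
        address        10.0.0.4
        host_name      switch-3
        hostgroups     switches
        check_command  check_tcp!22
    }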

Parent all the objects you can, or at least all that make sense to parent. For example, a router or transport failure at a remote data center should only send a single alert. This means you should define your routers, switches, and possibly your provider's gateways. Do whatever you think makes sense, and take it as far as you can. Remember that your goal is to keep the number of alerts manageable, so the better you define the topology, the less likely you are to get a useless page, or several hundred useless pages.
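
A remote data center might be modeled as a chain like the one below. All host names and addresses here are invented for the example, and the router template is assumed to exist alongside the switch template used earlier. Each device names the next hop toward the Nagios server as its parent, so a failure of the gateway leaves everything behind it unreachable rather than down.

    # Provider gateway: closest to the Nagios server, so it has no parent
    define host {
        use         router
        address     10.1.0.1
        host_name   dc2-gateway
    }

    # Data center router sits behind the provider gateway
    define host {
        use         router
        address     10.1.0.2
        host_name   dc2-router-1
        parents     dc2-gateway
    }

    # Switch sits behind the router; servers would list it as their parent
    define host {
        use         switch
        address     10.1.0.3
        host_name   dc2-switch-1
        hostgroups  switches
        parents     dc2-router-1
    }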

Published at DZone with permission of Geoffrey Papilion, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)