Introduction

Revised 2013-11-19

A few years ago I started using Nagios to monitor servers at data centres. Today I installed NagVis to display the servers graphically, using diagrams drawn in Visio, converted to PNG format and scaled down to fit on our monitoring scoreboard. The scoreboard consists of 9 monitors driven from a PC and mounted behind a glass display in the middle of the office. It looks awesome and gives us the status of everything that’s going on in pretty close to real time. Anyway, back to Nagios.

After 12 months of actively using Nagios I have learned a few tips and tricks that make server configuration easy. The main one is called templates!

For example, here is a disk size check for the / (“root”) file system on a Linux server called server1, for want of a better example name:

define service {
  use check-disk-root
  host_name server1.mydomain.com
  notifications_enabled 1
}

This five-line definition uses a disk check template called check-disk-root. If I were monitoring /tmp I would have check-disk-tmp, and so forth. In fact, all the disk checks live in a file called disk-checks.cfg, so I can reuse the check templates for every host.

The key to making Nagios manageable for a large number of hosts is to split the hosts into individual files, each named after the machine's full host name including its domain, e.g. server1.mydomain.com.cfg. Each host file contains the host definition and the templated service checks for that host.

I group common checks together, so:

  1. Disk checks are in disk-checks.cfg
  2. System services (HTTP, BIND, SSH, MySQL etc.) are in system-checks.cfg
  3. Corporate applications are in application-checks.cfg
  4. Checks on links to external data sources are in data-vendor-checks.cfg
  5. Network devices are in network-checks.cfg

The rules are not hard and fast, so feel free to set up whatever suits you. The key is to partition the checks into related files. Now let's look at how the disk check definition is written so that it is reusable.
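
To make the file layout concrete, here is a rough sketch of how the main nagios.cfg could pull these files in. The cfg_file and cfg_dir directives are standard Nagios, but the directory paths are assumptions for illustration; adjust them to wherever your install keeps its object files.

```
# nagios.cfg (excerpt) - paths below are hypothetical

# Shared check templates, one file per group of related checks
cfg_file=/usr/local/nagios/etc/checks/disk-checks.cfg
cfg_file=/usr/local/nagios/etc/checks/system-checks.cfg
cfg_file=/usr/local/nagios/etc/checks/application-checks.cfg
cfg_file=/usr/local/nagios/etc/checks/data-vendor-checks.cfg
cfg_file=/usr/local/nagios/etc/checks/network-checks.cfg

# One file per host, e.g. server1.mydomain.com.cfg;
# cfg_dir reads every .cfg file found in the directory
cfg_dir=/usr/local/nagios/etc/hosts
```

With a cfg_dir for the hosts, adding a new server is just a matter of dropping one more file into that directory and reloading Nagios.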

Setting Up The Templates

Nagios defines a “generic-service” template. You can build on this to define the common parameters for related services. Here is the service template for all my disk checks:

#==================================================
#
# Disk checks common to all linux systems
# 20-02-2009
#
#==================================================

define service {
use generic-service
name common-disk-params
max_check_attempts 3
normal_check_interval 1
retry_check_interval 1
active_checks_enabled 1
passive_checks_enabled 1
parallelize_check 1
obsess_over_service 0
check_freshness 0
notifications_enabled 0
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
contact_groups admins
notification_interval 360
notification_period 24x7
notification_options w,u,c,r
check_period 24x7
is_volatile 0
register 0
}

My disk-checks.cfg file defines all possible Linux and MS Windows server disk checks. Below are 3 checks, for the root file system “/”, /boot and /opt. The Windows ones are defined at the end.

define service {
use common-disk-params
name check-disk-root
service_description DISK: /
check_command check_nrpe!check_disk_root
register 0
}

define service {
use common-disk-params
name check-disk-boot
service_description DISK: /boot
check_command check_nrpe!check_disk_boot
register 0
}

define service {
use common-disk-params
name check-disk-opt
service_description DISK: /opt
check_command check_nrpe!check_disk_opt
register 0
}

#==================================================
#
# Windows Disk Checks
#
#==================================================

define service {
use common-disk-params
name check-disk-c-drive
service_description DISK: C Drive
check_command check_nrpe!nt_check_disk_c
register 0
}

I normally define every disk I might have in a server, so I can add a check to a host without having to touch this file, but feel free to define only as many checks as you need.
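
Adding another file system follows exactly the same pattern. As a sketch, a /tmp check might look like the block below; the template name check-disk-tmp and the NRPE command name check_disk_tmp are assumed names chosen to match the convention above, not part of my actual file.

```
# Hypothetical extra template for /tmp, same shape as the ones above
define service {
use common-disk-params
name check-disk-tmp
service_description DISK: /tmp
check_command check_nrpe!check_disk_tmp
register 0
}
```

Because register is 0, this only defines a template. Nothing is checked until a host file uses it.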

Note the common naming format, “DISK: C Drive” or “DISK: /opt”. It makes the service checks group together in the Nagios web page display and keeps things looking consistent. (I’m big on consistent look and feel. Imagine walking into a house where every wall in every room is a different color… yuk! The analogy applies to web page text displays.)
The other beauty of the naming format is that you can deploy a common nrpe.cfg file to every host with the same OS and the NRPE check commands will be the same. I’ll cover NRPE in more detail later.
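
Until then, here is a minimal sketch of how the two sides could line up. It assumes the standard check_nrpe and check_disk plugins; the plugin path and the 20%/10% thresholds are placeholders for illustration, not values from my setup.

```
# --- On the Nagios server (e.g. in a commands file) ---
# Generic wrapper: $ARG1$ is the NRPE command name given after the "!"
define command {
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}

# --- On each Linux host, in nrpe.cfg ---
# Same command names on every host, so one nrpe.cfg fits them all
# (plugin path and thresholds below are assumptions)
command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_disk_boot]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /boot
command[check_disk_opt]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /opt
```

So a service using check_nrpe!check_disk_root asks the NRPE daemon on the host to run whatever its nrpe.cfg has registered under check_disk_root.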

Sample Host Configuration File

Now that we have some service checks defined, let's see what a host looks like. The following sample file shows a DNS server with 3 checks: BIND, SSH and the root file system.

#--------------------------------------------------
#
# DNS Server at the Primary Data Centre
# 2009-02-20
#
#
define host {
use generic-host
host_name dns-master.mydomain.com.au
alias BNE DNS master
address 192.168.1.1
check_command check-host-alive
max_check_attempts 5
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options d,u,r
}

define service {
use check-sys-bind
host_name dns-master.mydomain.com.au
notifications_enabled 1
}

define service {
use check-sys-ssh
host_name dns-master.mydomain.com.au
}

define service {
use check-disk-root
host_name dns-master.mydomain.com.au
}

Yep, that's it! 3 services defined. The check-disk-root service check comes from disk-checks.cfg, the other 2 (BIND and SSH) come from system-checks.cfg, and the host definition uses the generic-host template in Nagios. You could even template the host down further, so that the common directives are pre-specified and the entry in the host config file is even smaller.
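
As a sketch of that idea, the directives shared by every host could move into a host template of their own. The template name common-host-params is a hypothetical name for illustration; the values are simply the ones from the DNS server above.

```
# Hypothetical host template holding the settings every host shares
define host {
use generic-host
name common-host-params
check_command check-host-alive
max_check_attempts 5
contact_groups admins
notification_interval 120
notification_period 24x7
notification_options d,u,r
register 0
}

# The host file then only needs what is unique to the machine
define host {
use common-host-params
host_name dns-master.mydomain.com.au
alias BNE DNS master
address 192.168.1.1
}
```

Any directive can still be overridden in an individual host definition if one machine needs different settings.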

Here are the definitions for the two system checks so you get the idea:

define service {
use common-sys-params
name check-sys-bind
service_description SYS: DNS Bind
check_command check_tcp!53
register 0
}

define service {
use common-sys-params
name check-sys-ssh
service_description SYS: SSH
check_command check_ssh
register 0
}

The key things to notice are that system services have a display name starting with “SYS:” so they group together, and that they all use a template called common-sys-params. By default all notifications are disabled unless specifically enabled where the service is used. This is done mostly for convenience during setup: I can explicitly enable notifications for the checks that must notify on error or warning conditions.
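
The common-sys-params template itself is not shown above. A minimal sketch, assuming it mirrors common-disk-params, might look like this; the exact directive values are assumptions copied from the disk template rather than my real file.

```
# Assumed shape of common-sys-params, modelled on common-disk-params
define service {
use generic-service
name common-sys-params
max_check_attempts 3
normal_check_interval 1
retry_check_interval 1
contact_groups admins
notification_interval 360
notification_period 24x7
notification_options w,u,c,r
check_period 24x7
# Off by default; a host's service definition turns it on when needed
notifications_enabled 0
register 0
}
```

This is why the BIND check in the host file sets notifications_enabled 1 while the SSH and disk checks stay quiet.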

Check out my new posts on using Grafana, Carbon and Collectd to process performance data from systems.

Enjoy!
