Today I knocked up a fully functional, very cool (kewl?) performance metrics dashboard in a few hours, with live data streaming from our Cloud Cluster. I was particularly impressed with how easily it went together, all thanks to a bunch of technologies that worked virtually out of the box.
The (tool)box consisted of:
- Generating CollectD style performance messages.
- Sending them to a RabbitMQ message queuing server.
- Using a backend daemon called “Carbon”, subscribed to the MQ exchange, which stores each message in a NoSQL-style, fixed-size time-series DB called “Whisper”.
- Then using the graph rendering engine “Graphite” to generate the graphs (on demand) and display them on my PHP powered dashboard.
The dashboard part was knocked up in PHP. It just outputs the page, and the client browser then calls the rendering engine to build and display the graphs. The page is a tabbed design using DOMTab. I got slack and used tables to lay out the graphs, but I think that’s going to be OK.
The script to generate the metrics is very easy. I had originally proposed to use JSON, but the CollectD-Graphite format was even easier to get working. It looks like this:
<path>.<metric> <value> <unix_time>
so we have:
collectd.server.domain.com.au.load-ave 0.25 1390475213
UPDATE: After running live for a few months, we have found that using just the hostname, without the domain portion, is easier to parse.
The path in our case starts with “collectd” but I will change this to “Servers”, “Containers”, “Virtual” or “ApplicationData” when we divide our metrics up by source type.
The “server.domain.com.au” part is the server name. In the example the metric is “load-ave”, followed by the value (0.25 in this case) and lastly the UNIX time.
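To make the message layout concrete, here is a minimal Python sketch that builds one line in the format above and splits it back into its three fields. The names `Metric`, `format_metric` and `parse_metric` are made up for this illustration; they are not part of CollectD, Carbon or Graphite.

```python
import time
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    path: str        # "<path>.<metric>", e.g. "collectd.server.domain.com.au.load-ave"
    value: float
    timestamp: int   # UNIX time in seconds


def format_metric(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Build one '<path>.<metric> <value> <unix_time>' message."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}"


def parse_metric(line: str) -> Metric:
    """Split a message back into its three space-separated fields."""
    path, value, timestamp = line.split()
    return Metric(path, float(value), int(timestamp))


line = format_metric("collectd.server.domain.com.au.load-ave", 0.25, 1390475213)
print(line)  # collectd.server.domain.com.au.load-ave 0.25 1390475213
print(parse_metric(line))
```

Because the fields are separated by single spaces, the path itself must not contain any, which is why dots are used to build the hierarchy.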
The script I wrote basically creates this message and then writes it to a RabbitMQ server. The MQ server has a specific exchange set up and a routing key defined. The exchange is a topic exchange (amq.topic is the built-in one), and the routing key allows the “Carbon” data collection daemon to subscribe to the same events and process them. Testing is easy: create a queue, bind it with the binding key “#” to the same exchange the apps generating data publish to, and every message will land in it.
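The reason “#” works as a catch-all for testing comes from how topic exchanges compare binding keys with routing keys: keys are dot-separated words, “*” stands for exactly one word and “#” for zero or more. The sketch below imitates that rule in plain Python for illustration only. It is not RabbitMQ code, and the routing keys in the examples are hypothetical.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Return True if a topic-exchange binding key matches a routing key."""

    def match(pattern, words):
        if not pattern:
            # Pattern exhausted: it is a match only if no words are left over.
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' may swallow any number of words, including none.
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        return (head == "*" or head == words[0]) and match(rest, words[1:])

    return match(binding_key.split("."), routing_key.split("."))


# Hypothetical routing keys, purely for illustration.
print(topic_matches("#", "collectd.server.load-ave"))           # True
print(topic_matches("collectd.#", "collectd.server.load-ave"))  # True
print(topic_matches("collectd.*", "collectd.server.load-ave"))  # False
```

A test queue bound with “#” therefore sees everything published to the exchange, while Carbon can be bound with a narrower key if only some of the traffic should be stored.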
Changing your mindset
It’s important to understand why we went in this direction. Firstly, Nagios/Icinga is great, but the model of interrogating 10,000 servers is not going to work. Out of the box Nagios does not have any native clustering support, and trying to get NDO working on this scale is going to be a pain. The model we have introduced means the clients push all the metrics they can at 1 minute intervals, without any concern as to what is processing them. The RabbitMQ server supports a fully integrated clustering solution out of the box, which means getting data from system to system is easy. Collecting metrics and piping them off to other systems is also easy.
Being a Nagios/Icinga admin, I know how long it takes to configure new servers and services into the Nagios configuration… not anymore. Carbon will AUTOMATICALLY store the data for new servers (the <path>.<metric> component of our message above), so no configuration is necessary to bring systems online and start storing their metrics… ANY metric!
The dashboard is the only component I needed to design myself, but since I wrote mine I have discovered a range of “Graphite” based dashboard software. The Graphite Browser allows you to craft a new graph from the available metrics, but using PHP I dynamically generated the URL needed to kick off the renderer to produce the graphs I wanted.
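Generating the render URL is mostly a matter of assembling a query string. My dashboard does it in PHP; the sketch below shows the same idea in Python, using Graphite's `/render` endpoint with its `target`, `from`, `width`, `height` and `title` parameters. The host name `graphite.example.com`, the default sizes and the function name `render_url` are assumptions made up for this example.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, time_from="-1h", width=400, height=250, title=None):
    """Build a Graphite render URL that draws the given metric paths."""
    # One 'target' parameter per metric line on the graph.
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("width", width), ("height", height)]
    if title:
        params.append(("title", title))
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


# Hypothetical Graphite host and metric path.
url = render_url("http://graphite.example.com", ["collectd.server.load-ave"])
print(url)
```

The dashboard page only has to emit one `<img>` tag per URL like this; the browser then asks the renderer for each graph when the page loads.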
The CollectD application we initially installed with yum did not have working support for v0.4 of the RabbitMQ C client library (rabbitmq-c). I downloaded CollectD-5.4.0 and then modified the amqp.c file to use the new TCP socket connection code that was introduced in v0.4. In the process I forked a new copy of the CollectD code base and saved my changes back to GitHub so others can build from it.
CollectD generates a lot of metrics. I set our installs to have an “Interval” of 60 seconds; normally it’s set to 10 seconds. Even so, it generates around 130-140 messages each minute per server with a basic install and nothing extra turned on. So expect to see a lot of traffic!
Short of writing your own CollectD plugin, it’s dead simple to produce the output message format needed for Carbon-Graphite to store and graph your data, so simple shell scripts generating the metrics I needed and pushing them to the MQ server took very little time to write.
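To put “a lot of traffic” into numbers, here is a rough back-of-the-envelope calculation. It assumes the 10,000 servers mentioned earlier all run the same basic install with the 60 second interval; the function name is made up for the example.

```python
def messages_per_second(servers: int, per_server_per_minute: int) -> float:
    """Aggregate message rate arriving at the MQ server."""
    return servers * per_server_per_minute / 60


for per_minute in (130, 140):
    rate = messages_per_second(10_000, per_minute)
    print(f"{per_minute} msgs/min/server -> about {rate:,.0f} msgs/sec in total")
```

That works out to roughly 21,700 to 23,300 messages per second across the whole estate, before any extra plugins or custom metrics are switched on.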
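As an illustration of how little a metric generator has to do, here is a minimal Python sketch (my own scripts are shell) that samples the 1-minute load average and produces a ready-to-send message, using the short hostname as per the update above. The “collectd” prefix and the “load-ave” name follow the earlier example; the function name is hypothetical, and pushing the line to the MQ server is left out.

```python
import os
import socket
import time


def load_average_message(prefix: str = "collectd") -> str:
    """Return '<prefix>.<host>.load-ave <value> <unix_time>' for this machine."""
    # Keep only the short hostname, dropping any domain portion.
    host = socket.gethostname().split(".")[0]
    # os.getloadavg() is available on Unix-like systems only.
    load_1min = os.getloadavg()[0]
    return f"{prefix}.{host}.load-ave {load_1min:.2f} {int(time.time())}"


print(load_average_message())
```

Run from cron once a minute, with the output handed to whatever publishes to the exchange, this gives one new data point per server per minute.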
Update: I have a sample here of a simple BASH script to push data to the metrics system.
- Fortigate SNMP Notes – Part 2 – Contains a script to push collectd style metrics gathered from a Fortigate firewall to the RabbitMQ server.