Our 9er Demian recently wrote a blog post about the transition from Nagios to Prometheus & Alertmanager, and how we use the data to trigger alarms and warnings based on predictions and interpretations of linear time series data.
This is very important for the daily operations of a service provider, where availability and performance are key. In addition, the collected data is also very valuable for troubleshooting.
There are often cases where a rule triggers a warning or alarm, but the problem is actually caused by a different issue on the same or another service. Having no data to find the source of the problem is usually NOT the problem - we record tons of metrics for all running services as well as for the OS itself.
The challenge actually lies in the massive amount of information itself - our brains are just not capable of processing huge amounts of numbers to find correlations between the data of different services, or to spot patterns in a single time series.
What is Grafana?
Thankfully, we are very good at spotting unusual behavior with our eyes - at recognizing patterns. This brings us straight to Grafana:
“Grafana is a multi-platform open source analytics and interactive visualization software available since 2014. It provides charts, graphs, and alerts for the web when connected to supported data sources. It is expandable through a plug-in system. Wikipedia”
So how do we use Grafana to troubleshoot our services at Nine?
Grafana can access data directly from Prometheus and many Prometheus exporters even bring their own dashboards.
Reminder: An exporter exposes metrics of the monitored service to Prometheus.
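To illustrate what an exporter does, here is a minimal sketch in Python using only the standard library; the metric name `demo_load1` and port 9999 are made up for this example, and real exporters would typically be built with the official Prometheus client libraries instead:

```python
# Minimal exporter sketch: serve one gauge in the Prometheus text
# exposition format on /metrics, so Prometheus can scrape it.
from http.server import BaseHTTPRequestHandler, HTTPServer
import os

def render_metrics() -> str:
    # Prometheus text format: HELP/TYPE comments, then "name value" lines.
    load1 = os.getloadavg()[0] if hasattr(os, "getloadavg") else 0.0
    return (
        "# HELP demo_load1 1-minute load average of this host.\n"
        "# TYPE demo_load1 gauge\n"
        f"demo_load1 {load1}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

# To run the exporter, Prometheus would then scrape
# http://localhost:9999/metrics on its scrape interval:
# HTTPServer(("", 9999), MetricsHandler).serve_forever()
```

The scraped output is just plain text, which is also why exporters are so easy to debug with `curl`.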
The quality of these dashboards varies, but they are usually a good starting point for developing your own.
A dashboard usually allows the selection of a single host and a time range. Based on this selection, the dashboard displays all recorded metrics of the Prometheus exporter as charts.
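As a sketch of how such a panel is wired up, the query behind one of its charts might look like this in PromQL, where `$host` is the dashboard's template variable (the metric name is an assumption and depends on the exporter you use):

```promql
# Apache requests per second on the currently selected host
rate(apache_accesses_total{instance="$host"}[5m])
```

Grafana substitutes the selected host into the query and restricts the query to the selected time range.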
Using this relatively easy setup we can already spot relationships between metrics of the very same service.
In the example above you can see that the service ran into problems because Apache could not serve any more requests: there were no idle workers left. Just by looking at the charts we learned that we could change an Apache setting, and potentially add a warning for when we are running low on idle workers, allowing us to act before any client services are affected.
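Such a warning could be expressed as a Prometheus alerting rule along these lines - a sketch only, since the metric name (as exposed by your Apache exporter) and the threshold of 5 idle workers are assumptions that depend on your setup:

```yaml
groups:
  - name: apache
    rules:
      - alert: ApacheIdleWorkersLow
        # Fire only if the condition holds for 5 minutes, to avoid flapping.
        expr: apache_workers{state="idle"} < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Apache on {{ $labels.instance }} is running low on idle workers"
```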
If you can’t spot any patterns in a service dashboard like the one above then the cause for the issue might be a problem with the server itself, for example not enough RAM or a network issue. Conveniently, those metrics are exposed by the Prometheus node exporter.
In order to correlate problems to server metrics you have two options:
1. You build the mother of all dashboards, which displays ALL Prometheus metrics for a host on a single page. I personally think this is not a good idea, because even your eyes have a natural limit when processing information, and your browser might die quicker than you think.
2. A better option is to link dashboards together. Conveniently, starting with Grafana 6.5 you can add the current dashboard parameters to an external link. This allows you to quickly switch between the dashboards of different exporters.
Hint: To make this work you must use a common naming scheme for time, host name, and data source on each of your dashboards.
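Assuming a common dashboard variable named `host`, a linked dashboard URL then carries the current selection along, roughly like this (the Grafana host and the dashboard uid `abc123` are hypothetical):

```
https://grafana.example.com/d/abc123/node-exporter?var-host=web01&from=now-6h&to=now
```

Because the target dashboard uses the same variable and time parameter names, it opens pre-filtered to the same host and time range you were just looking at.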
To allow even better troubleshooting in the future, we plan to use Loki. Loki will allow us to integrate logs into Grafana, so we can display log entries next to metrics to gain even deeper insights. More on this in a later blog post.
And last but not least: Grafana is also a cool tool you can provide to your clients.
Everybody loves Charts!
If you would like to know more about our managed services offers please contact us.