Changing the whole monitoring infrastructure is one of those things that are hard to justify: it takes time to implement, and you won't see its effects immediately.
But getting rid of technical debt and allowing faster iteration and improvement is a key part of innovating in the technical space in general. This post is about how we were able to improve with our new monitoring infrastructure.
We recently reworked our whole monitoring approach for managed services here at Nine, moving from blackbox monitoring (Is it down?) to whitebox monitoring (Why is it down?).
This also implied a move away from our old system based on Nagios to a solution around Prometheus, which gives us much more insight into our infrastructure.
In Greek mythology, Prometheus is a Titan, culture hero, and trickster figure who is credited with the creation of humanity from clay, and who defies the gods by stealing fire and giving it to humanity as civilization.
In our case, Prometheus is a monitoring system that records real-time metrics in a time series database, collects them using a pull model, and offers flexible queries and real-time alerting.
The new monitoring and how it helps us
As mentioned above, Prometheus collects metrics as time series. Our previous solution based on Nagios only reported the current value of a monitored metric.
What’s the difference? Prometheus allows for calculating and alerting on averages or even predictions. And it can also be used for visualization. But more about that later.
As an example, Brian Brazil puts it like this:
How often have you gotten alerted about disk space going over some threshold, only to discover it'll be weeks or even months until the disk actually fills?
Let’s say you create an alert on an 80% threshold. With a 10 TB disk, it will fire even though you still have 2 TB of free space. If the customer does not add any data, there’s no reason to alert at this point.
Prometheus allows us to make a linear prediction to only alert when necessary:
- alert: DiskWillFillIn4Hours
  expr: predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
  annotations:
    summary: "Out of disk space in less than 4 hrs (instance )"
    description: "Disk has a size left and will run out of space in less than 4 hrs"
This alert informs an engineer if the disk would fill up within four hours, assuming data keeps being added at the same rate as during the last hour.
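To make the idea behind the rule above more concrete, here is a minimal Python sketch of a linear prediction. It fits a least-squares line through (timestamp, free bytes) samples and extrapolates it. This is an illustration only, not Prometheus' actual implementation: the function names, the sample data, and the simplification of extrapolating from the newest sample are our assumptions.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    return the value it predicts seconds_ahead after the newest sample.

    Simplified stand-in for the PromQL function of the same name.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var                      # bytes per second
    intercept = mean_v - slope * mean_t
    newest_t = max(t for t, _ in samples)
    return slope * (newest_t + seconds_ahead) + intercept


def will_fill(samples, seconds_ahead=4 * 3600):
    """True if the predicted free space drops below zero in the window."""
    return predict_linear(samples, seconds_ahead) < 0


# Hypothetical data: one sample every 15 minutes over the last hour.
# A 10 TB disk that is 80% full but not growing: 2 TB free, constant.
steady = [(t, 2 * 10**12) for t in range(0, 3601, 900)]
# A disk with 500 GB left that loses 50 GB every 15 minutes.
filling = [(t, 700 * 10**9 - 50 * 10**9 * (t // 900)) for t in range(0, 3601, 900)]

print("steady disk alerts: ", will_fill(steady))
print("filling disk alerts:", will_fill(filling))
```

A static 80% threshold would fire for the first disk and possibly stay silent for the second one. The prediction does the opposite, which is the behavior we actually want.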
Prometheus scrapes (pulls) metrics directly from applications if they expose them. Otherwise it scrapes exporters, which act as a gateway: they connect to the application and provide its metrics in the Prometheus format.
Alerting rules are evaluated by Prometheus, and the resulting alerts are pushed to Alertmanager. Alertmanager then forwards each alert to the corresponding notification service.
As Prometheus is a time series database, we don’t only use it for alerting. The same data is used for visualization. For example, we can inspect metrics about a node in Grafana. But more on this in a follow-up post.
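To give an idea of what "the Prometheus format" looks like, the following Python sketch renders a gauge in the text exposition format that a scrape returns. It is a simplified illustration: the helper names, the help text, and the label values are made up for this example, and escaping of the help text is left out.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric family as exposition-format text.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical sample: 2 TB available on the root filesystem.
text = render_gauge(
    "node_filesystem_avail_bytes",
    "Filesystem space available in bytes.",
    [({"mountpoint": "/", "device": "/dev/sda1"}, 2000000000000)],
)
print(text, end="")
```

An exporter does essentially this on every scrape: it queries the application it fronts and serves the current values over HTTP as plain text.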
Why Prometheus is for everyone
Having already used Prometheus to monitor our managed container platform, we now monitor our whole infrastructure with it. That's more than 3.5 million metrics on over 7000 services in three data centers in Zurich.
But Prometheus is not only good at monitoring infrastructure. It’s also a great companion for monitoring applications. That’s why we also use it to gain insight into our own tools and services. There are client libraries for most languages, which make it easy to get started with instrumenting an app for Prometheus.
Some of our engineers even store metrics from sensors in their houses with Prometheus.
If you want to know how Prometheus can help you get insights into your application, feel free to contact us.
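As a rough illustration of what instrumenting an app means, here is a sketch in plain Python that counts calls to a function and sums up the time spent in it. The Counter class and the instrumented decorator are hand-rolled stand-ins we made up for this example. A real client library provides equivalent building blocks and also takes care of exposing the values for scraping.

```python
import threading
import time
from functools import wraps


class Counter:
    """A thread-safe, monotonically increasing value (hypothetical helper)."""

    def __init__(self, name):
        self.name = name
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        with self._lock:
            self._value += amount

    @property
    def value(self):
        with self._lock:
            return self._value


def instrumented(calls, seconds):
    """Decorator that records call count and total duration of a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even if the function raised.
                calls.inc()
                seconds.inc(time.perf_counter() - start)
        return wrapper
    return decorator


# Hypothetical metric names and handler, for illustration only.
REQUESTS = Counter("app_requests_total")
DURATION = Counter("app_request_duration_seconds_total")


@instrumented(REQUESTS, DURATION)
def handle_request(payload):
    return payload.upper()
```

Because both values are counters stored as time series, the request rate and the average duration over any time window can be calculated later at query time instead of inside the application.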