nine.ch provides, amongst other services, Managed Services to customers. These services are hosted on servers that contain at least one disk for storage. We currently manage over 4500 of these. To keep our services available around the clock, we need to make sure none of these disks fills up. If a disk fills up, a service might stop functioning properly, as there will be no room left to write anything else to the disk (logs, files, etc.).
One of the missions of the Engineering team is to optimise existing processes and make them as pleasant and time-efficient as possible for our users. We realised that the Disk Alerts process was fairly time-consuming and (painfully) repetitive for our Operations team, so we decided to automate it.
Until recently, we were using a "semi-automated" approach that involved, in a nutshell, alerts from our monitoring system, various manual checks on the server and contacting the customer manually.
The main source of information was our monitoring system. It notified us as soon as a disk reached 80% or 90% of its capacity.
With such a system, we could react to emergencies, which was and still is essential, but our aim was to prevent problems rather than operate in "emergency mode".
Two manual parts of the process were also quite exhausting for our Operations team and needed improving: manual checks on the server and contacting the customer manually.
In this solution, a few things have changed:
We still use our monitoring system to alert us in case of emergencies (disk usage at 90% or more), and we rely on the new solution to tell us when disk usage is above 80%.
We will present to you our two new friends in the next sections:
Disk Alerts Messenger is the tool responsible for extracting and sending disk usage information.
Customer Interactor is the tool that is responsible for contacting the customer.
We use the term "messenger" for an application that sends messages over AMQP to a message broker.
Disk Alerts Messenger is a Debian package installed on each node (servers, virtual machines, Docker containers, AWS EC2 instances, ...) whose job is to report the disk usage of all the disks present on the node.
If a disk's usage is above the configured threshold (80% by default), we scan the disk to see which files and folders take up the most space, and we send the top N of these (25 by default) via AMQP (we use RabbitMQ as our message broker).
If the disk usage is under the configured threshold, the messenger sends a different AMQP message to report that the disk usage is normal or back to normal.
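To make the messenger's logic more concrete, here is a minimal Python sketch of the two cases described above. It is not the actual Disk Alerts Messenger code: the function names, the JSON field names and the choice to rank files only (not folders) are assumptions made for illustration, and publishing the message to the broker is left out.

```python
import json
import os
import shutil

DEFAULT_THRESHOLD = 80.0  # percent, the default threshold mentioned above
DEFAULT_TOP_N = 25        # number of entries reported, as mentioned above


def usage_percent(path):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(root, top_n=DEFAULT_TOP_N):
    """Walk `root` and return the `top_n` largest files as (size, path) pairs."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:top_n]


def build_message(host, mount, percent,
                  threshold=DEFAULT_THRESHOLD, top_n=DEFAULT_TOP_N):
    """Build a JSON body for an alert or a back-to-normal message.

    The field names ("status", "largest", ...) are hypothetical.
    """
    body = {"host": host, "mount": mount, "usage_percent": round(percent, 1)}
    if percent > threshold:
        body["status"] = "alert"
        body["largest"] = [
            {"path": path, "bytes": size}
            for size, path in largest_files(mount, top_n)
        ]
    else:
        body["status"] = "ok"
    return json.dumps(body)
```

A caller would typically compute `usage_percent(mount)` for each mount point, pass the result to `build_message`, and hand the returned string to an AMQP client library for publishing to RabbitMQ.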
Given a message from a node, this application decides whom we should contact and, more importantly, whether we should contact them at all. It is also responsible for deciding whether an alert can be resolved by us, or only with the intervention of the customer.
For Disk Alerts messages, we first create an issue in our issue tracker with all the information needed to diagnose the problem (host, biggest files/folders, ...).
If we cannot solve the problem directly, we additionally create a ticket in our Customer Service Software to inform the customer about the disk alert.
Any issues in our issue tracker will automatically be resolved if the latest disk alert happened more than 48 hours ago.
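The decision flow above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Actions` type, the function names and the `fixable_by_us` flag are hypothetical, and how the Customer Interactor actually determines whether we can fix a problem ourselves is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

AUTO_RESOLVE_AFTER = timedelta(hours=48)  # window stated above


@dataclass
class Actions:
    create_issue: bool
    create_customer_ticket: bool


def plan_actions(status, fixable_by_us):
    """Decide what to create for an incoming disk message.

    `status` is "alert" or "ok"; `fixable_by_us` is a hypothetical input.
    """
    if status != "alert":
        return Actions(create_issue=False, create_customer_ticket=False)
    # Every alert gets an internal issue; the customer is only contacted
    # when we cannot solve the problem ourselves.
    return Actions(create_issue=True, create_customer_ticket=not fixable_by_us)


def should_auto_resolve(last_alert_at, now):
    """Return True once the latest alert is more than 48 hours old."""
    return now - last_alert_at > AUTO_RESOLVE_AFTER
```

For example, `plan_actions("alert", fixable_by_us=False)` asks for both an issue and a customer ticket, while `should_auto_resolve` would be evaluated periodically against each open issue's most recent alert timestamp.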
Another aspect will be to make the Customer Interactor smarter at diagnosing what the "problem" on the disk really is, along with a suggestion on how to fix it effectively.
Going one step further, once we know the problem and how to solve it, we could also apply automated fixes where applicable (for instance, emptying the /tmp folder).
With the data coming into the Customer Interactor, we will be able to tell more accurately what the main "problems" we have encountered so far are, and how to prevent them effectively. It's only the beginning!