Disk Alerts Automation

David Oct 6, 2017
nine.ch provides, amongst other services, Managed Services to customers. These services are hosted on servers that contain at least one disk for storage. We are currently managing over 4500 of these. To ensure our services are available around the clock, we need to make sure none of these disks get full. If a disk gets full, a service might stop functioning properly, as there will be no room to write anything else to the disk (logs, files, etc.).

One of the missions of the Engineering team is to optimise existing processes and make them as pleasant and time-efficient as possible for our users. We realised that the Disk Alerts process was fairly time-consuming and (painfully) repetitive for our Operations team, so we decided to automate it.

Previous Process

Until recently, we were using a "semi-automated" approach that involved, in a nutshell, alerts from our monitoring system, various manual checks on the server and contacting the customer manually.

The main source of information was our monitoring system, which notified us as soon as a disk reached 80% or 90% of its capacity.

With such a system, we could react to emergencies, which was and still is essential, but our aim was to prevent problems rather than work in "emergency mode".

Two manual parts of the process were also quite exhausting for our Operations team and needed to be improved: the manual checks on the server and contacting the customer manually.

New Process

Assumptions

  1. There are other similar processes that would need a similar solution, so the solution should be easily extensible.
  2. Most of the disk problems currently happening cannot be resolved without the customer, so the solution should also contact customers.

[Figure: new Disk Alerts automation process]

In this solution, a few things have changed:

  1. We now have a clear separation between our monitoring system and this prevention system.
  2. We have replaced the manual checks on the server with the Disk Alerts Messenger tool.
  3. We have also removed the need to contact the customer manually by introducing the Customer Interactor.

We are still using our monitoring system to alert us in case of emergencies (disk usage at 90% or more), and we rely on this solution to tell us when the disk usage is above 80%.
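The split between the two systems can be sketched as a small routing rule. This is only an illustration under our own assumptions: the function name and the returned labels are hypothetical, and only the 80% and 90% limits come from the text.

```python
EMERGENCY_PERCENT = 90   # "disk usage at 90% or more" is handled by monitoring
PREVENTION_PERCENT = 80  # "above 80%" is handled by the prevention system


def responsible_system(usage_percent: float) -> str:
    """Return which system is expected to raise the alert (labels are hypothetical)."""
    if usage_percent >= EMERGENCY_PERCENT:
        return "monitoring"   # emergency: react immediately
    if usage_percent > PREVENTION_PERCENT:
        return "disk-alerts"  # prevention: investigate before it becomes urgent
    return "none"             # usage is normal, nothing to do
```

For example, `responsible_system(85)` falls into the prevention band, while `responsible_system(90)` is already an emergency.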

We will present to you our two new friends in the next sections:

Disk Alerts Messenger is the tool responsible for extracting and sending disk usage information.

Customer Interactor is the tool that is responsible for contacting the customer.

Disk Alerts Messenger

By "messenger", we mean an application that sends messages over AMQP to a message broker.

Disk Alerts Messenger is a Debian package installed on each node (servers, virtual machines, Docker containers, AWS EC2 instances, ...) whose job is to report the disk usage of all the disks present on the node.

If a given disk usage is above the configured threshold (set at 80% by default), we scan the disk to see which files/folders are taking the most space and we send the top N of these (defaults to 25) via AMQP (we use RabbitMQ as our message broker).

If the disk usage is under the configured threshold, it sends another AMQP message to report that the disk usage is normal or back to normal.
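The behaviour described above could be sketched roughly as follows. This is not the code of the actual Debian package: the function names, the JSON field names and the handling of a usage exactly equal to the threshold are assumptions made for illustration. The sketch only builds the message body; publishing it to RabbitMQ over AMQP is left out.

```python
import json
import os
import shutil
from typing import Dict, List, Tuple

THRESHOLD_PERCENT = 80  # default threshold mentioned in the text
TOP_N = 25              # default number of reported files/folders


def usage_percent(path: str) -> float:
    """Return the used share, in percent, of the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_entries(root: str, top_n: int = TOP_N) -> List[Tuple[str, int]]:
    """Return the top_n largest files and folders under root as (path, bytes).

    Simplification: symlinked folders are not followed, and the walk does not
    stop at mount points of other filesystems.
    """
    sizes: Dict[str, int] = {}
    # Walk bottom-up so that sub-folders are sized before their parent.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            try:
                size = os.lstat(file_path).st_size
            except OSError:
                continue  # the file vanished or cannot be read
            sizes[file_path] = size
            total += size
        for name in dirnames:
            total += sizes.get(os.path.join(dirpath, name), 0)
        sizes[dirpath] = total
    sizes.pop(root, None)  # the scanned root itself is not a useful entry
    ranked = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def build_message(host: str, mount: str, percent: float,
                  threshold: float = THRESHOLD_PERCENT,
                  top_n: int = TOP_N) -> str:
    """Build the JSON body of the message (field names are hypothetical)."""
    body = {"host": host, "mount": mount, "usage_percent": round(percent, 1)}
    if percent > threshold:
        # Only scan the disk when the threshold is exceeded, as scans are costly.
        body["status"] = "alert"
        body["largest"] = [{"path": path, "bytes": size}
                           for path, size in largest_entries(mount, top_n)]
    else:
        # Assumption: a usage equal to the threshold counts as normal.
        body["status"] = "ok"
    return json.dumps(body)
```

A node would call `build_message("node1", "/", usage_percent("/"))` for each of its disks and hand the resulting string to its AMQP client.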

Customer Interactor

Given a certain message from a node, this application decides whom we should contact and, more importantly, whether we should contact them at all. It is also responsible for deciding whether an alert can be resolved by us, or only with the intervention of the customer.

For Disk Alerts messages, we first create an issue in our issue tracker with all the information needed to diagnose the problem (host, biggest files/folders, ...).

If the problem cannot be solved by us directly, we additionally create a ticket in our Customer Service Software to inform the customer about the disk alert.

Any issue in our issue tracker is automatically resolved once its latest disk alert is more than 48 hours old.
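The decision flow above can be sketched as follows. The rule used to decide whether we can solve a problem ourselves is a purely hypothetical heuristic (the article does not describe the real one), and the class, function and action names are invented for illustration. Only the order of the steps and the 48-hour limit come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

AUTO_RESOLVE_AFTER = timedelta(hours=48)  # limit mentioned in the text

# Hypothetical rule: data under these prefixes is assumed to be ours to clean up.
PROVIDER_MANAGED_PREFIXES = ("/var/log/", "/var/cache/apt/")


@dataclass
class DiskAlert:
    host: str
    largest_paths: List[str]  # biggest files/folders, largest first
    received_at: datetime


def solvable_by_provider(alert: DiskAlert) -> bool:
    """Assumed heuristic: we can act alone if the biggest entry is provider-managed."""
    if not alert.largest_paths:
        return False  # no scan result, so we cannot tell who should act
    return alert.largest_paths[0].startswith(PROVIDER_MANAGED_PREFIXES)


def plan_actions(alert: DiskAlert) -> List[str]:
    """Return the actions to take for one alert, in order."""
    actions = ["create_issue"]  # an issue is always created first
    if not solvable_by_provider(alert):
        actions.append("create_customer_ticket")
    return actions


def should_auto_resolve(last_alert_at: datetime, now: datetime) -> bool:
    """An issue is resolved once its latest alert is more than 48 hours old."""
    return now - last_alert_at > AUTO_RESOLVE_AFTER
```

Returning a list of action names instead of calling the issue tracker directly keeps the decision logic easy to test and easy to reuse for other kinds of alerts.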

Pros/Cons of the Solution

Pros

  • Automation: the only human intervention is when a customer replies to the support ticket
  • Management of issues for free: built into our issue tracker
  • Decoupling of the publisher/consumer with AMQP

Cons

  • Disk Alerts Messenger is installed on every single node: that makes it a critical piece of software (but with low impact if it crashes)
  • Resource-consuming to scan disks on all machines
  • Some disks are too big to be scanned at all: they will still need manual intervention when they alert, as we cannot determine who should resolve the problem (nine or the customer)

Where Are We Going Next?

Smarter AI & Auto-Resolve Mode

One aspect will be to make the Customer Interactor smarter at diagnosing what the "problem" on the disk really is, and to have it suggest how to fix the problem effectively.

Going one step further, once we know the problem and how to solve it, we could also apply automated fixes where applicable (for instance, emptying the /tmp folder).
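As an illustration of what such an automated fix might look like, here is a cautious sketch that removes old files from a folder. It is an assumption on our side, not an existing feature: the function name, the age limit and the dry-run default are all hypothetical, and it only touches regular files directly inside the folder.

```python
import os
import time
from typing import List, Optional


def clean_stale_files(folder: str, max_age_seconds: float,
                      dry_run: bool = True,
                      now: Optional[float] = None) -> List[str]:
    """Remove regular files in folder whose last modification is too old.

    Returns the affected paths. With dry_run=True (the default), nothing is
    deleted, so the result can be reviewed first.
    """
    now = time.time() if now is None else now
    stale = []
    with os.scandir(folder) as entries:
        for entry in entries:
            # Skip folders and symlinks: only plain files are candidates.
            if not entry.is_file(follow_symlinks=False):
                continue
            age = now - entry.stat(follow_symlinks=False).st_mtime
            if age > max_age_seconds:
                stale.append(entry.path)
    stale.sort()
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Calling `clean_stale_files("/tmp", 7 * 24 * 3600)` would only list the files older than a week; passing `dry_run=False` actually deletes them.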

Data Mining

With the data coming into the Customer Interactor, we will be able to tell more accurately what the main "problems" we have encountered so far are, and how to prevent them effectively. It's only the beginning!