How to get the most out of the nine managed services SLA product with Statuscake

Eric Funk Apr 17, 2020

As a client of nine you can be sure that we will go the extra mile to keep your managed web server or database up and running. We recently replaced our monitoring with Prometheus/Alertmanager, but despite this much-improved technology stack, one area remains uncovered: anything our clients run on our infrastructure. Traditionally, these are websites or custom applications such as Node.js or Java processes.

You might ask yourself: how on earth would nine know all those client applications, and how to fix them in the middle of the night? Well, we have an honest answer: we don’t!

Over time we may get to know some applications because we have established strong partnerships with our clients. Still, this knowledge is not reliable, because applications change and we’re not at the heart of their development.

How can nine offer a Service Level Agreement (SLA) for your application?

The answer is simple - with your help!

As part of the SLA order process you receive a form from our sales department asking for a string check, an action plan, and an escalation plan. We understand that this is sometimes a lengthy process or even a seemingly impossible task because the customer did some remote development. But trust us - it’s worth it. 

Let's start with the string check. Even if we do our best and your web server is running, that doesn't necessarily mean your application is healthy. You might say: my website is healthy if I enter the URL in my browser and see my content, or if a customer is able to place an order in an online shop and this order actually ends up in the database. Right, but how do we describe this to a machine?

We use Statuscake for this task. Statuscake periodically visits your website from different locations, just as you would from your office. To avoid false-positive alerts, we focus on Europe-based check locations.

What’s a good SLA check?

In the simplest case we use Statuscake to check whether your website returns an HTTP 200 “OK” status code. Any other status code, such as a 5xx or 4xx, is an indicator that there is a problem.

The problem here is that your application might serve a page reporting technical issues with a 200 status code. You as a person would instantly see that there is a problem, but a machine can’t!

This is where the so-called string checks come into play. In this setup, Statuscake checks for specific text in the server response in addition to the 200 response code. For example, this could be a copyright text in the footer of your website or an element in the HTML that is not directly visible on the page.
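
The logic behind such a check can be sketched in a few lines of Python. This is only an illustration of the idea, not how Statuscake implements it; the function name, the sample pages, and the footer text are all made up for this example.

```python
from typing import Tuple


def evaluate_string_check(status_code: int, body: str, expected_text: str) -> Tuple[bool, str]:
    """Return (healthy, reason) for a single HTTP response."""
    # Step 1: the plain status code check.
    if status_code != 200:
        return False, f"unexpected status code {status_code}"
    # Step 2: the string check on the response body.
    if expected_text not in body:
        return False, f"expected text {expected_text!r} not found"
    return True, "ok"


# Hypothetical responses: a normal page and a maintenance page,
# both served with status code 200.
normal_page = "<html><body>Welcome!<footer>(c) Example Shop</footer></body></html>"
maintenance_page = "<html><body>We are experiencing technical issues.</body></html>"

print(evaluate_string_check(200, normal_page, "(c) Example Shop"))
print(evaluate_string_check(200, maintenance_page, "(c) Example Shop"))
print(evaluate_string_check(503, "", "(c) Example Shop"))
```

In this sketch the maintenance page passes a plain status code check but fails the string check, because the expected footer text is missing from the response.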

It's an improvement, but what if the error page uses the same footer, or a content editor changes the text in the middle of the night? In the first case a real outage goes unnoticed; in the second an alarm is triggered although nothing is broken at all! For the sake of the mental health of the on-call person receiving the alert in the middle of the night, we should prevent this by any means!

This brings us to the preferred option: a dedicated status check on your website. It takes some time to develop, but many web frameworks, such as Spring Boot (see Actuator), already offer this out of the box.

What makes a good status check?

  • It resides on a dedicated URL - for example https://www.mysite.com/status
  • It’s not affected by a graphical redesign or content update
  • It should never do a redirect
  • It checks all subsystems required to make your website work. If you use a database - check if you can write and read data. If you use a Redis cache - check if it works. And please - tell us which one is failing.
  • Don't rely on any external sources. This check should reflect your application running on the nine infrastructure. External components should never be part of this.
  • If you are deploying a new version of your application and downtime is expected, the status check should not fail or become inaccessible.
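
As a rough illustration of these criteria, the following Python sketch builds the response for a status URL from a set of subsystem probes. It assumes a lot: an in-memory SQLite database stands in for the real application database, the cache probe is a hypothetical function that simply simulates a failure, and all names are invented for this example. A real application would wire this into its web framework and probe its actual subsystems.

```python
import json
import sqlite3
from typing import Callable, Dict, Tuple


def check_database() -> None:
    """Write a row and read it back; raises on any failure.

    An in-memory SQLite database stands in for the real application database.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE status_probe (value TEXT)")
        conn.execute("INSERT INTO status_probe (value) VALUES (?)", ("ping",))
        row = conn.execute("SELECT value FROM status_probe").fetchone()
        if row != ("ping",):
            raise RuntimeError("read back unexpected data")
    finally:
        conn.close()


def check_cache() -> None:
    """Hypothetical cache probe that simulates an unreachable cache."""
    raise ConnectionError("cache not reachable")


def build_status(checks: Dict[str, Callable[[], None]]) -> Tuple[int, str]:
    """Run all subsystem checks and return (http_status, json_body)."""
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            # Record the failing subsystem by name so the on-call engineer
            # can see which one is broken.
            failures[name] = str(exc)
    if failures:
        return 500, json.dumps({"status": "error", "failed": failures})
    return 200, json.dumps({"status": "ok"})


print(build_status({"database": check_database}))
print(build_status({"database": check_database, "cache": check_cache}))
```

The second call returns a 500 status together with the name of the failing subsystem, which is exactly the information the on-call engineer needs first.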

Example:

curl -v https://www.mysite.com/status
< Date: Tue, 04 Nov 2014 19:12:59 GMT
< Content-Type: application/json; charset=utf-8
< Status: 200 OK
{"status":"ok"}

curl -v https://www.mysite.com/status
< Date: Tue, 04 Nov 2014 19:12:59 GMT
< Content-Type: application/json; charset=utf-8
< Status: 500 Internal Server Error
{"status":"database schema has changed"}

Now we know that your website is failing, and possibly also why, but we still don't know how to fix it.

This brings us to the so-called action plan and the escalation plan.

What’s a good action plan?

The action plan is a manual for the on-call engineer. It describes in detail what to do in case of an emergency. Without an action plan, we will of course use standard measures to get your application back online again (like checking log files for anomalies, restarting services if needed). But if you provide us with an action plan, we are able to take specific measures tailored to fit your application and make it available again as fast as possible.

An action plan might contain instructions such as:

  • clearing application-specific caches
  • enabling a special maintenance page
  • restarting the application

We will review the instructions during the onboarding process and give valuable feedback on how to address those problems before they actually escalate into a failing status check.

A good rule of thumb: if the application or one of its components can detect the issue on its own, the corresponding measures should be automated so that they don't require our help.
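
One possible shape for such automation is a small wrapper that applies a known remedy and only reports a failure if the problem persists. The sketch below is an assumption about how this could look, not a description of anything nine provides; the function names and the stale-cache scenario are hypothetical.

```python
from typing import Callable


def self_heal(check: Callable[[], bool], remedy: Callable[[], None], max_attempts: int = 2) -> bool:
    """Return True if the check passes, applying the remedy after each failed attempt."""
    for _ in range(max_attempts):
        if check():
            return True
        # The application detected the issue itself, so it tries the known fix.
        remedy()
    # Final verdict after the last remedy; False means a human has to step in.
    return check()


# Hypothetical scenario: a stale application cache that can be cleared automatically.
state = {"cache_stale": True}


def cache_is_healthy() -> bool:
    return not state["cache_stale"]


def clear_cache() -> None:
    state["cache_stale"] = False


print(self_heal(cache_is_healthy, clear_cache))
```

Only when a function like this still returns False would the status check need to fail and the on-call engineer get involved.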

If the engineer cannot fix the source of the problem, we will use the contacts provided in the escalation plan to reach out to your technical support. Statuscake also offers a notification feature which could automate this step but we don’t use this yet. 

Are you a developer with feedback on the criteria for a good string check? Please let us know through the social feedback channels. We’re keen to hear about your experiences and are happy to use your input for further changes to our onboarding processes and runbooks.

If you are interested in our SLA product or want to take advantage of a better status check, please don’t hesitate to contact us.
