Nine handling a major incident in times of the coronavirus

... or how we almost locked ourselves out of our managed servers!

Thursday, 19 March 2020 starts as a "normal" day in the home office. The whole team at nine has been working from home since the beginning of the week and has by now adjusted to the new situation to some extent. We use Slack or Hangouts for our daily stand-ups and meeting video calls, and discuss on Slack which of the two is the more stable system.

11:00 We realize that there is a problem - one of our customers can't connect via VPN and our monitoring alerts us that some servers are down. But the exact extent of the problem is still unknown - is it a small outage or a crisis?

One thing is clear: we must react immediately! More and more critical warnings are coming in from internal and external monitoring systems, and individual employees report via Slack that they are unable to connect to servers and that VPN connections are being rejected. A first analysis takes place within the team, and it is obvious that something is wrong with the network, perhaps the DNS resolvers. We have a crisis!

11:20 The officer of the day, Tajno, starts a dedicated Slack channel. Stefan and Patrick from the IT / Managed Services team take over coordination within the teams, and Kyon from the Customer Service Desk (CSD) takes over external communication.

Of course, our customers have noticed that their websites and the mail system are no longer working, and the CSD has its hands full answering their questions. The question is mostly the same: "How long will it take until my website is back or until I can send important emails again?" We don't know the answer either, as we are just as much in the dark at this point. At least we know by now that the cause is not the network. Even this conclusion is not easy to reach, because our monitoring systems are not accessible either.

But we quickly notice that existing SSH connections to our servers are still open and only new connections are rejected. We no longer believe that our DNS resolvers are the root problem. Yes, they no longer work. But that is just another symptom, because we cannot connect to some of our servers even directly via their IP addresses:

We are effectively locked out of our own servers!

12:10 Lukas drives to the data center, Team Managed Services - especially André - switches to hacker mode.

Without access to our own servers and without DNS we are effectively stuck. There are now two options: physical access to the servers, or access via a management system that we also use in our daily business to repair servers that no longer boot. André finds a working management server through our emergency system. On this server he uses Google DNS to look up the IP of our resolver:

dig nsr1.nine.ch @8.8.8.8 

He can connect to the IP via SSH. Success. We are getting closer to the cause of the problem. 

Some of our root customers use this trick as well: they bring their servers back online themselves by switching from the nine resolvers to other working DNS resolvers.
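On a typical Linux root server, this switch can be as simple as temporarily pointing /etc/resolv.conf at a public resolver. A minimal sketch, assuming resolv.conf is not managed by another tool and outbound DNS is still allowed:

# Temporarily use a public resolver instead of the unreachable nine resolvers
cp /etc/resolv.conf /etc/resolv.conf.backup
echo "nameserver 8.8.8.8" > /etc/resolv.conf
dig nine.ch   # verify that name resolution works again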

It quickly becomes clear that the firewall rules, which normally prevent unauthorized access to the systems, have been changed. They have been reset to a default that does not even allow nine employees to log in from the management system, and suddenly we are being asked for passwords. Likewise, access to our monitoring and to services such as web servers, databases and DNS is no longer possible. We observe the same behavior on other servers. The rules are quickly adjusted manually and the first system is working again. But we potentially still have thousands of affected systems.
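Regaining access essentially means inserting an allow rule for SSH from the management network ahead of the restrictive defaults. A sketch with placeholder addresses - the real rule sets are of course specific to our setup:

# Re-allow SSH from the management network (192.0.2.0/24 is a placeholder)
# before the complete rule set is restored
iptables -I INPUT -p tcp --dport 22 -s 192.0.2.0/24 -j ACCEPT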

12:35 André reports via Slack that the first resolver is available again. The task force, which now consists of all Managed Services team members, starts to restore the remaining nine systems and coordinates via Slack.

Now that we can reach our internal systems again via DNS, we start to restore the customer systems with an SLA. But then the unexpected happens!

Already "fixed" systems are broken again.

Take a breath - there is only one reason why something like this can happen. We disable Puppet. Systems on which Puppet is disabled stay fixed.

Puppet is our automation software; it manages nearly every aspect of a server - software installation, security and upgrades.
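Disabling the agent is a single command per server, roughly like this (the lock message is just an example):

# Prevent Puppet from applying any further changes until it is re-enabled
puppet agent --disable "2020-03-19: firewall rules keep being reset"
# Once the root cause is fixed:
puppet agent --enable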

Why are you doing this to us, Puppet? Is there a faulty commit that tells Puppet to make this change? Josi gives the all-clear: there are no suspicious commits in the Git history. We remember that a new Puppet package was installed on our servers that very day as part of our regular maintenance. According to the changelog it was a patch release with no functional changes. The observed symptoms do not match this update at all. Nevertheless - the problem is now definitely Puppet.
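On a Debian-based managed server, a quick way to confirm which Puppet package version actually landed that morning is to ask the package manager - a sketch, assuming the package is simply called "puppet":

# Which version of the Puppet package is installed right now?
apt-cache policy puppet
# When was it upgraded? dpkg logs every package operation with a timestamp:
grep "upgrade puppet" /var/log/dpkg.log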

13:15 Phil informs our customers on https://status.nine.ch that the root cause has been found and we are now working on the solution.

Daniel suspects that Puppet has "forgotten" the assignment of nodes - based on their certificates - to the corresponding node manifests and therefore applies a default configuration. Patrick spots a certificate error in the Puppet logs.

[Screenshot: certificate error in the Puppet log]

A first test with an FQDN instead of the hostname provides clarity: the two must be related.

In parallel to the engineers who are fixing our infrastructure and customer systems at full speed, our code specialist Demian starts to analyze the Puppet problem in detail. Puppet is open source software: every change made by the developers is traceable in the program code.

Antonios sends a link via Slack:
https://nvd.nist.gov/vuln/detail/CVE-2020-7942#VulnChangeHistorySection

Could this change be the cause? The Puppet packages are quickly reverted to the previous version (5.2.13 to 5.2.12), but the problem is not fixed. Meanwhile Demian digs into Puppet's code and finds a commit that looks strange: https://github.com/puppetlabs/puppet/commit/df826baa0ed1f3ebb182798aa6e04a9e8f35fd80#

(PUP-10238) Change default value of strict_hostname_checking to true

Previously our default value of strict_hostname_checking was false which allowed matching dotted segments of a nodes certname (its CN in its certificate) as well as the segments of its fqdn fact, or hostname + domain fact.

[Screenshot: the commit message changing the strict_hostname_checking default]

Got it! A test in which we define a server in the Puppet manifest by its FQDN instead of just the server name shows that Puppet works again afterwards. With the new default, a node definition that only uses the short server name no longer matches the node's full certname, so Puppet falls back to the default node - and to its restrictive configuration.

Because the load balancer for Puppet on the Puppet master is switched off for known reasons, we continue testing by pointing the Puppet client at one of the other servers instead of the master:

puppet agent --test --noop --server puppetslave.nine.ch

[Screenshot: output of the puppet agent test run]

It's great that we have found a fix for the problem, but we can't rename all server manifests in such a short time! We need the original behavior from before the default was changed. Fortunately, this can be set globally on the Puppet server, permanently overriding the changed default - which was not reset by the downgrade.
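In practice this means pinning the old value explicitly in puppet.conf on the Puppet master. A sketch - depending on the setup, the setting belongs in the [master] or [main] section:

# Restore the old default on the Puppet master, overriding the new package default
puppet config set strict_hostname_checking false --section master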

15:28 Status update - all systems using IPv4 are available again.

In parallel to the Puppet fix, the rest of the task force has restored the iptables rule sets using a bash script.

# Check out yesterday's rule files from the git-tracked /etc/iptables directory and reload the IPv4 rules on every managed server
for i in $(list_managed_servers); do echo $i; ssh $i "cd /etc/iptables && git checkout @{yesterday} -- rules.v* && iptables-restore < /etc/iptables/rules.v4"; done

# The IPv6 rule file was already checked out above; now reload it as well
for i in $(list_managed_servers); do echo $i; ssh $i "cd /etc/iptables && ip6tables-restore < /etc/iptables/rules.v6"; done

16:30 We can send out the following message: "All systems are now reachable again over IPv4 and IPv6."

Until late in the evening, Managed Services engineers check all systems and finally turn on Puppet again. 

After 7:39:07 hours, Tajno ends the Slack call.

[Screenshot: the Slack call ends after 7:39:07]

All in all, we were lucky in our misfortune. Because Puppet had also disabled our DNS resolvers and blocked its own services via the firewall, the error could not spread to all systems.

What did we learn from this? A lot! So far, 14 tickets have been created with measures we need to take to optimize our processes and engineering. For instance, an emergency wiki, email or password tool must not only be hosted at a different location, but must also be completely independent of the nine.ch domain.

We will be focusing on optimizations in customer communication as well as on security and process improvements around Puppet and patching. The Managed Services team will soon be publishing further blog posts on these topics.

What experiences have you had with your configuration management tools? Share your stories with us via Twitter, LinkedIn or YouTube.


Status updates:

https://status.nine.ch