Network Policies block DNS resolution

Ramon Rüttimann Nov 25, 2021

Status: Complete, action items in progress

Summary

Due to a change to one of our components, a NetworkPolicy was deployed into the kube-system namespace that disallowed traffic from customer namespaces to any pod running in kube-system. Most importantly, this made kube-dns unavailable.

Impact

DNS resolution (and other traffic to pods running in kube-system) did not work, meaning that applications depending on resolving a domain name, even just an internal service name, failed.

Root Causes

For our nine Managed GKE product, we separate our managed applications from customer applications by running them on separate nodes and placing them in separate, nine-prefixed namespaces. We have controllers and webhooks in place that label every namespace with nine-namespace-type: customer or nine-namespace-type: nine, respectively. If that label indicates that a namespace is a nine namespace, a NetworkPolicy is deployed that blocks all traffic from non-nine namespaces to the pods running in that namespace.
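
To illustrate the mechanism, here is a minimal sketch of what such a NetworkPolicy could look like, assuming only the nine-namespace-type label described above; the policy and namespace names are placeholders, and the manifest actually deployed by our controllers may differ:

  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-only-nine-namespaces      # placeholder name
    namespace: nine-example               # placeholder for a nine-prefixed namespace
  spec:
    podSelector: {}                       # applies to every pod in the namespace
    policyTypes:
      - Ingress
    ingress:
      - from:
          - namespaceSelector:
              matchLabels:
                nine-namespace-type: nine # only traffic from nine namespaces is allowed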

For our upcoming NKE product, we deployed a change to these controllers to retroactively add the nine label to all namespaces considered to be under our control. The issue with that change was that the controllers also consider kube-system and the default namespace to be nine namespaces and thus added the labels and deployed the network policies to these namespaces as well. These policies in turn blocked all traffic not coming from nine namespaces, breaking DNS resolution for customers' pods. Because our monitoring also runs in nine namespaces, we did not get an alert until the first customer inquiry related to this issue reached us.
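
Illustratively, the unintended state on kube-system then looked roughly like this (the exact manifests are not reproduced in this report):

  apiVersion: v1
  kind: Namespace
  metadata:
    name: kube-system
    labels:
      nine-namespace-type: nine   # added retroactively by the changed controllers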

Resolution

As soon as we had located the issue, we reverted the changes to our controllers and started removing the labels from the affected namespaces. With the labels removed, we could then also remove the network policies, which resolved the issue.

Detection

The issue was detected by one of our customers, who contacted us to ask about the root cause.

Action items

  • Add DNS monitoring to nMGKE and NKE, checking whether DNS itself is working (one possible shape for such a check is sketched after this list).
  • Do a full end-to-end test with a fake customer application in our pipeline, ensuring that Kubernetes functionality works from the customer’s point of view.
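
One possible shape for such a DNS check (a sketch only; the tooling, image, schedule and namespace name are assumptions rather than decisions documented here) is a periodic lookup running from a namespace that is not treated as a nine namespace, so that it fails the same way customer workloads would:

  apiVersion: batch/v1
  kind: CronJob
  metadata:
    name: dns-canary                # hypothetical name
    namespace: dns-canary           # deliberately not a nine-prefixed namespace
  spec:
    schedule: "*/5 * * * *"         # run a lookup every five minutes
    jobTemplate:
      spec:
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
              - name: lookup
                image: busybox:1.36
                # resolves an in-cluster service name via kube-dns; a non-zero
                # exit marks the job as failed, which alerting on failed jobs
                # can then pick up
                command: ["nslookup", "kubernetes.default.svc.cluster.local"]

Because this canary does not run in a nine namespace, a policy like the one from this incident would make it fail, which is exactly the blind spot in our monitoring noted above.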

Timeline

  • 23.11.2021 12:14 CET changes to the controllers were rolled out to all clusters.
  • 23.11.2021 12:57 CET got contacted by the first customer (and soon after by other customers as well) reporting that DNS resolution was not working as expected.
  • 23.11.2021 13:05 CET all hands on deck to figure out the root cause and find ways to mitigate the issue.
  • 23.11.2021 13:13 CET triggered the pipeline to revert the changes that caused the issue.
  • 23.11.2021 13:25 CET undid the changes that had been made by the controllers (reverting the code did not undo the changes) in all projects, effectively resolving the issue.
  • 23.11.2021 13:38 CET confirmed that DNS resolution worked as expected again.