Status: Complete, action items in progress
Due to a change to one of our components, a NetworkPolicy was deployed into the kube-system namespace which disallowed traffic from customer namespaces to any pod running in kube-system. Most importantly, this broke DNS resolution (and any other traffic to pods running in kube-system), meaning that applications that depend on resolving a domain name, even just an internal service name, failed.
For our nine Managed GKE (nMGKE) product, we separate our managed applications from customer applications by running them on separate nodes and keeping them in separate, nine-prefixed namespaces. We have controllers and webhooks in place that label every namespace to indicate whether it is a nine-namespace or a customer namespace. If that label indicates that a namespace is a nine-namespace, a NetworkPolicy is deployed that blocks all traffic from non-nine namespaces to the pods running in that namespace.
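To make this mechanism concrete, the following is a minimal sketch of such a NetworkPolicy. The label key and value (nine.ch/managed-by: nine) and the resource names are assumptions chosen for illustration, not the exact resources our controllers deploy:

```yaml
# Sketch only: allows ingress solely from namespaces carrying the (assumed)
# nine label; everything else, including customer namespaces, is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-nine-namespaces-only   # hypothetical name
  namespace: nine-example                 # a nine-prefixed namespace
spec:
  podSelector: {}            # applies to every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              nine.ch/managed-by: nine    # assumed label key/value
```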
For our upcoming NKE product, we deployed a change to these controllers to retroactively add the nine-label to all namespaces that are considered to be under our control. The issue with that change was that the controllers also consider kube-system and the default namespace to be nine-namespaces and thus added the labels and deployed the network policies to these namespaces as well. These policies in turn blocked all traffic not coming from nine-namespaces, breaking DNS resolution for customers' pods. Because our monitoring also runs in nine-namespaces, we did not receive an alert before the first customer inquiry related to this issue reached us.
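With the same assumed label key as above, kube-system effectively looked like this after the faulty rollout; the nine label on a system namespace is what triggered the deployment of the blocking NetworkPolicy into the namespace that hosts cluster DNS:

```yaml
# Sketch of the unintended state; the label key/value is an assumption.
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  labels:
    nine.ch/managed-by: nine   # incorrectly added by the controllers
```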
As soon as we had located the issue, we reverted the changes to our controllers and started removing the labels from the namespaces again. With the labels removed, we could then also remove the network policies to fix the issue.
The issue was detected by one of our customers, who contacted us to ask about the root cause.
Action items:
- Add DNS monitoring to nMGKE and NKE, checking whether DNS resolution itself is working (a minimal probe sketch follows this list).
- Do a full end-to-end test with a fake customer application in our pipeline, ensuring that Kubernetes functionality works from the customer’s point of view.
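As a starting point for the DNS monitoring action item, a probe can be as simple as the following sketch: a CronJob running in a namespace that is deliberately not a nine-namespace, resolving an internal service name every minute. The namespace, image, and schedule are placeholders, and the eventual check will likely be wired into our existing alerting rather than run as a standalone CronJob:

```yaml
# Minimal DNS probe sketch; a non-zero exit marks the Job as failed,
# which can be alerted on.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dns-probe                     # hypothetical name
  namespace: customer-simulation      # placeholder, not a nine-namespace
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: probe
              image: busybox:1.36
              # Resolve an in-cluster service name, just like customer workloads do.
              command: ["nslookup", "kubernetes.default.svc.cluster.local"]
```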
Timeline:
- 23.11.2021 12:14 CET changes to the controllers were rolled out onto all clusters.
- 23.11.2021 12:57 CET got contacted by the first customer (and soon after by other customers as well) reporting that DNS resolution was not working as expected.
- 23.11.2021 13:05 CET all hands on deck to figure out the root cause and find ways to mitigate the issue.
- 23.11.2021 13:13 CET triggered the pipeline to revert the changes that caused the issue.
- 23.11.2021 13:25 CET undid the changes that had been made by the controllers (reverting the code did not undo the changes) in all projects, effectively resolving the issue.
- 23.11.2021 13:38 CET confirmed that DNS resolution worked as expected again.