Pods are unable to resolve DNS for any Azure service or other external sites.

We've been running a couple of Kubernetes clusters for a couple of months. Last weekend, everything stopped working: DNS requests from our pods started failing. We checked our network for unexpected changes and found that nothing had changed in Azure.

Pods stopped resolving DNS names for Azure services such as Postgres, API Manager, Redis, MongoDB, Blob Store, etc., as well as some external services such as Auth0. Those same names can be resolved if we test from the nodes on which the pods are running.

We should never experience DNS resolution issues. This was all working a week ago.

I can easily reproduce from within my pods. Not sure how you would reproduce if you're not experiencing DNS issues.

We have tried several things to resolve this:

  1. Deleted the pods to get new replicas.
  2. Rebooted the VMs.
  3. Deleted the kube-system DNS services to get new replicas.

Nothing works.

As referenced above, we can telnet anywhere from any of the nodes without issue. But from within the pod it fails.
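
One way to make the node-versus-pod comparison repeatable is a pair of small shell helpers. This is only a sketch: the function names are made up for illustration, the pod, namespace, and hostname are placeholders you supply, and it assumes the pod image ships `nslookup` and the node has `getent`.

```shell
# Resolve a name from inside a pod; prints "pod: OK" or "pod: FAIL".
# Arguments (all placeholders): pod name, namespace, hostname to resolve.
check_pod_dns() {
  local pod="$1" namespace="$2" host="$3"
  if kubectl --namespace="$namespace" exec "$pod" -- nslookup "$host" >/dev/null 2>&1; then
    echo "pod: OK"
  else
    echo "pod: FAIL"
  fi
}

# Resolve a name on the current machine; run this on the node itself.
# Prints "node: OK" or "node: FAIL".
check_node_dns() {
  local host="$1"
  if getent hosts "$host" >/dev/null 2>&1; then
    echo "node: OK"
  else
    echo "node: FAIL"
  fi
}
```

For example, run `check_node_dns example.com` on a node and `check_pod_dns my-pod default example.com` from a machine with `kubectl` access. With the symptoms described here, the node check would be expected to pass while the pod check fails.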

Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:50:14 2017
OS/Arch: linux/amd64
Experimental: false

Master - resolv.conf

โœ”๏ธAccepted Answer

Here is a quick status update:

There are two problems happening concurrently. While similar, they are not related:

  1. kubedns stops resolving names and logs 'i/o timeout' while connecting to the VNET DNS server. A transient connection error to the metadata endpoint becomes a persistent error in kubedns. The pod network namespace can still connect to external and internal VNET IPs. You can confirm this with:
kubectl --namespace=kube-system exec -it ${KUBE_DNS_POD_NAME} -c kubedns -- sh
# run ping or nslookup against the metadata endpoint

Restarting the pod and/or the container should suffice to fix this.
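
Restarting the kube-dns pods can be sketched as follows. The function name is hypothetical, and the sketch assumes the pods carry the conventional `k8s-app=kube-dns` label; check with `kubectl --namespace=kube-system get pods --show-labels` before relying on it.

```shell
# List the kube-dns pods, then delete them so their Deployment recreates them.
# Assumption: the pods are labelled k8s-app=kube-dns.
restart_kube_dns() {
  kubectl --namespace=kube-system get pods -l k8s-app=kube-dns -o name
  kubectl --namespace=kube-system delete pod -l k8s-app=kube-dns
}
```

Calling `restart_kube_dns` prints the pod names first, so there is a record of what was deleted.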

To stop this from happening, edit the Kubernetes DNS addon on the master (repeat for every master):

vi /etc/kubernetes/addons/kube-dns-deployment.yaml

Change the args for the healthz container to the following:

- "--cmd=nslookup >/dev/null"
- "--url=/healthz-dnsmasq"
- "--cmd=nslookup >/dev/null"
- "--url=/healthz-kubedns"
- "--port=8080"
- "--quiet"

This replaces the check that uses nslookup kubernetes.... and forces the kubedns container to restart if the above condition occurs.

  2. The entire kubedns network namespace loses connectivity to internal (metadata endpoint) and external IPs. This has been observed on Azure CNI but has not yet been confirmed on other CNIs. To confirm this, jump into any of the containers in the kubedns pod and test the network (even curl will fail, while other pods on the same node are functioning properly).

To work around this, either:

  1. Move the pod to a different node.
  2. Restart the node.
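
Moving the pod to a different node could be sketched like this. The function name and both arguments are placeholders, and the `k8s-app=kube-dns` label in the last command is an assumption about how the pods are labelled.

```shell
# Move a stuck kube-dns pod off its node.
# Arguments (placeholders): node name, kube-dns pod name.
move_kube_dns_pod() {
  local node="$1" pod="$2"
  # Mark the node unschedulable so the replacement pod lands elsewhere.
  kubectl cordon "$node"
  # Delete the stuck pod; its Deployment creates a replacement.
  kubectl --namespace=kube-system delete pod "$pod"
  # Show which node each kube-dns pod is running on now.
  kubectl --namespace=kube-system get pods -l k8s-app=kube-dns -o wide
}
```

Once the node is healthy again, `kubectl uncordon <node>` makes it schedulable.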

We are actively working on getting an RCA for this issue.

