Solved: acs-engine Pods are unable to resolve DNS for any Azure service or other external sites.

Is this a request for help?:

Yes

Is this an ISSUE or FEATURE REQUEST? (choose one):

Issue

What version of acs-engine?:
1.31.1

Kubernetes

If this is an ISSUE, please:

We've been running a couple of K8s clusters for a couple of months. Last weekend, everything stopped working: DNS requests were failing. We checked our network for unexpected changes and found that nothing had changed in Azure. Our pods are unable to resolve DNS names.

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

Kubernetes

What happened:

Pods stopped resolving DNS names for Azure services such as Postgres, API Manager, REDIS, MongoDB Blob Store, etc. as well as some external services such as Auth0. Those same sites can be resolved if we test from the nodes on which the pods are running.

What you expected to happen:
We should never experience DNS resolution issues. This was all working a week ago.

How to reproduce it (as minimally and precisely as possible):
I can easily reproduce from within my pods. Not sure how you would reproduce if you're not experiencing DNS issues.

Anything else we need to know:

We have tried a bunch of things to resolve this:

  1. Deleted the pods to get new replicas
  2. Rebooted the VMs
  3. Deleted the kube-system DNS services to get new replicas

Nothing works.

As referenced above, we can telnet anywhere from any of the nodes without issue. But from within the pod it fails.
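For anyone who wants to repeat the node-versus-pod comparison, here is a minimal sketch. The function name and its arguments are hypothetical, and it assumes the pod image ships nslookup. Run a plain nslookup for the same name on the node to get the other half of the comparison.

```shell
# Check whether HOST resolves from inside POD (optional NAMESPACE, default "default").
# ASSUMPTION: the container image in the pod provides nslookup.
check_dns_from_pod() {
    local pod="$1" host="$2" namespace="${3:-default}"
    if kubectl --namespace="$namespace" exec "$pod" -- nslookup "$host" >/dev/null 2>&1; then
        echo "pod $pod: $host resolves"
    else
        echo "pod $pod: $host does NOT resolve"
        return 1
    fi
}
# Example (placeholder names): check_dns_from_pod my-app-pod example.com
```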

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:50:14 2017
OS/Arch: linux/amd64
Experimental: false

Master - resolv.conf
nameserver 168.63.129.16
search n2wydozmtkcurochzy4mep2cdc.ax.internal.cloudapp.net

44 Answers

โœ”๏ธAccepted Answer

Here is a quick status update:

There are two problems happening concurrently. They look similar, but they are not related:

  1. kubedns stops resolving names and logs 'i/o timeout' while connecting to the VNET DNS server. A transient connection error to the metadata endpoint becomes a persistent error in kubedns, even though the pod network namespace can still connect to external and internal VNET IPs. You can confirm this with:
kubectl --namespace=kube-system exec -it ${KUBE_DNS_POD_NAME} -c kubedns -- sh
# then run ping or nslookup against the metadata endpoint
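If you don't have the pod name handy, the confirmation step can be scripted roughly as below. This is only a sketch: the function names are made up, it assumes the kube-dns pods carry the usual k8s-app=kube-dns label, and the 168.63.129.16 address in the example is simply the nameserver from the resolv.conf posted in the question.

```shell
# Print the name of the first kube-dns pod.
# ASSUMPTION: the pods are labelled k8s-app=kube-dns; verify on your cluster.
kube_dns_pod() {
    kubectl --namespace=kube-system get pods -l k8s-app=kube-dns \
        -o jsonpath='{.items[0].metadata.name}'
}

# Run nslookup for NAME against SERVER inside the kubedns container.
kubedns_lookup() {
    local name="$1" server="$2"
    kubectl --namespace=kube-system exec "$(kube_dns_pod)" -c kubedns -- \
        nslookup "$name" "$server"
}
# Example: kubedns_lookup bing.com 168.63.129.16
```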

Fix
Restarting the pod and/or the container should suffice to fix this.
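One way to restart the pod is to delete it and let its controller create a replacement. A minimal sketch, assuming the pods carry the usual k8s-app=kube-dns label (the function name is hypothetical):

```shell
# Delete all kube-dns pods so that replacements are created.
# ASSUMPTION: the pods are labelled k8s-app=kube-dns; verify on your cluster.
restart_kube_dns() {
    kubectl --namespace=kube-system delete pods -l k8s-app=kube-dns
}
```

Expect a short DNS outage inside the cluster while the replacement pods start.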

Stop this from happening
Edit the Kubernetes DNS addon on the master (repeat for every master):

vi /etc/kubernetes/addons/kube-dns-deployment.yaml

Change the args for the healthz container to the following:

- "--cmd=nslookup bing.com 127.0.0.1 >/dev/null"
- "--url=/healthz-dnsmasq"
- "--cmd=nslookup bing.com 127.0.0.1:10053 >/dev/null"
- "--url=/healthz-kubedns"
- "--port=8080"
- "--quiet"

These replace the existing checks that use nslookup kubernetes.... With this change, the kubedns container is forced to restart if the above condition occurs.
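To double-check the edit on each master, something like the following can list the probe commands currently in the manifest. It is just a grep wrapper with a hypothetical name; the default path is the one given above.

```shell
# Print the healthz --cmd probe lines from a kube-dns manifest.
# With no argument, the addon path mentioned in this answer is used.
show_healthz_cmds() {
    grep -- '--cmd=' "${1:-/etc/kubernetes/addons/kube-dns-deployment.yaml}"
}
```

After the change, both lines printed should mention bing.com.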

  2. The entire kubedns network namespace loses connection to internal (metadata endpoint) and external IPs. This has been observed on Azure CNI but has not yet been confirmed on other CNIs. To confirm this, jump into any of the containers in the kubedns pod and test the network (even curl https://10.0.0.1 will fail, while other pods on the same node function properly).

Solutions:

  1. Move the pod to a different node.
  2. Restart the node.
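Moving the pod can be done by cordoning its current node and deleting the pod, so the replacement has to be scheduled elsewhere. A rough sketch under those assumptions (the function name is hypothetical, and it does not uncordon the node for you):

```shell
# Cordon the node that POD runs on, then delete POD so its replacement
# is scheduled on another node. NAMESPACE defaults to kube-system.
move_pod_off_node() {
    local pod="$1" namespace="${2:-kube-system}" node
    node=$(kubectl --namespace="$namespace" get pod "$pod" \
        -o jsonpath='{.spec.nodeName}') || return 1
    kubectl cordon "$node" || return 1
    kubectl --namespace="$namespace" delete pod "$pod" || return 1
    echo "Replacement pod will schedule elsewhere; afterwards run: kubectl uncordon $node"
}
# Example (placeholder pod name): move_pod_off_node kube-dns-abc
```

The uncordon is left as a manual step so that the node is only reopened once the replacement pod is running on another node.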

We are actively working on getting an RCA for this issue.
