Troubleshooting Kubernetes DNS and Service Networking
It's always DNS. Here's a systematic way to debug Kubernetes service discovery and networking failures, from CoreDNS to kube-proxy, with AI to read the evidence.
There’s an old SRE joke that whatever the outage is, it’s always DNS. In Kubernetes it’s true more often than anywhere else, because service discovery is DNS — every time a pod talks to a Service by name, CoreDNS resolves it. When that breaks, everything breaks at once, in ways that look like a hundred different problems.
Here’s the systematic way I debug Kubernetes DNS and service networking, so you stop guessing and start eliminating.
Know the name resolution path first
When a pod resolves api.payments.svc.cluster.local, here’s what happens:
- The pod’s /etc/resolv.conf points at the cluster DNS Service IP (usually 10.96.0.10).
- That Service routes to a CoreDNS pod.
- CoreDNS answers cluster names from its plugin chain and forwards everything else upstream.
- The returned ClusterIP is then handled by kube-proxy (iptables/IPVS) or your CNI, which load-balances to a backend pod.
Two completely different systems — DNS resolution and Service routing — and “I can’t reach the service” could be either. Your first job is figuring out which.
Split the problem: name vs. route
Exec into a pod and test both halves separately:
# Half 1: does the name resolve?
kubectl exec -it mypod -- nslookup api.payments.svc.cluster.local
# Half 2: does the IP route to a backend?
kubectl exec -it mypod -- nc -zv 10.96.45.12 80
If resolution fails, it’s a DNS problem. If resolution works but the connection hangs or refuses, it’s a Service/endpoint/network problem. This single split saves you from chasing CoreDNS when the real issue is an empty endpoints list.
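The two halves can be wrapped in one helper so the split is run the same way every time. This is a minimal sketch, not a hardened tool: split_check is a hypothetical function name, it assumes the default cluster domain cluster.local, and it assumes the pod image ships nslookup and nc and that both exit non-zero on failure.

```shell
# split_check POD NAMESPACE SERVICE PORT
# Hypothetical helper: prints dns-failure, routing-failure, or ok.
# Assumes the default cluster domain (cluster.local) and a pod image
# that ships nslookup and nc.
split_check() {
  pod=$1 ns=$2 svc=$3 port=$4

  # Half 1: does the name resolve from inside the pod?
  if ! kubectl exec "$pod" -- nslookup "$svc.$ns.svc.cluster.local" >/dev/null 2>&1; then
    echo "dns-failure"
    return 0
  fi

  # Half 2: does the ClusterIP accept a TCP connection within 3 seconds?
  cluster_ip=$(kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.clusterIP}')
  if ! kubectl exec "$pod" -- nc -z -w 3 "$cluster_ip" "$port" >/dev/null 2>&1; then
    echo "routing-failure"
    return 0
  fi

  echo "ok"
}

# Example: split_check mypod payments api 80
```

The helper only classifies. A routing-failure verdict still needs the endpoints check, and a dns-failure verdict still needs the CoreDNS check.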
DNS side: check CoreDNS
If names won’t resolve:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
Look for CrashLoops or SERVFAIL. The classic CoreDNS failures:
- CoreDNS pods crashing: often an OOMKill under high query load, or a bad Corefile edit.
- Upstream loop: "Loop ... detected" in the logs means CoreDNS is forwarding to itself, usually because the node’s resolv.conf points at a loopback stub resolver (for example systemd-resolved on 127.0.0.53).
- SERVFAIL or timeouts for external names: the forward plugin’s upstream is unreachable.
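To pull those signals out of noisy logs, a pair of small helpers can do the filtering. This is a sketch under assumptions: coredns_signals and coredns_last_exit are hypothetical names, the "i/o timeout" pattern is an assumption about how an unreachable upstream shows up in the logs, and SERVFAIL lines only appear if query logging is enabled in the Corefile.

```shell
# coredns_signals: hypothetical filter that reads CoreDNS logs on stdin and
# keeps only lines matching the failure patterns discussed above.
coredns_signals() {
  grep -E 'Loop .*detected|SERVFAIL|i/o timeout'
}

# coredns_last_exit: prints each CoreDNS pod with the reason its first
# container last terminated (for example OOMKilled). An empty second
# column means the container has not exited before.
coredns_last_exit() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
}

# Example:
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200 | coredns_signals
#   coredns_last_exit
```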
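Also check resolv.conf inside the pod. Kubernetes sets ndots:5 by default, so any name with fewer than five dots is tried against every search domain before being queried as written. For external names that means several failed lookups before the real one, a notorious source of latency in chatty services.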
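To see how many extra lookups a given name costs, the search-list expansion can be reproduced offline. The sketch below is an approximation of glibc’s ordering, not the resolver itself (musl-based images such as Alpine behave differently); expand_queries is a hypothetical helper, and the search domains in the example are the ones a pod in a namespace called payments would typically get.

```shell
# expand_queries NAME NDOTS SEARCH_DOMAIN...
# Hypothetical helper: prints the candidate names a glibc-style resolver
# would try, in order. An approximation for reasoning about ndots only.
expand_queries() {
  name=$1
  ndots=$2
  shift 2

  # A trailing dot marks the name as fully qualified: one query, no search list.
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;
  esac

  # Count the dots in the name; $((...)) strips any padding from wc.
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  dots=$((dots))

  # Enough dots: the name is tried as written before the search list.
  if [ "$dots" -ge "$ndots" ]; then
    printf '%s.\n' "$name"
  fi

  for domain in "$@"; do
    printf '%s.%s.\n' "$name" "$domain"
  done

  # Too few dots: the name as written is only tried last.
  if [ "$dots" -lt "$ndots" ]; then
    printf '%s.\n' "$name"
  fi
}

# Example:
#   expand_queries api.example.com 5 \
#     payments.svc.cluster.local svc.cluster.local cluster.local
```

With those arguments the helper prints three search-list candidates before api.example.com. itself. Writing the name with a trailing dot in application config skips the search list entirely.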
Service side: it’s almost always endpoints
If the name resolves but traffic doesn’t flow, check endpoints first, every time:
kubectl get endpoints api -n payments
Empty endpoints is the number-one cause of “the service is down.” It means no ready pods match the Service selector. Causes:
- Selector labels don’t match the pod labels (a typo).
- Pods are running but failing their readiness probe, so they’re excluded from endpoints.
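- targetPort names a port that no container declares. (A numeric targetPort the container doesn’t listen on leaves endpoints populated, but connections are refused.)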
That second one is sneaky: the pods are up, kubectl get pods looks green, but a failing readiness probe quietly keeps them out of rotation. Always check kubectl get endpoints, not just pod status.
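One way to tell a selector typo from a readiness failure is to list pods using the Service’s own selector. A minimal sketch, with hypothetical helper names service_selector and pods_behind_service:

```shell
# service_selector NAMESPACE SERVICE
# Hypothetical helper: prints the Service selector as a label query,
# for example app=api,tier=backend.
service_selector() {
  kubectl get service "$2" -n "$1" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' |
    sed 's/,$//'
}

# pods_behind_service NAMESPACE SERVICE
# Lists the pods the Service selector actually matches.
pods_behind_service() {
  selector=$(service_selector "$1" "$2")
  if [ -z "$selector" ]; then
    echo "service $2 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "$1" -l "$selector" -o wide
}

# Example: pods_behind_service payments api
```

If the list comes back empty, the selector matches nothing and the labels are the problem. If pods are listed but their READY column shows 0/1, the readiness probe is what is keeping them out of the endpoints.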
Where AI helps: correlate the layers
DNS and networking failures span resolv.conf, CoreDNS logs, the Service, endpoints, and kube-proxy — and the bug is a mismatch between layers. Gather the evidence and ask:
“A pod can’t reach the ‘api’ service. Here’s the nslookup output, the pod’s resolv.conf, the Service and its endpoints, and the CoreDNS logs. Tell me whether this is a resolution failure or a routing failure, and find the specific mismatch.”
That first classification — resolution vs. routing — is the fork that determines everything you do next, and AI nails it from the evidence quickly. Keep a few Kubernetes networking prompts handy. The model reads five logs in parallel; you decide the fix.
The deeper layers, when basics check out
If DNS resolves, endpoints are populated, and it still won’t connect:
- Network Policy: a default-deny egress policy with no DNS rule breaks resolution for every pod it selects; a missing app-to-app allow breaks routing. Check for policies selecting either pod.
- kube-proxy: if it’s down or mis-synced on a node, ClusterIPs don’t route from that node. Check kubectl logs -n kube-system -l k8s-app=kube-proxy.
- MTU mismatch: overlay networks (VXLAN) need a lower MTU; a mismatch lets small packets through and silently drops large ones, so handshakes work but data transfer hangs. Maddening and real.
- CNI plugin: pod-to-pod across nodes failing entirely points at the CNI, not the Service.
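For the Network Policy case, the missing piece is usually an egress rule that lets pods reach CoreDNS on port 53. The sketch below prints such a policy for one namespace. It rests on several assumptions: the policy name allow-dns-egress and the helper dns_egress_policy are hypothetical, CoreDNS runs in kube-system with the k8s-app=kube-dns label, the cluster sets the kubernetes.io/metadata.name label on namespaces, and the CNI enforces NetworkPolicy. Setups that use a node-local DNS cache need a different rule.

```shell
# dns_egress_policy NAMESPACE
# Hypothetical helper: prints a NetworkPolicy manifest that allows every pod
# in NAMESPACE to send DNS queries (UDP and TCP port 53) to CoreDNS.
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Example: review the manifest first, then apply it.
#   dns_egress_policy payments
#   dns_egress_policy payments | kubectl apply -f -
```

Network Policies are additive, so this can sit alongside an existing default-deny without loosening anything except DNS.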
A DNS debugging runbook
The order I work, every time:
- Reproduce from a pod, not your laptop — cluster DNS only works in-cluster.
- Split: resolve, then route. nslookup, then nc to the ClusterIP.
- DNS fails? CoreDNS pods, logs, pod resolv.conf.
- Resolves but no traffic? kubectl get endpoints first: selector and readiness.
- Endpoints fine? Network Policy, then kube-proxy, then CNI/MTU.
- Change one thing, re-test the same way.
Before networking or DNS config changes ship, I run them through the Code Review tool — it catches Service targetPort mismatches and Network Policies missing the DNS egress rule that quietly breaks the whole namespace.
It really is always DNS — until it’s an empty endpoints list pretending to be DNS. Split resolution from routing, check endpoints before you blame CoreDNS, and let AI correlate the logs while you keep a disciplined, one-change-at-a-time hand on the cluster.
AI network diagnoses are assistive. Always confirm with in-cluster tests before changing DNS or Service configuration.