Troubleshooting Kubernetes DNS and Service Networking

There’s an old SRE joke that whatever the outage is, it’s always DNS. In Kubernetes it’s true more often than anywhere else, because service discovery is DNS — every time a pod talks to a Service by name, CoreDNS resolves it. When that breaks, everything breaks at once, in ways that look like a hundred different problems.

Here’s the systematic way I debug Kubernetes DNS and service networking, so you stop guessing and start eliminating.

Know the name resolution path first

When a pod resolves api.payments.svc.cluster.local, here’s what happens:

The pod’s /etc/resolv.conf points at the cluster DNS Service IP (10.96.0.10 on a default kubeadm cluster; managed platforms often use a different address).
That Service routes to a CoreDNS pod.
CoreDNS answers cluster names from its plugin chain and forwards everything else upstream.
The returned ClusterIP is then handled by kube-proxy (iptables/IPVS) or your CNI, which load-balances to a backend pod.
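To confirm the first step on a live cluster, it helps to compare the resolver the pod is configured with against the DNS Service's actual ClusterIP. Here is a minimal sketch, assuming the DNS Service is named kube-dns in kube-system (true on kubeadm, not guaranteed elsewhere). The helper name resolv_nameservers is invented for this example.

```shell
# resolv_nameservers: print every "nameserver" address found in resolv.conf
# text on stdin, one per line. Hypothetical helper, not part of kubectl.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}

# Usage against a live cluster ("mypod" is a placeholder pod name):
#   kubectl exec mypod -- cat /etc/resolv.conf | resolv_nameservers
#   kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
```

The two addresses normally match. If they don't, something such as a custom dnsPolicy or a node-local DNS cache has changed the pod's resolver, and that is worth knowing before you go any further.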

Two completely different systems — DNS resolution and Service routing — and “I can’t reach the service” could be either. Your first job is figuring out which.

Split the problem: name vs. route

Exec into a pod and test both halves separately:

# Half 1: does the name resolve?
kubectl exec -it mypod -- nslookup api.payments.svc.cluster.local

# Half 2: does the IP route to a backend? (use the ClusterIP that Half 1 returned)
kubectl exec -it mypod -- nc -zv 10.96.45.12 80

If resolution fails, it’s a DNS problem. If resolution works but the connection hangs or is refused, it’s a Service/endpoint/network problem. This single split saves you from chasing CoreDNS when the real issue is an empty endpoints list.
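The split can be scripted so the answer comes back as a single word. This is a sketch under a few assumptions: the pod image ships nslookup and nc, both tools exit non-zero on failure (true of the common builds, but worth checking on yours), and the function names classify_failure and split_check are made up here.

```shell
# classify_failure RESOLVE_RC CONNECT_RC
# Maps the exit codes of the two probes to the branch worth investigating.
classify_failure() {
  if [ "$1" -ne 0 ]; then
    echo "dns"        # name did not resolve: CoreDNS or resolv.conf
  elif [ "$2" -ne 0 ]; then
    echo "routing"    # name resolved, connect failed: Service/endpoints/network
  else
    echo "ok"
  fi
}

# split_check POD SERVICE NAMESPACE PORT
# Runs both probes from inside POD and prints dns, routing, or ok.
split_check() {
  sc_pod=$1; sc_svc=$2; sc_ns=$3; sc_port=$4
  # Read the ClusterIP from the API so the route test does not depend on DNS.
  sc_ip=$(kubectl get svc "$sc_svc" -n "$sc_ns" -o jsonpath='{.spec.clusterIP}')
  kubectl exec "$sc_pod" -- nslookup "$sc_svc.$sc_ns.svc.cluster.local" >/dev/null 2>&1
  sc_resolve=$?
  kubectl exec "$sc_pod" -- nc -z -w 3 "$sc_ip" "$sc_port" >/dev/null 2>&1
  sc_connect=$?
  classify_failure "$sc_resolve" "$sc_connect"
}
```

With the names used in this article, the call would be split_check mypod api payments 80.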

DNS side: check CoreDNS

If names won’t resolve:

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

Look for CrashLoops or SERVFAIL. The classic CoreDNS failures:

CoreDNS pods crashing — often an OOMKill under high query load, or a bad Corefile edit.
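Upstream loop: Loop ... detected in the logs means CoreDNS is forwarding queries back to itself, usually because the node’s resolv.conf lists a loopback resolver (for example 127.0.0.53 from systemd-resolved) that CoreDNS inherits.
SERVFAIL or timeouts for external names: the forward plugin’s upstream is unreachable. An NXDOMAIN, by contrast, means an upstream did answer and said the name doesn’t exist.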
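When the logs are long, a quick count of each signature tells you which failure you are looking at. The patterns below are assumptions about CoreDNS log wording, so check them against your own logs. SERVFAIL lines in particular only appear if the Corefile enables the log or errors plugin. The helper name coredns_triage is hypothetical.

```shell
# coredns_triage: read CoreDNS log text on stdin and count the lines that
# match each failure signature. Patterns are assumptions, not a spec.
coredns_triage() {
  awk '
    /Loop .* detected/ { loop++ }       # loop plugin found self-forwarding
    /SERVFAIL/         { servfail++ }   # query answered with SERVFAIL
    /i\/o timeout/     { timeout++ }    # an upstream did not answer in time
    END {
      printf "loop=%d servfail=%d upstream_timeout=%d\n", loop, servfail, timeout
    }
  '
}

# Usage:
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=500 | coredns_triage
# For the crash case, the last termination reason shows an OOMKill directly:
#   kubectl get pods -n kube-system -l k8s-app=kube-dns \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```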

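Also check resolv.conf inside the pod. Kubernetes sets ndots:5 by default, so any name with fewer than five dots is tried against every search domain before it is queried as written. For external names that means a burst of NXDOMAIN lookups ahead of the real one, a notorious latency bug in chatty services. A trailing dot on the name (api.example.com.) skips the search list.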
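To see why ndots matters, it helps to print the lookups a resolver would attempt. This sketch models glibc-style ordering only (musl and other resolvers differ in the details), and expand_queries is a name invented for the example.

```shell
# expand_queries NAME NDOTS "SEARCH1 SEARCH2 ..."
# Prints the names a glibc-style resolver would try, in order.
expand_queries() {
  eq_name=$1; eq_ndots=$2; eq_search=$3
  case "$eq_name" in
    *.) echo "${eq_name%.}"; return 0 ;;   # trailing dot: absolute, no search
  esac
  # Count the dots by deleting every other character.
  eq_only_dots=$(printf '%s' "$eq_name" | tr -cd '.')
  eq_dots=${#eq_only_dots}
  if [ "$eq_dots" -ge "$eq_ndots" ]; then
    echo "$eq_name"                        # enough dots: as-is goes first
  fi
  for eq_domain in $eq_search; do
    echo "$eq_name.$eq_domain"
  done
  if [ "$eq_dots" -lt "$eq_ndots" ]; then
    echo "$eq_name"                        # too few dots: as-is goes last
  fi
}
```

Calling expand_queries api.example.com 5 "payments.svc.cluster.local svc.cluster.local cluster.local" prints three search-domain candidates before api.example.com itself. Note that api.payments.svc.cluster.local has only four dots, so under ndots:5 even the full Service name goes through the search list first.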

Service side: it’s almost always endpoints

If the name resolves but traffic doesn’t flow, check endpoints first, every time:

kubectl get endpoints api -n payments

Empty endpoints is the number-one cause of “the service is down.” It means no ready pods match the Service selector. Causes:

Selector labels don’t match the pod labels (a typo).
Pods are running but failing their readiness probe, so they’re excluded from endpoints.
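targetPort names a port the pod doesn’t define. (A numeric targetPort that’s simply wrong still populates endpoints; the symptom there is connection refused, not an empty list.)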

That second one is sneaky: the pods are up and kubectl get pods shows STATUS Running, but READY reads 0/1 and the failing readiness probe quietly keeps them out of rotation. Always check kubectl get endpoints, not just pod status.
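A selector typo is easy to miss by eye, so I sometimes compare the two label sets mechanically. This is a rough sketch: selector_matches is a hypothetical helper, and it only understands simple key=value pairs, the kind a Service selector uses.

```shell
# selector_matches "k1=v1,k2=v2" "podlabel1=a,podlabel2=b,..."
# Succeeds if every selector pair appears among the pod labels; otherwise
# prints the first missing pair and fails. The body runs in a subshell so
# the IFS change does not leak into the caller.
selector_matches() (
  sm_labels=",$2,"
  set -f        # no filename globbing while splitting
  IFS=,
  for sm_pair in $1; do
    case "$sm_labels" in
      *",$sm_pair,"*) ;;
      *) echo "missing: $sm_pair"; exit 1 ;;
    esac
  done
  exit 0
)

# Where the inputs come from (names from this article, used as placeholders):
#   kubectl get svc api -n payments -o wide          # SELECTOR column
#   kubectl get pods -n payments --show-labels       # LABELS column
# To list the pods the Service actually selects, with their READY state:
#   kubectl get pods -n payments -l app=api          # assumes selector app=api
```

If the selector matches and the endpoints are still empty, the READY column in that last command is the next place to look.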

Where AI helps: correlate the layers

DNS and networking failures span resolv.conf, CoreDNS logs, the Service, endpoints, and kube-proxy — and the bug is a mismatch between layers. Gather the evidence and ask:

“A pod can’t reach the ‘api’ service. Here’s the nslookup output, the pod’s resolv.conf, the Service and its endpoints, and the CoreDNS logs. Tell me whether this is a resolution failure or a routing failure, and find the specific mismatch.”

That first classification — resolution vs. routing — is the fork that determines everything you do next, and AI nails it from the evidence quickly. Keep a few Kubernetes networking prompts handy. The model reads five logs in parallel; you decide the fix.
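Collecting the evidence for that prompt is repetitive enough to script. This is a sketch: gather_evidence is a hypothetical helper, the CoreDNS label assumes a kubeadm-style install, and the pod is assumed to have nslookup available.

```shell
# gather_evidence POD SERVICE NAMESPACE
# Prints one labeled section per evidence source to stdout.
gather_evidence() {
  ge_pod=$1; ge_svc=$2; ge_ns=$3
  echo "## nslookup from $ge_pod"
  kubectl exec "$ge_pod" -- nslookup "$ge_svc.$ge_ns.svc.cluster.local" 2>&1
  echo "## pod resolv.conf"
  kubectl exec "$ge_pod" -- cat /etc/resolv.conf 2>&1
  echo "## Service"
  kubectl get svc "$ge_svc" -n "$ge_ns" -o yaml 2>&1
  echo "## Endpoints"
  kubectl get endpoints "$ge_svc" -n "$ge_ns" -o yaml 2>&1
  echo "## CoreDNS logs"
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100 2>&1
}
```

Running gather_evidence mypod api payments > evidence.txt gives you one file to paste alongside the prompt.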

The deeper layers, when basics check out

If DNS resolves, endpoints are populated, and it still won’t connect:

Network Policy: a default-deny egress policy with no DNS rule breaks resolution for every pod it selects, often a whole namespace; a missing app-to-app allow breaks routing. Check for policies selecting either pod.
kube-proxy — if it’s down or mis-synced on a node, ClusterIPs don’t route. kubectl logs -n kube-system -l k8s-app=kube-proxy.
MTU mismatch — overlay networks (VXLAN) need a lower MTU; a mismatch lets small packets through and silently drops large ones, so handshakes work but data transfer hangs. Maddening and real.
CNI plugin — pod-to-pod across nodes failing entirely points at the CNI, not the Service.
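The MTU case comes down to arithmetic, so here it is worked through. The 50-byte figure assumes VXLAN over an IPv4 underlay. Other encapsulations have different overheads, which is why the helper takes it as an optional argument. Both function names are invented for this sketch.

```shell
# overlay_mtu UNDERLAY_MTU [OVERHEAD]
# Pod-interface MTU that still fits in the underlay once encapsulated.
# OVERHEAD defaults to 50 bytes for VXLAN over IPv4:
# 20 outer IP + 8 UDP + 8 VXLAN + 14 inner Ethernet.
overlay_mtu() {
  echo $(( $1 - ${2:-50} ))
}

# max_ping_payload MTU
# Largest ICMP echo payload that fits in a single IPv4 packet of size MTU
# (20-byte IP header + 8-byte ICMP header).
max_ping_payload() {
  echo $(( $1 - 28 ))
}
```

On a 1500-byte underlay, overlay_mtu 1500 gives 1450 and max_ping_payload 1450 gives 1422. With iputils ping on Linux, ping -M do -s 1422 to a pod on another node should succeed, and a payload one byte larger should fail if the overlay MTU really is 1450. If the smaller one also fails, the real path MTU is lower than the pods believe.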

A DNS debugging runbook

The order I work, every time:

Reproduce from a pod, not your laptop — cluster DNS only works in-cluster.
Split: resolve, then route. nslookup, then nc to the ClusterIP.
DNS fails? CoreDNS pods, logs, pod resolv.conf.
Resolves but no traffic? kubectl get endpoints first — selector and readiness.
Endpoints fine? Network Policy, then kube-proxy, then CNI/MTU.
Change one thing, re-test the same way.

Before networking or DNS config changes ship, I run them through the Code Review tool — it catches Service targetPort mismatches and Network Policies missing the DNS egress rule that quietly breaks the whole namespace.
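For reference, here is roughly what that DNS egress rule looks like, wrapped in a shell function so the namespace can be swapped in. Treat it as a sketch: the policy name allow-dns-egress is a placeholder, the k8s-app: kube-dns label assumes a kubeadm-style CoreDNS install, and clusters running a node-local DNS cache need a different destination.

```shell
# dns_egress_policy NAMESPACE
# Prints a NetworkPolicy manifest that lets every pod in NAMESPACE reach
# the cluster DNS pods on port 53 over UDP and TCP.
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Usage (client-side dry run first, then apply for real once it looks right):
#   dns_egress_policy payments | kubectl apply --dry-run=client -f -
```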

It really is always DNS — until it’s an empty endpoints list pretending to be DNS. Split resolution from routing, check endpoints before you blame CoreDNS, and let AI correlate the logs while you keep a disciplined, one-change-at-a-time hand on the cluster.

AI network diagnoses are assistive. Always confirm with in-cluster tests before changing DNS or Service configuration.