AI for Kubernetes & Helm · By James Joyner IV · 8 min read

Troubleshooting Kubernetes DNS and Service Networking

It's always DNS. Here's a systematic way to debug Kubernetes service discovery and networking failures, from CoreDNS to kube-proxy, with AI to read the evidence.

  • #kubernetes
  • #dns
  • #networking
  • #coredns
  • #ai
  • #troubleshooting

There’s an old SRE joke that whatever the outage is, it’s always DNS. In Kubernetes it’s true more often than anywhere else, because service discovery is DNS — every time a pod talks to a Service by name, CoreDNS resolves it. When that breaks, everything breaks at once, in ways that look like a hundred different problems.

Here’s the systematic way I debug Kubernetes DNS and service networking, so you stop guessing and start eliminating.

Know the name resolution path first

When a pod resolves api.payments.svc.cluster.local, here’s what happens:

  1. The pod’s /etc/resolv.conf points at the cluster DNS Service IP (10.96.0.10 on a default kubeadm cluster; other distributions and managed platforms use different addresses).
  2. That Service routes to a CoreDNS pod.
  3. CoreDNS answers cluster names from its plugin chain and forwards everything else upstream.
  4. The returned ClusterIP is then handled by kube-proxy (iptables/IPVS) or your CNI, which load-balances to a backend pod.
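To see step 1 on a real pod, read its resolver config and compare the nameserver with the cluster DNS Service IP. A minimal sketch: dns_server_of is a helper name I made up for illustration, mypod is a placeholder, and I'm assuming the DNS Service is called kube-dns in kube-system, which is common but not guaranteed.

```shell
# dns_server_of: print the first nameserver found in resolv.conf text on stdin.
# (Illustrative helper, not part of kubectl.)
dns_server_of() {
  awk '$1 == "nameserver" { print $2; exit }'
}

# Typical use against a running pod (mypod is a placeholder):
#   kubectl exec mypod -- cat /etc/resolv.conf | dns_server_of
# Compare with the cluster DNS Service IP (assumes the Service is named kube-dns):
#   kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
```

If the two addresses differ, the pod is not using cluster DNS at all, and nothing further down the path matters yet.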

Two completely different systems — DNS resolution and Service routing — and “I can’t reach the service” could be either. Your first job is figuring out which.

Split the problem: name vs. route

Exec into a pod and test both halves separately:

# Half 1: does the name resolve?
kubectl exec -it mypod -- nslookup api.payments.svc.cluster.local

# Half 2: does the IP route to a backend?
kubectl exec -it mypod -- nc -zv 10.96.45.12 80

If resolution fails, it’s a DNS problem. If resolution works but the connection hangs or is refused, it’s a Service/endpoint/network problem. This single split saves you from chasing CoreDNS when the real issue is an empty endpoints list.
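The split can be written down as a tiny decision function. This is a sketch, not a tool: classify_failure is a hypothetical name, and it assumes the nslookup and nc in your image exit non-zero on failure, which is worth confirming for the image you use.

```shell
# classify_failure RESOLVE_RC CONNECT_RC
# Maps the exit codes of the two tests to the branch to follow next.
classify_failure() {
  resolve_rc=$1
  connect_rc=$2
  if [ "$resolve_rc" -ne 0 ]; then
    echo "dns"        # name did not resolve: look at CoreDNS
  elif [ "$connect_rc" -ne 0 ]; then
    echo "routing"    # name resolved, connection failed: look at endpoints
  else
    echo "ok"
  fi
}

# Example wiring (pod name, Service name, IP and port are placeholders):
#   kubectl exec mypod -- nslookup api.payments.svc.cluster.local >/dev/null 2>&1; r=$?
#   kubectl exec mypod -- nc -z -w 3 10.96.45.12 80 >/dev/null 2>&1; c=$?
#   classify_failure "$r" "$c"
```

The -w 3 timeout keeps a silently dropped connection from hanging the check.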

DNS side: check CoreDNS

If names won’t resolve:

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

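Look for pods in CrashLoopBackOff and for SERVFAIL responses in the logs. The classic CoreDNS failures:

  • CoreDNS pods crashing: often an OOMKill under high query load, or a bad Corefile edit.
  • Upstream loop: Loop ... detected in the logs means CoreDNS is forwarding queries back to itself, usually because the node’s resolv.conf points at a loopback address (such as the systemd-resolved stub at 127.0.0.53) or back at the cluster DNS.
  • SERVFAIL or timeouts for external names: the forward plugin’s upstream is unreachable. An NXDOMAIN, by contrast, means an upstream answered and said the name does not exist.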

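Also check resolv.conf inside the pod. Kubernetes sets ndots:5 by default, so any name with fewer than five dots is tried against every search domain before it is tried as written. For external names that means several failed lookups before the real one, a notorious source of latency in chatty services.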
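To make the ndots behavior concrete, here is a small simulation of the order in which a glibc-style resolver tries candidate names. expand_queries is a hypothetical helper, and the search list is what I would expect for a pod in the payments namespace. Other resolvers (musl, for one) differ in the details, so treat this as a model rather than a spec.

```shell
# expand_queries NAME NDOTS SEARCH_DOMAIN...
# Prints, in order, the candidate names a glibc-style resolver would try.
# A real resolver stops at the first candidate that resolves.
expand_queries() {
  name=$1; ndots=$2; shift 2
  case $name in
    *.) echo "$name"; return ;;     # trailing dot: absolute name, no search list
  esac
  # Count the dots in the name (fields separated by "." minus one).
  dots=$(printf '%s\n' "$name" | awk -F. '{ print NF - 1 }')
  if [ "$dots" -ge "$ndots" ]; then
    echo "$name."                   # enough dots: try as written first
    for d in "$@"; do echo "$name.$d."; done
  else
    for d in "$@"; do echo "$name.$d."; done
    echo "$name."                   # as written only after the search list
  fi
}

# expand_queries api.example.com 5 payments.svc.cluster.local svc.cluster.local cluster.local
# prints:
#   api.example.com.payments.svc.cluster.local.
#   api.example.com.svc.cluster.local.
#   api.example.com.cluster.local.
#   api.example.com.
```

Three wasted queries per lookup for a two-dot external name. Writing the name with a trailing dot, or lowering ndots through the pod’s dnsConfig options, skips the search list.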

Service side: it’s almost always endpoints

If the name resolves but traffic doesn’t flow, check endpoints first, every time:

kubectl get endpoints api -n payments

Empty endpoints is the number-one cause of “the service is down.” It means no ready pods match the Service selector. Causes:

  • Selector labels don’t match the pod labels (a typo).
  • Pods are running but failing their readiness probe, so they’re excluded from endpoints.
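  • targetPort names a port that no container declares. (A numeric targetPort the container doesn’t listen on leaves endpoints populated and shows up as connection refused instead.)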

That second one is sneaky: the pods are up and kubectl get pods shows STATUS Running, but READY reads 0/1 and the failing readiness probe quietly keeps them out of rotation. Always check kubectl get endpoints, not just pod status.
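A quick way to tell those causes apart is to print the Service selector next to the ready and not-ready endpoint addresses. A rough sketch, where why_no_endpoints is a name I invented and api and payments are the placeholders used above:

```shell
# why_no_endpoints SERVICE NAMESPACE
# Prints the Service selector, then the ready and not-ready endpoint IPs.
why_no_endpoints() {
  svc=$1; ns=$2
  echo "selector:"
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}'; echo
  echo "ready addresses:"
  kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}'; echo
  echo "not-ready addresses:"
  kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].notReadyAddresses[*].ip}'; echo
}

# Example (placeholders):
#   why_no_endpoints api payments
```

Both address lists empty suggests the selector matches no pods, so compare it with kubectl get pods -n payments --show-labels. IPs under not-ready addresses mean the pods match but are failing readiness.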

Where AI helps: correlate the layers

DNS and networking failures span resolv.conf, CoreDNS logs, the Service, endpoints, and kube-proxy — and the bug is a mismatch between layers. Gather the evidence and ask:

“A pod can’t reach the ‘api’ service. Here’s the nslookup output, the pod’s resolv.conf, the Service and its endpoints, and the CoreDNS logs. Tell me whether this is a resolution failure or a routing failure, and find the specific mismatch.”

That first classification — resolution vs. routing — is the fork that determines everything you do next, and AI nails it from the evidence quickly. Keep a few Kubernetes networking prompts handy. The model reads five logs in parallel; you decide the fix.
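Gathering that evidence by hand gets tedious, so I'd script it. The function below is a sketch under assumptions: collect_evidence is a hypothetical name, the pod lives in your current namespace (as in the exec commands earlier), and CoreDNS carries the k8s-app=kube-dns label used above.

```shell
# collect_evidence POD SERVICE NAMESPACE FQDN
# Prints every layer's evidence under a heading, ready to paste into a prompt.
collect_evidence() {
  pod=$1; svc=$2; ns=$3; fqdn=$4
  echo "## nslookup $fqdn (from $pod)"
  kubectl exec "$pod" -- nslookup "$fqdn" 2>&1
  echo "## resolv.conf ($pod)"
  kubectl exec "$pod" -- cat /etc/resolv.conf 2>&1
  echo "## service $svc"
  kubectl get svc "$svc" -n "$ns" -o yaml 2>&1
  echo "## endpoints $svc"
  kubectl get endpoints "$svc" -n "$ns" -o yaml 2>&1
  echo "## coredns logs"
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50 2>&1
}

# Example (placeholders):
#   collect_evidence mypod api payments api.payments.svc.cluster.local > evidence.txt
```

Errors are captured inline (2>&1) on purpose: a failed nslookup is itself part of the evidence.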

The deeper layers, when basics check out

If DNS resolves, endpoints are populated, and it still won’t connect:

  • Network Policy: a default-deny egress policy with no DNS rule breaks resolution for every pod it selects, often a whole namespace; a missing app-to-app allow breaks routing. Check for policies selecting either pod.
  • kube-proxy: if it’s down or mis-synced on a node, ClusterIPs don’t route from that node. Check kubectl logs -n kube-system -l k8s-app=kube-proxy.
  • MTU mismatch: overlay networks need a lower pod MTU than the node’s (VXLAN adds roughly 50 bytes of headers); a mismatch lets small packets through and silently drops large ones, so handshakes work but data transfer hangs. Maddening and real.
  • CNI plugin: pod-to-pod traffic failing entirely across nodes points at the CNI, not the Service.
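For the MTU case, one way to probe is a ping with the don't-fragment bit set and a payload sized to exactly fill the MTU you expect. The arithmetic is the only part I'd call certain: an IPv4 header is 20 bytes and an ICMP header is 8, so the payload is MTU minus 28. The helper name, pod name and target IP are placeholders, and -M do is an iputils ping flag that BusyBox ping doesn't have.

```shell
# icmp_payload_for_mtu MTU
# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Probe across nodes with fragmentation forbidden (Linux iputils ping assumed;
# mypod and 10.244.2.15 are placeholders for a pod and a pod IP on another node):
#   kubectl exec mypod -- ping -c 3 -M do -s "$(icmp_payload_for_mtu 1450)" 10.244.2.15
```

If the probe sized for the pod interface's MTU fails while a smaller one succeeds, the path MTU is lower than the pods think it is.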

A DNS debugging runbook

The order I work, every time:

  1. Reproduce from a pod, not your laptop — cluster DNS only works in-cluster.
  2. Split: resolve, then route. nslookup, then nc to the ClusterIP.
  3. DNS fails? CoreDNS pods, logs, pod resolv.conf.
  4. Resolves but no traffic? kubectl get endpoints first — selector and readiness.
  5. Endpoints fine? Network Policy, then kube-proxy, then CNI/MTU.
  6. Change one thing, re-test the same way.

Before networking or DNS config changes ship, I run them through the Code Review tool — it catches Service targetPort mismatches and Network Policies missing the DNS egress rule that quietly breaks the whole namespace.
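For reference, the DNS egress rule that tends to go missing looks roughly like this. It's a sketch under assumptions: the policy name allow-dns-egress is made up, and the selectors assume CoreDNS runs in kube-system with the k8s-app=kube-dns label. Clusters using NodeLocal DNSCache or a different DNS deployment need different selectors.

```shell
# dns_egress_policy NAMESPACE
# Prints a NetworkPolicy that lets every pod in NAMESPACE reach cluster DNS.
# Assumes CoreDNS pods are labeled k8s-app=kube-dns in kube-system.
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: $1
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Validate without changing the cluster (payments is a placeholder):
#   dns_egress_policy payments | kubectl apply --dry-run=client -f -
```

Both UDP and TCP are listed because DNS falls back to TCP for large responses.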

It really is always DNS — until it’s an empty endpoints list pretending to be DNS. Split resolution from routing, check endpoints before you blame CoreDNS, and let AI correlate the logs while you keep a disciplined, one-change-at-a-time hand on the cluster.

AI network diagnoses are assistive. Always confirm with in-cluster tests before changing DNS or Service configuration.
