DevOps AI ToolKit
AI for Kubernetes & Helm · By James Joyner IV · 9 min read

Kubernetes Error Guide: 'SERVFAIL' from CoreDNS Resolution Failure

Fix CoreDNS SERVFAIL in Kubernetes: broken upstream resolvers, the loop plugin, and forward misconfiguration. Distinct from NXDOMAIN name-not-found errors.

  • #kubernetes-helm
  • #troubleshooting
  • #errors
  • #dns

Exact Error Message

When CoreDNS cannot answer a query because of a failure (not because the name does not exist), it returns SERVFAIL. Applications and dig surface it like this:

$ dig api.example.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 41523

$ nslookup api.example.com
;; Got SERVFAIL reply from 10.96.0.10, trying next server
** server can't find api.example.com: SERVFAIL

In application logs it appears as resolver failures:

dial tcp: lookup api.example.com on 10.96.0.10:53: server misbehaving

CoreDNS itself often logs the upstream failure:

[ERROR] plugin/errors: 2 api.example.com. A: read udp 10.244.0.5:43210->8.8.8.8:53: i/o timeout

SERVFAIL is a server failure response code (RCODE 2), distinct from NXDOMAIN (RCODE 3, “this name definitively does not exist”).
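
As a minimal sketch of telling the two codes apart in a script: full dig output (without +short) carries the response code in the status: field of its header line, so it can be extracted and mapped to a next step. The helper names dig_status and rcode_hint are hypothetical, not part of dig or kubectl.

```shell
# dig_status and rcode_hint are hypothetical helpers for illustration.

# Print the RCODE name from full `dig` output read on stdin.
# Relies on the header line:
#   ;; ->>HEADER<<- opcode: QUERY, status: <RCODE>, id: <n>
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Map an RCODE name to the triage direction described in this guide.
rcode_hint() {
  case "$1" in
    NOERROR)  echo "resolved" ;;
    NXDOMAIN) echo "name does not exist: fix the name or Service" ;;
    SERVFAIL) echo "resolver failure: check CoreDNS and its upstream" ;;
    *)        echo "unexpected status: $1" ;;
  esac
}

# Usage from a pod that has dig installed:
#   rcode_hint "$(dig api.example.com | dig_status)"
```

Parsing the header rather than the answer section keeps the check independent of record type and of whether any answer was returned.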

What the Error Means

DNS responses carry a result code. NXDOMAIN is an authoritative “no such name” — resolution worked, the name simply does not exist. SERVFAIL is different: the resolver tried to answer and failed — it could not reach an upstream, hit a timeout, detected a loop, or got a broken response. The name might be perfectly valid; CoreDNS just could not complete the lookup.

In a cluster, pods send queries to the CoreDNS Service (usually 10.96.0.10 / kube-dns). CoreDNS resolves in-cluster names (*.svc.cluster.local) itself, and forwards external names to upstream resolvers (from the node’s /etc/resolv.conf or an explicit forward block). A SERVFAIL for an external name almost always means the forward path is broken — the upstream is down, unreachable, slow, or CoreDNS is configured to forward to itself and detects a loop. For in-cluster names, SERVFAIL points at CoreDNS health, plugin errors, or backend (kube-apiserver) problems. The key triage step is whether the failing name is internal or external.
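
The internal-versus-external split can be sketched as two small shell functions. This is an illustration under stated assumptions, not a complete classifier: it assumes the default cluster domain cluster.local, only handles fully qualified names, and the names name_scope and triage are hypothetical.

```shell
# name_scope and triage are hypothetical helpers for illustration.

# Classify a fully qualified name, assuming the default cluster
# domain "cluster.local" (adjust if your cluster uses another one).
name_scope() {
  case "${1%.}" in
    cluster.local|*.cluster.local) echo "internal" ;;
    *)                             echo "external" ;;
  esac
}

# triage <internal_status> <external_status>
# Takes the dig status of one in-cluster and one external lookup.
triage() {
  if [ "$1" = "SERVFAIL" ] && [ "$2" = "SERVFAIL" ]; then
    echo "both fail: check CoreDNS health"
  elif [ "$2" = "SERVFAIL" ]; then
    echo "external only: check the forward path and upstream"
  elif [ "$1" = "SERVFAIL" ]; then
    echo "internal only: check CoreDNS plugins and kube-apiserver access"
  else
    echo "no SERVFAIL observed"
  fi
}
```

For example, triage NOERROR SERVFAIL prints the "external only" hint, which matches the most common case described above.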

Common Causes

  • Upstream resolver down/unreachable: the forwarder (node DNS, 8.8.8.8, corporate DNS) is unreachable or timing out.
  • Forward loop: CoreDNS forwards to a resolver that points back at CoreDNS. The loop plugin detects this at startup and halts the process, so the pods crashloop instead of serving; without the plugin, looping queries would recurse and fail with SERVFAIL.
  • Bad forward config: the Corefile forward . <addr> targets a wrong or dead address.
  • Node /etc/resolv.conf broken: CoreDNS inherits a bad upstream from the node when using forward . /etc/resolv.conf.
  • CoreDNS overloaded/crashlooping: insufficient replicas or memory under query load returns SERVFAIL.
  • NetworkPolicy/firewall: egress to upstream DNS (UDP/TCP 53) is blocked.
  • DNSSEC/validation failure: a broken DNSSEC chain upstream yields SERVFAIL.

How to Reproduce the Error

Point CoreDNS at a dead upstream resolver and query an external name:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . 192.0.2.1   # TEST-NET, no DNS server here
        cache 30
        loop
        reload
    }

Apply the ConfigMap, restart CoreDNS, and query an external name from a test pod:

$ kubectl -n kube-system rollout restart deployment coredns
$ kubectl run dnstest --rm -it --image=busybox -- nslookup example.com
Server:    10.96.0.10
;; connection timed out; no servers could be reached
*** Can't find example.com: No answer

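Forwarding to a black-holed address (192.0.2.1) makes every external lookup time out inside CoreDNS, which then answers SERVFAIL. A client with a short timeout of its own, like the nslookup run above, may give up and report a timeout before that SERVFAIL arrives.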

Diagnostic Commands

# Confirm SERVFAIL vs NXDOMAIN for the failing name from inside the cluster
kubectl run dnstest --rm -it --image=tutum/dnsutils -- dig api.example.com

# Distinguish internal vs external: in-cluster name should resolve
kubectl run dnstest --rm -it --image=tutum/dnsutils -- dig kubernetes.default.svc.cluster.local

# Inspect the active CoreDNS Corefile (forward target, loop, cache)
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'

# Read CoreDNS logs for upstream errors and loop detection
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100 | grep -Ei 'error|loop|timeout'

# Are CoreDNS pods healthy and is the Service backed by endpoints?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system get endpoints kube-dns

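# What upstream does the node hand to CoreDNS? CoreDNS typically runs with
# dnsPolicy: Default, so a throwaway pod with the same policy shows the file it inherits
# (the pod may land on a different node than CoreDNS, so compare nodes if they differ)
kubectl run resolvcheck --rm -it --restart=Never --image=busybox --overrides='{"apiVersion":"v1","spec":{"dnsPolicy":"Default"}}' -- cat /etc/resolv.conf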

Comparing an internal lookup (should succeed) against the external one that SERVFAILs immediately isolates the broken forward path.
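
To pull just the upstream addresses out of the Corefile printed by the command above, a small filter is enough. This is a sketch, assuming the usual one-line forward FROM TO... form; forward_targets is a hypothetical helper name.

```shell
# forward_targets is a hypothetical helper for illustration.
# Reads a Corefile on stdin and prints each upstream listed on a
# "forward FROM TO..." line, stopping at "{" or a trailing comment.
forward_targets() {
  awk '$1 == "forward" {
    for (i = 3; i <= NF; i++) {
      if ($i == "{" || substr($i, 1, 1) == "#") break
      print $i
    }
  }'
}

# Usage against the live ConfigMap:
#   kubectl -n kube-system get configmap coredns \
#     -o jsonpath='{.data.Corefile}' | forward_targets
```

If the output is /etc/resolv.conf rather than an IP address, the real upstream is whatever the node hands to the CoreDNS pod, so inspect that file next.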

Step-by-Step Resolution

1. Confirm it is SERVFAIL, not NXDOMAIN. Run dig <name> from a pod and read the status: line. SERVFAIL = resolver failure (this guide). NXDOMAIN = the name does not exist (fix the name/Service, not CoreDNS).

2. Isolate internal vs external. Query an in-cluster name (kubernetes.default.svc.cluster.local) and an external one. If internal works but external SERVFAILs, the forward/upstream path is broken. If both fail, CoreDNS itself is unhealthy.

3. Check the forward target. Read the Corefile’s forward . line and verify that the upstream address is reachable on UDP/TCP 53. If it forwards to /etc/resolv.conf, inspect the node’s resolver. Replace a dead upstream with a known-good resolver, then run kubectl -n kube-system rollout restart deployment coredns.

4. Rule out a forward loop. If CoreDNS logs Loop ... detected or crashloops at startup, it is forwarding to a resolver that points back at CoreDNS (common with forward . /etc/resolv.conf when the node uses a local stub). Point the forward at an explicit upstream (e.g., the real corporate DNS) instead of the loopback stub. Keep the loop plugin enabled — it is protecting you.

5. Verify CoreDNS health and capacity. Ensure pods are Running, kube-dns has endpoints, and there are enough replicas for the query load. Scale CoreDNS or raise memory if it is OOMKilled or throttled.

6. Check egress policy. A NetworkPolicy or node firewall blocking UDP/TCP 53 to the upstream produces SERVFAIL. Allow DNS egress from the CoreDNS pods to the upstream resolvers.
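
For step 4, one quick check is whether the resolv.conf that CoreDNS inherits lists a loopback nameserver, such as the 127.0.0.53 stub used by systemd-resolved. Inside the CoreDNS pod a loopback address refers to the pod itself, so forward . /etc/resolv.conf would send queries straight back to CoreDNS. A minimal sketch, with loopback_nameservers as a hypothetical helper name:

```shell
# loopback_nameservers is a hypothetical helper for illustration.
# Reads resolv.conf content on stdin and prints every nameserver
# that is a loopback address (127.0.0.0/8 or ::1).
loopback_nameservers() {
  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }'
}

# Usage on a node:
#   loopback_nameservers < /etc/resolv.conf
```

Any output is a hint that the forward target should be an explicit upstream instead; empty output means this particular loop cause is unlikely.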

Prevention and Best Practices

  • Forward to explicit, reachable upstream resolvers rather than relying solely on /etc/resolv.conf to avoid loop and inheritance surprises.
  • Keep the loop plugin enabled; it converts silent recursion into a fast, visible failure.
  • Run at least two CoreDNS replicas with a PodDisruptionBudget and right-sized memory so query spikes do not cause SERVFAIL.
  • Enable the cache plugin to absorb upstream blips and reduce load on forwarders.
  • Monitor CoreDNS metrics (coredns_dns_responses_total{rcode="SERVFAIL"}) and alert on rising SERVFAIL rates.
  • Allow DNS egress (UDP/TCP 53) explicitly in NetworkPolicies that default-deny.
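
As a rough spot check of the SERVFAIL metric, the share of SERVFAIL responses can be computed from the Prometheus text that CoreDNS exposes. This sketch assumes the prometheus plugin is enabled in the Corefile and that the metrics text is piped in; servfail_share is a hypothetical helper name. The counters are cumulative since the process started, so the result is a lifetime share, not a rate; alerting on a rising rate is better done in the monitoring system itself.

```shell
# servfail_share is a hypothetical helper for illustration.
# Reads Prometheus text exposition on stdin and prints the percentage
# of coredns_dns_responses_total samples that carry rcode="SERVFAIL".
servfail_share() {
  awk '
    index($0, "coredns_dns_responses_total{") == 1 {
      total += $NF
      if (index($0, "rcode=\"SERVFAIL\"") > 0) fail += $NF
    }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * fail / total
    }
  '
}

# Usage, assuming the metrics port 9153 is reachable locally
# (for example through kubectl port-forward):
#   curl -s http://localhost:9153/metrics | servfail_share
```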

Frequently Asked Questions

What is the difference between SERVFAIL and NXDOMAIN? NXDOMAIN is authoritative: the name definitively does not exist, and resolution itself succeeded. SERVFAIL means resolution failed — an upstream was unreachable, timed out, looped, or returned a broken answer. A valid name can SERVFAIL when the forward path is broken.

Internal names resolve but external ones SERVFAIL — what is wrong? CoreDNS resolves in-cluster names itself but forwards external ones. A working internal lookup with failing external lookups isolates the problem to the forward path: a dead upstream, a blocked egress, or a misconfigured forward target.

CoreDNS crashloops with Loop ... detected at startup. What does that mean? The loop plugin found that CoreDNS forwards to a resolver that ultimately points back at CoreDNS (often a local stub in the node’s /etc/resolv.conf). Point the forward at an explicit upstream resolver instead of the loopback stub.

Should I disable the loop plugin to stop the crash? No. The loop is a real misconfiguration that would otherwise cause infinite recursion and SERVFAILs. Fix the forward target; the loop plugin is correctly preventing a worse failure.

Can a NetworkPolicy cause SERVFAIL? Yes. If a default-deny policy blocks the CoreDNS pods’ egress on UDP/TCP 53 to the upstream resolvers, forwarded queries time out and return SERVFAIL. Add an explicit DNS egress allow rule.
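
An explicit allow rule could look like the manifest printed by the sketch below. The policy name, the helper name dns_egress_policy, and the example address 10.0.0.53 are all hypothetical; the k8s-app: kube-dns label is the one used by the diagnostic commands in this guide. NetworkPolicies are additive, so this rule is meant to sit alongside existing allow rules. If no other policy selects the CoreDNS pods, applying it alone would also restrict their remaining egress, including access to the kube-apiserver.

```shell
# dns_egress_policy is a hypothetical helper for illustration.
# Prints a NetworkPolicy that allows the CoreDNS pods to reach one
# upstream resolver on UDP and TCP port 53.
# Usage: dns_egress_policy <upstream-ip>
dns_egress_policy() {
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-coredns-upstream-dns
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: $1/32
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}

# Review the manifest first, then apply it:
#   dns_egress_policy 10.0.0.53 | kubectl apply -f -
```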
