Kubernetes Error Guide: 'SERVFAIL' from CoreDNS Resolution Failure
Fix CoreDNS SERVFAIL in Kubernetes: broken upstream resolvers, the loop plugin, and forward misconfiguration. Distinct from NXDOMAIN name-not-found errors.
Exact Error Message
When CoreDNS cannot answer a query because of a failure (not because the name does not exist), it returns SERVFAIL. Applications and dig surface it like this:
$ dig api.example.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345
$ nslookup api.example.com
;; Got SERVFAIL reply from 10.96.0.10, trying next server
server can't find api.example.com: SERVFAIL
In application logs it appears as resolver failures:
dial tcp: lookup api.example.com on 10.96.0.10:53: server misbehaving
CoreDNS itself often logs the upstream failure:
[ERROR] plugin/errors: 2 api.example.com. A: read udp 10.244.0.5:43210->8.8.8.8:53: i/o timeout
SERVFAIL is a server failure response code (RCODE 2), distinct from NXDOMAIN (RCODE 3, “this name definitively does not exist”).
What the Error Means
DNS responses carry a result code. NXDOMAIN is an authoritative “no such name” — resolution worked, the name simply does not exist. SERVFAIL is different: the resolver tried to answer and failed — it could not reach an upstream, hit a timeout, detected a loop, or got a broken response. The name might be perfectly valid; CoreDNS just could not complete the lookup.
In a cluster, pods send queries to the CoreDNS Service (usually 10.96.0.10 / kube-dns). CoreDNS resolves in-cluster names (*.svc.cluster.local) itself, and forwards external names to upstream resolvers (from the node’s /etc/resolv.conf or an explicit forward block). A SERVFAIL for an external name almost always means the forward path is broken — the upstream is down, unreachable, slow, or CoreDNS is configured to forward to itself and detects a loop. For in-cluster names, SERVFAIL points at CoreDNS health, plugin errors, or backend (kube-apiserver) problems. The key triage step is whether the failing name is internal or external.
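The internal-versus-external split can be sketched as a small shell helper. This is a minimal illustration, not part of CoreDNS: the function name `name_scope` is hypothetical, and it assumes the cluster domain is `cluster.local`. It only inspects fully qualified names, so a short name such as `kubernetes.default` that relies on the pod's search path would be reported as external.

```shell
# Assumption: the cluster domain is "cluster.local" (override via CLUSTER_DOMAIN).
CLUSTER_DOMAIN="${CLUSTER_DOMAIN:-cluster.local}"

# name_scope is a hypothetical helper: prints "internal" for names under the
# cluster domain (answered by CoreDNS itself) and "external" for everything
# else (forwarded to an upstream resolver).
name_scope() {
  # Strip one trailing dot so "foo.svc.cluster.local." is handled too.
  name="${1%.}"
  case "$name" in
    "$CLUSTER_DOMAIN"|*".$CLUSTER_DOMAIN") echo internal ;;
    *) echo external ;;
  esac
}
```

For example, `name_scope api.example.com` prints `external`, which points the investigation at the forward path rather than at the kubernetes plugin.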
Common Causes
- Upstream resolver down/unreachable: the forwarder (node DNS, 8.8.8.8, corporate DNS) is unreachable or timing out.
- Forward loop: CoreDNS forwards to a resolver that points back at CoreDNS; the loop plugin detects this and halts CoreDNS, so the pods crashloop at startup and clients see timeouts or SERVFAIL.
- Bad forward config: the Corefile forward . <addr> targets a wrong or dead address.
- Node /etc/resolv.conf broken: CoreDNS inherits a bad upstream from the node when using forward . /etc/resolv.conf.
- CoreDNS overloaded/crashlooping: too few replicas or too little memory for the query load results in SERVFAIL.
- NetworkPolicy/firewall: egress to upstream DNS (UDP/TCP 53) is blocked.
- DNSSEC/validation failure: a broken DNSSEC chain upstream yields SERVFAIL.
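Several of the causes above come down to what the forward directive actually targets. The following is a rough sketch for pulling those targets out of a Corefile and flagging the suspicious ones. The helper names `forward_targets` and `flag_risky_targets` are hypothetical, and the parsing assumes the common single-line form `forward FROM TO...`; it is not a full Corefile parser.

```shell
# Print every upstream target of each "forward" directive in a Corefile
# read from stdin. Stops at an opening brace or an inline comment.
forward_targets() {
  awk '
    $1 == "forward" {
      for (i = 3; i <= NF; i++) {
        if ($i == "{" || substr($i, 1, 1) == "#") break
        print $i
      }
    }'
}

# Label each target read from stdin. The categories are a heuristic only.
flag_risky_targets() {
  while read -r target; do
    case "$target" in
      127.*|::1|localhost) echo "$target: loopback, possible forward loop" ;;
      /*) echo "$target: resolver file, inherits upstreams from the node" ;;
      *) echo "$target: explicit upstream" ;;
    esac
  done
}
```

One way to use it is to pipe the live Corefile through both helpers: `kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' | forward_targets | flag_risky_targets`. An explicit upstream can still be dead or blocked, so reachability on port 53 needs a separate check.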
How to Reproduce the Error
Point CoreDNS at a dead upstream resolver and query an external name:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . 192.0.2.1 # TEST-NET, no DNS server here
        cache 30
        loop
        reload
    }
kubectl -n kube-system rollout restart deployment coredns
kubectl run dnstest --rm -it --image=busybox -- nslookup example.com
Server: 10.96.0.10
;; connection timed out; no servers could be reached
*** Can't find example.com: No answer
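Forwarding to a black-holed address (192.0.2.1) makes every external lookup fail: CoreDNS returns SERVFAIL once its forward attempt times out, or the client gives up first and reports that no servers could be reached, as in the output above. Restore the original forward target once you have reproduced the failure.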
Diagnostic Commands
# Confirm SERVFAIL vs NXDOMAIN for the failing name from inside the cluster
kubectl run dnstest --rm -it --image=tutum/dnsutils -- dig api.example.com
# Distinguish internal vs external: in-cluster name should resolve
kubectl run dnstest --rm -it --image=tutum/dnsutils -- dig kubernetes.default.svc.cluster.local
# Inspect the active CoreDNS Corefile (forward target, loop, cache)
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
# Read CoreDNS logs for upstream errors and loop detection
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100 | grep -i 'error\|loop\|timeout'
# Are CoreDNS pods healthy and is the Service backed by endpoints?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system get endpoints kube-dns
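# What upstream does the node hand to CoreDNS? CoreDNS pods normally run with
# dnsPolicy: Default, so a throwaway pod with the same policy prints the inherited
# resolv.conf (for whichever node the pod is scheduled on)
kubectl run resolvcheck --rm -it --restart=Never --image=busybox --overrides='{"apiVersion":"v1","spec":{"dnsPolicy":"Default"}}' -- cat /etc/resolv.conf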
Comparing an internal lookup (should succeed) against the external one that SERVFAILs immediately isolates the broken forward path.
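When running these checks repeatedly, it can help to pull the response code out of the dig header instead of reading it by eye. A minimal sketch follows, assuming the standard `status:` field in full (not `+short`) dig output; the helper names `dig_status` and `next_step` are hypothetical.

```shell
# Extract the RCODE name (NOERROR, NXDOMAIN, SERVFAIL, ...) from full dig
# output on stdin. Prints nothing when no response header is present.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Map a status to the next action, following the split used in this guide.
next_step() {
  case "$1" in
    NOERROR)  echo "resolution works" ;;
    NXDOMAIN) echo "name does not exist: fix the name or Service" ;;
    SERVFAIL) echo "resolver failure: check CoreDNS and its forward path" ;;
    "")       echo "no response header: CoreDNS unreachable or query timed out" ;;
    *)        echo "unexpected status: $1" ;;
  esac
}
```

Inside a pod that has dig installed, `next_step "$(dig api.example.com | dig_status)"` prints the suggested direction for that name.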
Step-by-Step Resolution
1. Confirm it is SERVFAIL, not NXDOMAIN. Run dig <name> from a pod and read the status: line. SERVFAIL = resolver failure (this guide). NXDOMAIN = the name does not exist (fix the name/Service, not CoreDNS).
2. Isolate internal vs external. Query an in-cluster name (kubernetes.default.svc.cluster.local) and an external one. If internal works but external SERVFAILs, the forward/upstream path is broken. If both fail, CoreDNS itself is unhealthy.
3. Check the forward target. Read the Corefile’s forward . line. Verify the upstream address is reachable on UDP/TCP 53. If it forwards to /etc/resolv.conf, inspect the node’s resolver. Replace a dead upstream with a known-good resolver and kubectl -n kube-system rollout restart deployment coredns.
4. Rule out a forward loop. If CoreDNS logs Loop ... detected or crashloops at startup, it is forwarding to a resolver that points back at CoreDNS (common with forward . /etc/resolv.conf when the node uses a local stub). Point the forward at an explicit upstream (e.g., the real corporate DNS) instead of the loopback stub. Keep the loop plugin enabled — it is protecting you.
5. Verify CoreDNS health and capacity. Ensure pods are Running, kube-dns has endpoints, and there are enough replicas for the query load. Scale CoreDNS or raise memory if it is OOMKilled or throttled.
6. Check egress policy. A NetworkPolicy or node firewall blocking UDP/TCP 53 to the upstream produces SERVFAIL. Allow DNS egress from the CoreDNS pods to the upstream resolvers.
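Steps 1 and 2 amount to a small decision table over two lookups, one internal and one external. The sketch below encodes that table; `triage` is a hypothetical helper, and `TIMEOUT` is a label chosen here for "no response at all", not a DNS response code.

```shell
# triage INTERNAL_STATUS EXTERNAL_STATUS
# Each argument is the status of one lookup: NOERROR, NXDOMAIN, SERVFAIL,
# or TIMEOUT (an assumed label for a query that got no reply).
triage() {
  case "$1/$2" in
    NOERROR/NOERROR)
      echo "both lookups succeed: the failure may be intermittent" ;;
    NOERROR/SERVFAIL|NOERROR/TIMEOUT)
      echo "forward path is broken: check the forward target, loops, and DNS egress" ;;
    NOERROR/NXDOMAIN)
      echo "the external name does not exist: fix the name, not CoreDNS" ;;
    SERVFAIL/SERVFAIL|SERVFAIL/TIMEOUT|TIMEOUT/SERVFAIL|TIMEOUT/TIMEOUT)
      echo "CoreDNS itself is unhealthy: check pods, endpoints, and capacity" ;;
    *)
      echo "inconclusive: repeat the queries and read the CoreDNS logs" ;;
  esac
}
```

For example, `triage NOERROR SERVFAIL` points at steps 3, 4, and 6, while `triage SERVFAIL SERVFAIL` points at step 5.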
Prevention and Best Practices
- Forward to explicit, reachable upstream resolvers rather than relying solely on /etc/resolv.conf to avoid loop and inheritance surprises.
- Keep the loop plugin enabled; it converts silent recursion into a fast, visible failure.
- Run at least two CoreDNS replicas with a PodDisruptionBudget and right-sized memory so query spikes do not cause SERVFAIL.
- Enable the cache plugin to absorb upstream blips and reduce load on forwarders.
- Monitor CoreDNS metrics (coredns_dns_responses_total{rcode="SERVFAIL"}) and alert on rising SERVFAIL rates.
- Allow DNS egress (UDP/TCP 53) explicitly in NetworkPolicies that default-deny.
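As a quick spot check of the metric mentioned above, the SERVFAIL share can be computed from a raw metrics scrape. This is a sketch under assumptions: `servfail_ratio` is a hypothetical helper, it expects Prometheus text-format lines with the sample value as the last field, and it assumes the prometheus plugin is enabled in the Corefile (commonly on port 9153). The counters are cumulative since CoreDNS started, so the result is a lifetime share rather than a rate; real alerting is better done on rates in the monitoring system.

```shell
# Read Prometheus text-format metrics on stdin and print the fraction of
# coredns_dns_responses_total samples that carry rcode="SERVFAIL".
servfail_ratio() {
  awk '
    index($0, "coredns_dns_responses_total{") == 1 {
      total += $NF
      if ($0 ~ /rcode="SERVFAIL"/) fail += $NF
    }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", fail / total
    }'
}
```

Assuming the metrics port has been forwarded locally (for instance with `kubectl -n kube-system port-forward deployment/coredns 9153:9153`), `curl -s localhost:9153/metrics | servfail_ratio` prints a value between 0 and 1.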
Related Errors
- No endpoints available for service — a Service with no backends, which can make CoreDNS itself unreachable.
- Connection refused (pod-to-pod / service) — what an app hits after a name resolves but the target is not listening.
- Context deadline exceeded — the timeout pattern slow DNS forwarding triggers.
Frequently Asked Questions
What is the difference between SERVFAIL and NXDOMAIN? NXDOMAIN is authoritative: the name definitively does not exist, and resolution itself succeeded. SERVFAIL means resolution failed — an upstream was unreachable, timed out, looped, or returned a broken answer. A valid name can SERVFAIL when the forward path is broken.
Internal names resolve but external ones SERVFAIL — what is wrong? CoreDNS resolves in-cluster names itself but forwards external ones. A working internal lookup with failing external lookups isolates the problem to the forward path: a dead upstream, a blocked egress, or a misconfigured forward target.
CoreDNS crashloops with Loop ... detected at startup. What does that mean? The loop plugin found that CoreDNS forwards to a resolver that ultimately points back at CoreDNS (often a local stub in the node’s /etc/resolv.conf). Point the forward at an explicit upstream resolver instead of the loopback stub.
Should I disable the loop plugin to stop the crash? No. The loop is a real misconfiguration that would otherwise cause infinite recursion and SERVFAILs. Fix the forward target; the loop plugin is correctly preventing a worse failure.
Can a NetworkPolicy cause SERVFAIL? Yes. If a default-deny policy blocks the CoreDNS pods’ egress on UDP/TCP 53 to the upstream resolvers, forwarded queries time out and return SERVFAIL. Add an explicit DNS egress allow rule.
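An explicit DNS egress allow rule could look roughly like the manifest printed by the helper below. Everything specific here is an assumption: the function name `dns_egress_policy`, the policy name, and the upstream CIDR you pass in. NetworkPolicies are additive, so this only helps when a default-deny egress policy already selects the CoreDNS pods. If no other egress policy selects them, applying this one alone would restrict CoreDNS to DNS traffic only and cut off its access to the kube-apiserver.

```shell
# Print a NetworkPolicy that allows CoreDNS pods to reach one upstream
# resolver range on UDP/TCP 53. The argument is the upstream CIDR.
dns_egress_policy() {
  upstream_cidr="$1"   # hypothetical value, e.g. your resolver's address as a /32
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-coredns-upstream-dns
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: ${upstream_cidr}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

Review the output first, then pipe it to `kubectl apply -f -` once the CIDR matches the upstream resolvers named in the Corefile.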