Kubernetes Error Guide: 'SERVFAIL' from CoreDNS Resolution Failure

Exact Error Message

When CoreDNS cannot answer a query because of a failure (not because the name does not exist), it returns SERVFAIL. Applications and dig surface it like this:

$ dig api.example.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4242

$ nslookup api.example.com
;; Got SERVFAIL reply from 10.96.0.10, trying next server
** server can't find api.example.com: SERVFAIL

In application logs it appears as a resolver failure. A Go program, for example, reports:

dial tcp: lookup api.example.com on 10.96.0.10:53: server misbehaving

CoreDNS itself often logs the upstream failure:

[ERROR] plugin/errors: 2 api.example.com. A: read udp 10.244.0.5:43210->8.8.8.8:53: i/o timeout

SERVFAIL is a server failure response code (RCODE 2), distinct from NXDOMAIN (RCODE 3, “this name definitively does not exist”).

What the Error Means

DNS responses carry a result code. NXDOMAIN is an authoritative “no such name” — resolution worked, the name simply does not exist. SERVFAIL is different: the resolver tried to answer and failed — it could not reach an upstream, hit a timeout, detected a loop, or got a broken response. The name might be perfectly valid; CoreDNS just could not complete the lookup.
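The result code is visible in the status: field of the header line that dig prints by default, so the distinction can be read mechanically. The following is a minimal sketch, assuming dig's default (not +short) output on standard input; dig_status is a hypothetical helper name, not part of dig.

```shell
# dig_status: print the RCODE found in dig output read from stdin.
# Hypothetical helper; it assumes dig's default header line, e.g.
#   ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 1234
dig_status() {
  sed -n 's/.*status: \([A-Z0-9]*\),.*/\1/p' | head -n 1
}
```

Piping a lookup through it, as in dig api.example.com | dig_status, prints the status word (NOERROR, NXDOMAIN, SERVFAIL, and so on), which is easier to compare in scripts than the full output.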

In a cluster, pods send queries to the CoreDNS Service (usually 10.96.0.10 / kube-dns). CoreDNS resolves in-cluster names (*.svc.cluster.local) itself, and forwards external names to upstream resolvers (from the node’s /etc/resolv.conf or an explicit forward block). A SERVFAIL for an external name almost always means the forward path is broken — the upstream is down, unreachable, slow, or CoreDNS is configured to forward to itself and detects a loop. For in-cluster names, SERVFAIL points at CoreDNS health, plugin errors, or backend (kube-apiserver) problems. The key triage step is whether the failing name is internal or external.
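The internal-versus-external triage described above can be written down as a small decision rule. This is only a sketch: triage_dns and its output strings are made up for illustration, and the two arguments are assumed to be the RCODEs of one in-cluster lookup and one external lookup.

```shell
# triage_dns: map the RCODEs of an in-cluster and an external lookup
# to the area worth checking first. Function name and messages are
# hypothetical; the rule itself follows the paragraph above.
triage_dns() {
  internal_status=$1
  external_status=$2
  if [ "$internal_status" != "NOERROR" ]; then
    echo "check CoreDNS health, plugins, and kube-apiserver access"
  elif [ "$external_status" = "SERVFAIL" ]; then
    echo "check the forward path to the upstream resolvers"
  elif [ "$external_status" = "NXDOMAIN" ]; then
    echo "name does not exist; not a CoreDNS failure"
  elif [ "$external_status" = "NOERROR" ]; then
    echo "no DNS failure detected"
  else
    echo "unexpected RCODE: $external_status"
  fi
}
```

For example, triage_dns NOERROR SERVFAIL points at the forward path, while triage_dns SERVFAIL SERVFAIL points at CoreDNS itself.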

Common Causes

Upstream resolver down/unreachable: the forwarder (node DNS, 8.8.8.8, corporate DNS) is unreachable or timing out.
Forward loop: CoreDNS forwards to a resolver that points back at CoreDNS. The loop plugin detects this at startup and halts CoreDNS, so the pods crashloop; without the plugin, the looping queries end in SERVFAIL.
Bad forward config: the Corefile forward . <addr> targets a wrong or dead address.
Node /etc/resolv.conf broken: CoreDNS inherits a bad upstream from the node when using forward . /etc/resolv.conf.
CoreDNS overloaded/crashlooping: too few replicas or too little memory for the query load, so queries time out or come back SERVFAIL.
NetworkPolicy/firewall: egress to upstream DNS (UDP/TCP 53) is blocked.
DNSSEC/validation failure: a broken DNSSEC chain makes a validating upstream answer SERVFAIL.

How to Reproduce the Error

Point CoreDNS at a dead upstream resolver and query an external name:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . 192.0.2.1   # TEST-NET, no DNS server here
        cache 30
        loop
        reload
    }

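# Apply the ConfigMap above, then restart CoreDNS so every replica loads it
kubectl -n kube-system rollout restart deployment coredns
# Query an external name from a throwaway pod
kubectl run dnstest --rm -it --restart=Never --image=busybox -- nslookup example.com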

Server:    10.96.0.10
;; connection timed out; no servers could be reached
*** Can't find example.com: No answer

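Forwarding to a black-holed address (192.0.2.1) makes every external lookup fail. CoreDNS answers SERVFAIL once its forward attempt times out; a client whose own timeout is shorter gives up first and reports a timeout, as in the output above.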

Diagnostic Commands

# Confirm SERVFAIL vs NXDOMAIN for the failing name from inside the cluster
kubectl run dnstest --rm -it --restart=Never --image=tutum/dnsutils -- dig api.example.com

# Distinguish internal vs external: in-cluster name should resolve
kubectl run dnstest --rm -it --restart=Never --image=tutum/dnsutils -- dig kubernetes.default.svc.cluster.local

# Inspect the active CoreDNS Corefile (forward target, loop, cache)
kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'

# Read CoreDNS logs for upstream errors and loop detection
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100 | grep -iE 'error|loop|timeout'

# Are CoreDNS pods healthy and is the Service backed by endpoints?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system get endpoints kube-dns

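# What upstream does the node hand to CoreDNS? A dnsPolicy: Default pod gets the same
# resolv.conf from the kubelet (on whichever node it lands, which may not be CoreDNS's)
kubectl run resolvcheck --rm -it --restart=Never --image=busybox --overrides='{"apiVersion":"v1","spec":{"dnsPolicy":"Default"}}' -- cat /etc/resolv.conf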

If the internal lookup succeeds while the external one returns SERVFAIL, the fault is isolated to the forward path.
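To probe the forward path directly, it helps to pull the upstream addresses out of the Corefile first. The helper below is a sketch under stated assumptions: forward_targets is a hypothetical name, and it only understands the single-line forward FROM TO... form, stopping at an opening brace or a comment.

```shell
# forward_targets: list the upstreams named on "forward" lines of a
# Corefile read from stdin. Hypothetical helper; it skips the FROM
# field and stops at "{" or at a "#" comment.
forward_targets() {
  awk '$1 == "forward" {
    for (i = 3; i <= NF; i++) {
      if ($i == "{" || $i ~ /^#/) break
      print $i
    }
  }'
}
```

A possible use is kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' | forward_targets, followed by dig @<address> example.com +time=2 +tries=1 from a pod for each IP address printed, once as is and once with +tcp. A target of /etc/resolv.conf means the upstreams come from the node's resolver instead.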

Step-by-Step Resolution

1. Confirm it is SERVFAIL, not NXDOMAIN. Run dig <name> from a pod and read the status: line. SERVFAIL = resolver failure (this guide). NXDOMAIN = the name does not exist (fix the name/Service, not CoreDNS).

2. Isolate internal vs external. Query an in-cluster name (kubernetes.default.svc.cluster.local) and an external one. If internal works but external SERVFAILs, the forward/upstream path is broken. If both fail, CoreDNS itself is unhealthy.

3. Check the forward target. Read the Corefile’s forward . line. Verify the upstream address is reachable on UDP/TCP 53. If it forwards to /etc/resolv.conf, inspect the node’s resolver. Replace a dead upstream with a known-good resolver, then run kubectl -n kube-system rollout restart deployment coredns.

4. Rule out a forward loop. If CoreDNS logs Loop ... detected or crashloops at startup, it is forwarding to a resolver that points back at CoreDNS (common with forward . /etc/resolv.conf when the node uses a local stub). Point the forward at an explicit upstream (e.g., the real corporate DNS) instead of the loopback stub. Keep the loop plugin enabled — it is protecting you.

5. Verify CoreDNS health and capacity. Ensure pods are Running, kube-dns has endpoints, and there are enough replicas for the query load. Scale CoreDNS or raise memory if it is OOMKilled or throttled.

6. Check egress policy. A NetworkPolicy or node firewall blocking UDP/TCP 53 to the upstream produces SERVFAIL. Allow DNS egress from the CoreDNS pods to the upstream resolvers.
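The egress allowance from the last step might look like the policy printed by the sketch below. Several things here are assumptions rather than facts about any given cluster: the policy name is invented, the CoreDNS pods are assumed to carry the k8s-app=kube-dns label used elsewhere in this guide, and the upstream is passed in as a CIDR.

```shell
# dns_egress_policy: print a NetworkPolicy that lets the CoreDNS pods
# reach one upstream resolver on UDP and TCP port 53.
# Assumptions: invented policy name, k8s-app=kube-dns pod label,
# and $1 holding the upstream resolver as a CIDR.
dns_egress_policy() {
  upstream_cidr=$1
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-coredns-upstream-dns
  namespace: kube-system
spec:
  podSelector:
    matchLabels:
      k8s-app: kube-dns
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: ${upstream_cidr}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

It could be applied with dns_egress_policy 10.0.0.53/32 | kubectl apply -f -, where 10.0.0.53 is a placeholder address. Egress rules are additive across policies, but once any Egress policy selects the CoreDNS pods, everything not explicitly allowed is dropped, so those pods also need a rule that permits traffic to the kube-apiserver.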

Prevention and Best Practices

Forward to explicit, reachable upstream resolvers rather than relying solely on /etc/resolv.conf to avoid loop and inheritance surprises.
Keep the loop plugin enabled; it converts silent recursion into a fast, visible failure.
Run at least two CoreDNS replicas with a PodDisruptionBudget and right-sized memory so query spikes do not cause SERVFAIL.
Enable the cache plugin to absorb upstream blips and reduce load on forwarders.
Monitor CoreDNS metrics (coredns_dns_responses_total{rcode="SERVFAIL"}) and alert on rising SERVFAIL rates.
Allow DNS egress (UDP/TCP 53) explicitly in NetworkPolicies that default-deny.
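For a quick spot check before alerting is in place, the SERVFAIL share can be computed from CoreDNS's metrics text. This sketch assumes the prometheus plugin is enabled (the reproduction Corefile in this guide does not enable it) and that the metrics text is piped in on standard input; servfail_ratio is a hypothetical helper name.

```shell
# servfail_ratio: read Prometheus text-format metrics on stdin and
# print SERVFAIL responses as a fraction of all DNS responses.
# Hypothetical helper; counters are cumulative since CoreDNS started,
# so this is a lifetime share, not a current rate.
servfail_ratio() {
  awk '
    index($0, "coredns_dns_responses_total{") == 1 {
      total += $NF
      if ($0 ~ /rcode="SERVFAIL"/) servfail += $NF
    }
    END {
      if (total > 0) printf "%.4f\n", servfail / total
      else print "0.0000"
    }
  '
}
```

With the plugin listening on its conventional port 9153, one way to feed it is kubectl -n kube-system port-forward deploy/coredns 9153:9153 in one terminal and curl -s localhost:9153/metrics | servfail_ratio in another. For real alerting, a rate() over the same counter in Prometheus is the better fit, because it reflects recent traffic rather than the whole process lifetime.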

Related Errors

No endpoints available for service: a Service with no backends, which can make CoreDNS itself unreachable.
Connection refused (pod-to-pod / service): what an app hits after a name resolves but the target is not listening.
Context deadline exceeded: the timeout pattern that slow DNS forwarding triggers.

Frequently Asked Questions

What is the difference between SERVFAIL and NXDOMAIN? NXDOMAIN is authoritative: the name definitively does not exist, and resolution itself succeeded. SERVFAIL means resolution failed — an upstream was unreachable, timed out, looped, or returned a broken answer. A valid name can SERVFAIL when the forward path is broken.

Internal names resolve but external ones SERVFAIL — what is wrong? CoreDNS resolves in-cluster names itself but forwards external ones. A working internal lookup with failing external lookups isolates the problem to the forward path: a dead upstream, a blocked egress, or a misconfigured forward target.

CoreDNS crashloops with Loop ... detected at startup. What does that mean? The loop plugin found that CoreDNS forwards to a resolver that ultimately points back at CoreDNS (often a local stub in the node’s /etc/resolv.conf). Point the forward at an explicit upstream resolver instead of the loopback stub.

Should I disable the loop plugin to stop the crash? No. The loop is a real misconfiguration that would otherwise cause infinite recursion and SERVFAILs. Fix the forward target; the loop plugin is correctly preventing a worse failure.
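One way to spot the loop condition ahead of time is to look for loopback nameservers in the resolv.conf that CoreDNS would inherit. The helper below is a sketch; loopback_nameservers is a hypothetical name, and it only checks for 127.0.0.0/8 and ::1 addresses.

```shell
# loopback_nameservers: print nameserver entries from a resolv.conf on
# stdin that point at the local host (127.0.0.0/8 or ::1).
# Hypothetical helper name.
loopback_nameservers() {
  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }'
}
```

Run on a node as loopback_nameservers < /etc/resolv.conf, any output (for instance 127.0.0.53, the systemd-resolved stub) marks that file as a likely source of the loop when the Corefile uses forward . /etc/resolv.conf. The kubelet can be configured to hand pods a different resolv.conf file, so the file to check is whichever one the kubelet actually uses.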

Can a NetworkPolicy cause SERVFAIL? Yes. If a default-deny policy blocks the CoreDNS pods’ egress on UDP/TCP 53 to the upstream resolvers, forwarded queries time out and return SERVFAIL. Add an explicit DNS egress allow rule.
