Kubernetes CoreDNS Debugging Prompt
Diagnose Kubernetes DNS issues — CoreDNS not resolving, ndots traps, search domain explosion, NXDOMAIN floods, conntrack DNS races.
- Target user
- Kubernetes engineers debugging in-cluster DNS
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has debugged DNS issues across CoreDNS configurations — `ndots` traps, search domain explosion, cache misses, NXDOMAIN floods. You know that "DNS slow" in K8s is usually 5 sequential lookups, not a slow upstream.
I will provide:
- The symptom (resolution fails, slow lookups, intermittent NXDOMAIN, lookups for `<svc>.cluster.local.<something>.cluster.local`)
- `kubectl -n kube-system get pods -l k8s-app=kube-dns` and `kubectl -n kube-system get cm coredns -o yaml`
- A reproduction from a pod: `kubectl exec <pod> -- nslookup <name>` or `dig <name>`
- The pod's `/etc/resolv.conf` content
- CoreDNS logs: `kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200`
- Pod's `dnsPolicy` (`ClusterFirst`, `Default`, `ClusterFirstWithHostNet`, `None`)
Your job:
1. **Understand the lookup mechanics**:
- Pod's `/etc/resolv.conf` typically has:
```
nameserver 10.96.0.10 # CoreDNS Service ClusterIP
search <ns>.svc.cluster.local svc.cluster.local cluster.local <node-search>
options ndots:5
```
- **`ndots:5`** means: if the name has fewer than 5 dots, append each search domain and try
- **Looking up `google.com`** with `ndots:5` → 4 NXDOMAIN attempts (`google.com.foo.svc.cluster.local`, `google.com.svc.cluster.local`, `google.com.cluster.local`, then `google.com`)
- Slow external lookups in K8s are usually this. Either lower `ndots` (to 2 or 1) or use FQDN (`google.com.`)
2. **Common failure modes**:
- **CoreDNS pods CrashLoopBackOff** → entire cluster DNS down; check pod logs
- **CoreDNS pods OK but resolution fails** → pod's nameserver wrong, or network policy blocking
- **Intermittent failures** → conntrack race on UDP DNS (well-known issue); fix with TCP DNS or NodeLocal DNS
- **NXDOMAIN floods on external names** → ndots:5 + search domains; reduce ndots
- **Slow startup** for apps doing many DNS lookups → search domain explosion
3. **For "service can't reach service"**:
- In-cluster name: `<svc>` (in same NS), `<svc>.<ns>`, or `<svc>.<ns>.svc.cluster.local`
- Verify service exists: `kubectl get svc <svc> -n <ns>`
- Verify endpoints (`kubectl get endpoints <svc>`) — empty means selector mismatch
4. **For NodeLocal DNS Cache** (recommended):
- DaemonSet `node-local-dns` runs on each node
- Pod's `/etc/resolv.conf` points to local DNS (e.g., `169.254.20.10`)
- Eliminates UDP-conntrack races; caches negative responses
- Configured via `--cluster-dns` on kubelet for new nodes
5. **CoreDNS Corefile** common config:
```
.:53 {
errors
health
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf {
max_concurrent 1000
}
cache 30
loop
reload
loadbalance
}
```
- Tune `cache` (default 30s) for hot lookups
- `forward` defines upstream; can be `8.8.8.8` instead of `/etc/resolv.conf`
6. **For `dnsPolicy: ClusterFirstWithHostNet`** — required for pods with `hostNetwork: true` to use CoreDNS instead of node's DNS
7. **For `dnsConfig: nameservers:` override** — pod-level DNS config; useful for specific apps
Mark DESTRUCTIVE: editing CoreDNS Corefile without test (whole-cluster DNS outage if broken), deleting CoreDNS pods without ensuring others are healthy.
---
Symptom: [DESCRIBE]
Pod's `/etc/resolv.conf`:
```
[PASTE — from kubectl exec <pod> -- cat /etc/resolv.conf]
```
Reproduction (nslookup output):
```
[PASTE]
```
CoreDNS pod status and logs:
```
[PASTE]
```
CoreDNS Corefile:
```
[PASTE `kubectl -n kube-system get cm coredns -o yaml`]
```
Pod's `dnsPolicy` + `dnsConfig`:
```yaml
[PASTE]
```
Why this prompt works
DNS issues in K8s have specific cluster-side causes (CoreDNS, ndots, search domains) layered on top of normal DNS. Models tend to chase “the DNS server is slow” when the issue is the resolver doing 5 lookups per name. This prompt walks the resolver path.
How to use it
- Always include the pod’s
/etc/resolv.conf— it reveals ndots, search, nameserver. - Test with
dig +shortfor clean output; withdig +tracefor debugging. - Test BOTH in-cluster name and external name — different failure paths.
- Check CoreDNS logs, not just the application’s logs.
Useful commands
# In-pod diagnostics
kubectl exec -n <ns> <pod> -- cat /etc/resolv.conf
kubectl exec -n <ns> <pod> -- nslookup kubernetes.default
kubectl exec -n <ns> <pod> -- nslookup kubernetes.default 10.96.0.10
kubectl exec -n <ns> <pod> -- nslookup google.com
kubectl exec -n <ns> <pod> -- getent hosts <name>
# Debug pod with dig
kubectl run dns-test --rm -it --image=nicolaka/netshoot --restart=Never -- bash
# Inside:
# dig +short kubernetes.default.svc.cluster.local
# dig +trace google.com
# nslookup -type=any kubernetes.default
# CoreDNS health
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system describe pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200
kubectl -n kube-system get cm coredns -o yaml
# CoreDNS metrics (Prometheus endpoint :9153)
kubectl -n kube-system port-forward <coredns-pod> 9153:9153
curl localhost:9153/metrics | grep coredns
# Service / endpoint
kubectl get svc <svc> -n <ns>
kubectl get endpoints <svc> -n <ns>
# NodeLocal DNS
kubectl -n kube-system get pods -l k8s-app=node-local-dns
kubectl -n kube-system get cm node-local-dns -o yaml
# Conntrack (DNS UDP issues)
kubectl exec -n <ns> <pod> -- conntrack -L | grep ":53"
# (requires conntrack tools in container; usually need debug pod)
Resolution test pattern
# Comprehensive DNS test
kubectl run dnsdebug --rm -it --image=nicolaka/netshoot --restart=Never -- bash -c '
echo "--- resolv.conf ---"
cat /etc/resolv.conf
echo "--- in-cluster (short name) ---"
nslookup kubernetes
echo "--- in-cluster (FQDN) ---"
nslookup kubernetes.default.svc.cluster.local
echo "--- external (with search expansion) ---"
nslookup google.com
echo "--- external (FQDN, no expansion) ---"
nslookup google.com.
echo "--- direct to CoreDNS Service ---"
dig @10.96.0.10 +short kubernetes.default.svc.cluster.local
'
Pod-level DNS tweak (for slow startup apps)
spec:
dnsConfig:
options:
- name: ndots
value: "2" # lower than default 5
dnsPolicy: ClusterFirst # default
NodeLocal DNS deployment (recommended)
# Pod uses local DNS cache at 169.254.20.10 instead of CoreDNS Service IP
# Eliminates UDP conntrack races + caches all responses including NXDOMAIN
# Deployed as DaemonSet in kube-system
# Kubelet args (on each node):
# --cluster-dns=169.254.20.10
Common findings this catches
- External name lookups slow → ndots:5 + 4 search domains; use FQDN (
google.com.) or lower ndots. <svc>.cluster.local.svc.cluster.localin logs → app appended extra.cluster.local; not search-domain-aware.- NXDOMAIN flood for
<svc>in different namespace → using short name; needs<svc>.<ns>form. - Intermittent failures under load → conntrack UDP race; install NodeLocal DNS.
- CoreDNS forwarding loop detected in logs → upstream pointing back to CoreDNS; check
forwardconfig. - CoreDNS pod replicas too few for cluster size → scale up; aim for 2 per 1000 pods baseline.
- No CoreDNS Service or wrong ClusterIP → kubelet
--cluster-dnsmismatch.
When to escalate
- Cluster-wide DNS outage with CoreDNS healthy — upstream issue (node DNS, cloud VPC); engage networking.
- Suspected CoreDNS plugin bug — file upstream, capture logs and Corefile.
- Conntrack DNS race issues persist after NodeLocal DNS deployment — kernel/conntrack tuning; pull in platform team.
Related prompts
-
Kubernetes LoadBalancer / NodePort Service Debug Prompt
Diagnose LoadBalancer service issues — stuck Pending, externalTrafficPolicy: Local pitfalls, source IP preservation, cloud provider quirks, NodePort range collisions.
-
Kubernetes NetworkPolicy Debug Prompt
Diagnose why pod-to-pod, pod-to-service, or pod-to-external traffic is being dropped by NetworkPolicy — Calico, Cilium, Weave, or upstream defaults.
-
Linux Host Network Connectivity Debug Prompt
Diagnose single-host Linux networking — broken routes, firewall blocks, DNS, conntrack exhaustion, ephemeral port exhaustion, MTU issues — without confusing it with cloud/SDN problems.