NodeLocal DNSCache Performance Prompt
Deploy and tune NodeLocal DNSCache to cut cluster DNS latency, avoid conntrack races, and relieve CoreDNS overload, with the right upstream config, cache TTLs, and rollout safety.
- Target user: Platform engineers fixing slow or flaky in-cluster DNS
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has rolled out NodeLocal DNSCache across large clusters to kill DNS tail latency and the infamous 5-second timeout.

I will provide:
- Symptoms (intermittent DNS timeouts, ~5s latency spikes, CoreDNS CPU spikes, `i/o timeout` on resolves)
- Current DNS setup (CoreDNS replicas, kube-proxy mode iptables vs IPVS, CNI)
- Cluster scale (nodes, pods, QPS to kube-dns)
- Whether NodeLocal DNSCache is already deployed
- App profile (chatty clients, short-lived connections, external lookups)

Work through this:
1. **Diagnose the real problem**: distinguish the classic conntrack DNAT race (UDP, the 5s timeout) from genuine CoreDNS overload and from app-side misconfiguration (`ndots:5` search-domain explosion). Give the commands/metrics to tell them apart before changing anything.
2. **Why NodeLocal helps**: explain the per-node caching DaemonSet listening on a link-local IP, how it serves from cache and uses TCP to upstream CoreDNS, sidestepping the conntrack race and cutting cross-node hops.
3. **Architecture & config**: the DaemonSet, the link-local listen IP (e.g. `169.254.20.10`), the Corefile blocks (cluster.local zone, in-addr.arpa, the catch-all forward), and how pods are wired to it: in iptables mode the cache also binds the kube-dns ClusterIP on a dummy interface, so pods need no change; in IPVS mode kubelet's `--cluster-dns` must point at the link-local IP.
4. **Cache & TTL tuning**: `cache` block sizing, success/denial TTLs, and `prefetch` for hot records; `forward` to CoreDNS with `force_tcp`.
5. **App-side fixes**: reduce `ndots`, use FQDNs (trailing dot) for known externals, and `dnsConfig` overrides to cut search-domain fan-out that no cache fully fixes.
6. **Rollout safety**: DaemonSet rollout strategy, health checks, and the failure mode if the local cache dies (does traffic fall back to CoreDNS?). Plan a per-node-pool canary.
7. **Validation**: measure p99 resolve latency, conntrack `insert_failed`, CoreDNS QPS, and timeout rate before/after.
Output as: (a) root-cause assessment for my symptoms, (b) the NodeLocal DNSCache manifest/Corefile tuned for my cluster, (c) the `dnsConfig`/ndots app-side recommendations, (d) a canary rollout plan with rollback, (e) the before/after metrics to capture. Be explicit about kube-proxy mode and CNI assumptions, since the wiring differs.
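To make the "search-domain explosion" in step 5 concrete, the sketch below lists the names a glibc-style stub resolver would try for a given hostname. It is a simplified model, not a resolver implementation: the function name `candidate_queries` and the `SEARCH` list are illustrative assumptions (a typical search path for a pod in the `default` namespace), and other libc resolvers such as musl order or skip these attempts differently.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a glibc-style stub resolver may try, in order.

    Simplified model: lookups stop at the first successful answer, so this
    list is the worst case (e.g. an external name that misses every suffix).
    """
    if name.endswith("."):
        # Absolute name (trailing dot): the search list is skipped entirely.
        return [name]

    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]

    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search suffixes.
        return [absolute] + expanded
    # Too few dots: walk the search suffixes first, then the name as-is.
    return expanded + [absolute]


# Assumed search path for a pod in namespace "default" (illustrative values).
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for query in candidate_queries("api.example.com", SEARCH):
        print(query)
```

With the default `ndots:5`, `api.example.com` has only two dots, so up to four names are tried, and each may be queried for both A and AAAA records. Writing `api.example.com.` or lowering `ndots` to 2 makes the name itself the first attempt, so a successful lookup stops after one name.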