Diagnosing DNS Incidents: When It Really Is Always DNS

The page says connection refused across three services, the dashboards are mostly green, and someone in the war room types the inevitable: “it’s probably DNS.” They’re often right, and that’s exactly the problem. “It’s always DNS” is a meme because the failure can live in any of four layers, each with a different fix, and the one thing that reliably makes a DNS incident worse is editing a record before you know which layer is broken.

This guide is about the discipline of localizing a DNS failure under pressure — finding the broken layer first, then fixing it — rather than guessing at records and waiting out TTLs to find out you were wrong.

The four layers, and why the symptom hides the cause

Every name lookup passes through layers that fail independently:

The stub/client resolver — the host’s own /etc/resolv.conf, container DNS, or a misconfigured search domain. When this breaks, some hosts fail and others don’t.
The recursive/caching resolver — your internal resolver, CoreDNS, or the cloud VPC resolver. When this breaks, an entire network segment fails at once.
The authoritative zone — the nameservers that hold the actual records. When this breaks, everyone fails to resolve a specific zone.
Propagation and TTL — a record was changed but old values are still cached. Failures are inconsistent: some resolvers see the new record, some the old.

The user-facing symptom — “service unreachable” — is identical across all four. So the first move is never “fix the record.” It’s “which layer?”

Localize before you touch anything

Two queries usually pin the layer. First, ask a public resolver and your internal resolver the same question and compare:

dig +short api.internal.example.com @1.1.1.1
dig +short api.internal.example.com @10.0.0.2   # your resolver

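If the public resolver answers correctly and yours doesn’t, the problem is your recursive resolver or its config, not the authoritative zone. One caveat: for a name that lives only in a private or split-horizon zone, a public resolver is expected to return nothing, so compare two internal resolvers instead. Next, query the authoritative nameservers directly, bypassing all caching: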

# find the authoritative servers, then query one directly
dig +short NS example.com
dig +norecurse api.internal.example.com @ns1.example.com

If the authoritative server returns the right record but resolvers in the wild don’t, you’re looking at propagation or a caching problem, and the fix is patience plus TTL management, not another record edit. If the authoritative server itself is wrong or unreachable, you’ve found it — and now you also know the blast radius is everyone, not a subset.
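
The decision logic above is small enough to write down. The sketch below is one way to encode it in shell, assuming a single-valued record so that answers can be compared as plain strings. The function name classify_layer, its argument order, and the labels it prints are invented for this example.

```shell
# Sketch: map three observations onto the layers described in this guide.
# All names here are hypothetical.
#   $1 expected : the record value you believe is correct
#   $2 auth     : answer from an authoritative server (dig +norecurse ... @ns1)
#   $3 internal : answer from your recursive resolver
#   $4 client   : answer seen on an affected host via its default resolver path
classify_layer() {
  local expected="$1" auth="$2" internal="$3" client="$4"
  if [ "$auth" != "$expected" ]; then
    echo "authoritative"   # wrong or missing at the source: everyone is affected
  elif [ -z "$internal" ]; then
    echo "recursive"       # source is right, resolver returns nothing
  elif [ "$internal" != "$expected" ]; then
    echo "propagation"     # source is right, resolver holds a different (stale) value
  elif [ "$client" != "$expected" ]; then
    echo "client"          # resolver is right, the host's own path is not
  else
    echo "no-dns-fault-found"
  fi
}

# Example use (read-only queries; server names as used elsewhere in this guide):
#   classify_layer 10.1.2.3 \
#     "$(dig +short +norecurse api.internal.example.com @ns1.example.com)" \
#     "$(dig +short api.internal.example.com @10.0.0.2)" \
#     "$(dig +short api.internal.example.com)"
```

The output is a hint about where to look next, not a verdict. Multi-valued records or CNAME chains would need a comparison that is smarter than string equality.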

Always check the basics in parallel: resolvectl status (on hosts running systemd-resolved) for the resolver chain, anything with an expiry date (the domain registration at the registrar, DNSSEC signature validity windows), and clock skew, which quietly breaks DNSSEC validation.
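
The DNSSEC part of that check can be done read-only. RRSIG records carry an expiration and an inception timestamp in YYYYMMDDHHMMSS form, so a short filter over dig output can flag signatures that are outside their window. This is a minimal sketch: the function name rrsig_status is made up here, and it assumes the standard presentation format that dig prints with +dnssec.

```shell
# Sketch: read `dig +dnssec +noall +answer` output on stdin and report whether
# each RRSIG is inside its validity window. rrsig_status is a hypothetical name.
# Optional $1: the time to compare against, as YYYYMMDDHHMMSS in UTC
# (defaults to the local clock, which is itself one of the things being checked).
rrsig_status() {
  local now="${1:-$(date -u +%Y%m%d%H%M%S)}"
  awk -v now="$now" '
    # Fields: name ttl class RRSIG covered alg labels origttl expiration inception ...
    $4 == "RRSIG" {
      expires = $9; incept = $10
      if (now > expires)     state = "expired"
      else if (now < incept) state = "not-yet-valid"
      else                   state = "valid"
      print $5, state
    }'
}

# Example use:
#   dig +dnssec +noall +answer example.com A @ns1.example.com | rrsig_status
```

A "not-yet-valid" result on a signature that validates fine elsewhere is a hint that the local clock is behind, which ties back to the clock-skew check.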

Let AI structure the triage, not make the change

This is a place where an assistant earns its keep, because the diagnostic tree is wide and an exhausted on-call skips steps. Paste the symptom, the affected scope, recent changes, and your dig output, and ask the model to localize the layer and propose ordered, read-only checks. Keep the change authority human — TTLs make DNS edits slow to reverse.

Prompt: “We have intermittent resolution failures for api.internal.example.com — some pods resolve it, some get NXDOMAIN. Here’s dig from two resolvers and our recent change log. Which layer is most likely broken, what read-only checks confirm it, and what’s the safest mitigation with its rollback?”

Response (abridged): “Inconsistent NXDOMAIN across pods with a recent record edit points to a propagation/caching split, not authoritative failure — confirm by querying the authoritative NS directly (dig +norecurse ... @ns1); if it returns the correct record, the issue is stale caches expiring at different times. Safest mitigation: lower the TTL going forward and let caches expire; avoid further edits. A hosts-file pin is an emergency stopgap only — track it for reversal.”

That output is a checklist you run, not a change you apply. The DNS resolution failure live diagnosis prompt packages this with the exact commands and an AI-proposes-human-decides guardrail built in.
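
Gathering the dig output to paste is itself a step worth scripting, since an exhausted on-call tends to capture one resolver and forget the rest. The sketch below collects the same read-only query from several resolvers into one labelled block of text. The function name collect_dns_evidence and the "##" label format are assumptions made for this example.

```shell
# Sketch: collect read-only evidence for one name from several resolvers.
# collect_dns_evidence is a hypothetical helper; it changes nothing.
#   $1     : the name to look up
#   $2...  : resolver addresses to query
collect_dns_evidence() {
  local name="$1"; shift
  local r
  echo "## name: $name"
  echo "## collected (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  for r in "$@"; do
    echo "## resolver: $r"
    # +comments keeps the header with the response status (NOERROR, NXDOMAIN, ...)
    dig +noall +comments +answer +time=2 +tries=1 "$name" "@$r" 2>&1
  done
}

# Example use:
#   collect_dns_evidence api.internal.example.com 1.1.1.1 10.0.0.2 > evidence.txt
```

Review the file before pasting it anywhere: internal names and addresses may not belong in an external tool.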

Mitigations, ranked by how hard they are to undo

Once you know the layer, choose the mitigation with reversal cost in mind:

Correct the record — right fix for authoritative errors, but slow to take effect if the TTL is high. Lower the TTL before the change when you can.
Resolver failover — repoint hosts to a working resolver for recursive-layer failures; fast and reversible.
Hosts-file or in-cluster pin — a genuine emergency stopgap when nothing else is fast enough. It is not a fix. Stale pins cause their own outages weeks later, so track every one as an open loop for reversal.
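
If you do reach for a hosts-file pin, tagging it at creation time makes the later cleanup mechanical. The following is a sketch of that idea, not an established convention: the tag text, the ticket field, and the function names add_pin, list_pins, and remove_pins are all invented for this example.

```shell
# Sketch: tag emergency hosts-file pins so they can be found and removed later.
# The tag and all function names are hypothetical.
PIN_TAG="# dns-incident-pin"

# add_pin FILE IP NAME TICKET : append a tagged, dated entry
add_pin() {
  local hosts_file="$1" ip="$2" name="$3" ticket="$4"
  printf '%s %s %s ticket=%s added=%s\n' \
    "$ip" "$name" "$PIN_TAG" "$ticket" "$(date -u +%Y-%m-%d)" >> "$hosts_file"
}

# list_pins FILE : show every pin that is still in place
list_pins() {
  grep -F -- "$PIN_TAG" "$1"
}

# remove_pins FILE : drop all tagged entries, leaving other lines untouched
remove_pins() {
  local hosts_file="$1" tmp
  tmp=$(mktemp) || return 1
  grep -vF -- "$PIN_TAG" "$hosts_file" > "$tmp"
  cat "$tmp" > "$hosts_file"   # rewrite in place so ownership and mode are kept
  rm -f "$tmp"
}

# Example use (needs write access to the file):
#   add_pin /etc/hosts 10.1.2.3 api.internal.example.com INC-1234
#   list_pins /etc/hosts
```

Because everything after "#" on a hosts-file line is a comment, the tag does not affect resolution. Running list_pins across the fleet after the incident is one way to close the open loop.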

The reason DNS incidents drag on is rarely the fix itself — it’s that the wrong layer got “fixed” and the team waited out a TTL to learn it didn’t help. Localize first, and the actual change is usually small and obvious.
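
Since the cost of a wrong guess is roughly one TTL, it can help to read that TTL before editing anything. The sketch below pulls the largest TTL out of dig's answer section. The function name max_ttl is an assumption for this example, and it only looks at positive answers: cached NXDOMAIN responses are governed by the zone's SOA record instead, which this does not cover.

```shell
# Sketch: read `dig +noall +answer` output on stdin and print the largest TTL
# in seconds (0 if there are no answer lines). max_ttl is a hypothetical name.
max_ttl() {
  awk '$3 == "IN" && $2 ~ /^[0-9]+$/ { if ($2 + 0 > max) max = $2 + 0 }
       END { print max + 0 }'
}

# Example use:
#   dig +noall +answer api.internal.example.com @ns1.example.com | max_ttl   # configured TTL
#   dig +noall +answer api.internal.example.com @10.0.0.2 | max_ttl          # time left in that cache
```

Asked of an authoritative server, the number is the configured TTL; asked of a caching resolver, it is the time remaining in that one cache.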

Confirm recovery from more than one vantage point

A single successful dig from your laptop is not recovery. Because the failure was often inconsistent across resolvers and regions, verify from several:

for r in 1.1.1.1 8.8.8.8 10.0.0.2; do
  echo "$r:"; dig +short api.internal.example.com "@$r"
done

Only when every relevant resolver and region returns the correct record — and any emergency pins are noted for removal — is the incident actually over.
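
Eyeballing the loop output works for three resolvers but not for thirty. A variant that compares each answer to the expected value and returns a non-zero status on any mismatch is easier to trust. This is a sketch under the same single-valued-record assumption as before; the function name verify_recovery is invented here.

```shell
# Sketch: check that every listed resolver returns the expected value.
# verify_recovery is a hypothetical helper; it only performs lookups.
#   $1    : the name to look up
#   $2    : the expected answer (single value)
#   $3... : resolver addresses to check
# Returns 0 only if every resolver matched.
verify_recovery() {
  local name="$1" expected="$2"; shift 2
  local r answer failed=0
  for r in "$@"; do
    answer=$(dig +short +time=2 +tries=1 "$name" "@$r")
    if [ "$answer" = "$expected" ]; then
      echo "OK    $r"
    else
      echo "FAIL  $r -> ${answer:-<no answer>}"
      failed=1
    fi
  done
  return "$failed"
}

# Example use:
#   verify_recovery api.internal.example.com 10.1.2.3 1.1.1.1 8.8.8.8 10.0.0.2
```

Run it from more than one network location as well; a resolver list checked from a single laptop still reflects only one vantage point.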

Where this fits

DNS sits underneath nearly every service, so a clean DNS triage habit pays off across your whole incident response practice. Pair this with blast-radius mapping so you know what a failing zone takes down, and route live diagnosis through your AI assistant from the incident response dashboard — letting it structure the checklist while you keep your hand on the record.

The lesson that turns “it’s always DNS” from a groan into a fast resolution: stop guessing at records, localize the failing layer with two direct queries, and treat every TTL as a reason to be sure before you edit.