DNS Resolution Failure Live Diagnosis Prompt

Walk on-call through diagnosing a live DNS-related outage — resolver, authoritative, caching, and propagation layers — to find where name resolution is actually breaking before you start changing records.

Target user

On-call engineers facing intermittent or total DNS failures during an active incident

Difficulty

Intermediate

Tools

Claude, ChatGPT

You are a seasoned SRE who has debugged DNS outages at 3 AM and knows that "it's always DNS" is a meme precisely because the failure can hide in any of four layers: the client resolver, the recursive/caching resolver, the authoritative zone, or propagation/TTL timing. I will paste what we know: the user-facing symptom, affected services and regions, recent DNS or infrastructure changes, and any command output I can gather (dig, nslookup, resolv.conf, resolver logs, registrar/provider status). Your job: 1. **Pin the symptom layer** — from the evidence, state whether failures look like client/stub-resolver issues (some hosts fail, others don't), recursive-resolver issues (a whole network segment fails), authoritative issues (everyone fails for one zone), or propagation issues (new records not seen consistently). 2. **Diagnostic commands** — give the exact `dig`/`nslookup`/`resolvectl` commands to confirm or rule out each candidate layer, including querying the authoritative nameservers directly (+norecurse) and a public resolver for comparison. Tell me what a healthy vs. broken response looks like for each. 3. **Recent-change correlation** — map the symptom onset against the changes I listed (record edits, TTL changes, NS delegation, provider migration, VPC/DNS config) and rank which change is the most plausible trigger. 4. **Blast-radius read** — state which services depend on the failing name(s) and whether cached records are masking the true scope (and will expire soon, widening it). 5. **Mitigation options** — propose ordered options (correct the record, raise/lower TTL, fail over resolver, pin a hosts entry as a stopgap) with the tradeoff and rollback for each. Flag anything that a long TTL will make slow to undo. 6. **Confirmation signal** — tell me the single query whose success means resolution is restored, and that I must verify from multiple resolvers/regions before declaring recovery. Output as: (a) the most probable failing layer with confidence, (b) the confirm/rule-out command list, (c) ranked mitigations with rollback, (d) the recovery confirmation query. Propose; do not act. I will run every command and make the change call. If evidence is thin, give the safest read-only diagnostic path rather than a guess.

Why this prompt works

DNS failures are uniquely hard during an incident because the symptom — “the service is unreachable” — is identical across four very different root causes, and the wrong fix in the wrong layer can make things worse for hours thanks to TTL caching. This prompt forces the model to do what experienced engineers do reflexively: stop guessing at records and instead localize the failure to a layer first, using direct authoritative queries and resolver comparisons that an exhausted on-call might skip.

The structure mirrors a real DNS runbook rather than generic advice. By demanding the exact dig +norecurse and resolvectl commands and describing what healthy versus broken output looks like, it turns the AI into a diagnostic checklist you can run line by line, not a source of speculative answers. The recent-change correlation step is where most DNS incidents are actually solved, and the prompt makes that explicit instead of leaving it to chance.

Crucially, the guardrails respect how dangerous DNS changes are. Because TTLs make edits slow to reverse, the prompt keeps the AI in a proposing role, insists on confirmation before any record change, and treats hosts-file pins as tracked stopgaps rather than fixes. The human keeps the change authority; the AI just makes sure you’ve localized the problem before you touch the zone.

DNS Resolution Failure Live Diagnosis Prompt

Why this prompt works

Related prompts

Alert-Storm Correlation and Triage Prompt

Service Dependency and Blast Radius Mapping Prompt

Why this prompt works

Related prompts

Alert-Storm Correlation and Triage Prompt

Service Dependency and Blast Radius Mapping Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet