Diagnosis Accelerator: Verify-First Hypotheses Prompt

Turn the opening burst of telemetry into a short, ranked list of diagnoses — each paired with a single command to confirm or kill it — so the team tests the likeliest cause first and shortens time-to-diagnose.

Target user

On-call SREs and responders mid-incident

Difficulty

Advanced

Tools

Claude, ChatGPT, Cursor

You are a senior SRE who diagnoses incidents by forming a small set of testable hypotheses and ruling them out fast, never by guessing and patching. Help me do that with the telemetry I have in the opening minutes.

Paste the signals you have:

- Alert and SLO context: [WHAT FIRED + THRESHOLD BREACHED]
- Metrics: [ERROR RATE / LATENCY / SATURATION / TRAFFIC, WITH TIMESTAMPS]
- Logs: [RELEVANT LOG LINES / ERROR MESSAGES]
- Traces or topology: [SLOW SPANS / FAILING DEPENDENCY, IF ANY]
- Recent changes: [DEPLOYS / CONFIG / FLAGS IN THE WINDOW]

Produce a verify-first diagnosis plan:

1. **Read the signal shape**: describe what the telemetry pattern looks like (sudden vs. ramp, correlated across services or isolated, latency-then-errors vs. errors-first) and what families of cause that shape is and is not consistent with.
2. **Generate 3-5 distinct hypotheses**: span different layers (bad deploy, dependency failure, resource saturation, config/flag, data/poison-message, external/network). Make them genuinely different, not five flavors of "the deploy".
3. **Rank by likelihood**: order them with an explicit confidence and a one-line rationale tied to the actual signal, not generic priors. Note which hypotheses the current evidence already weakens.
4. **Attach a verification step to each**: for every hypothesis, give the single fastest read-only command or query that would confirm or refute it (a log filter, a metric query, `kubectl describe`, a dependency health check). Order the whole list so the cheapest, most-decisive check comes first.
5. **Define the kill criteria**: for the top hypothesis, state exactly what result would rule it out and which hypothesis to test next.

Output format: a ranked table with the columns hypothesis | confidence | why | verification command | what-confirms / what-refutes. Then a one-line "test this first" recommendation.

Strictly propose and rank only: do not assert a root cause, and every command must be read-only and safe to run against production. The human runs the checks and decides.
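
If the signals come from scripts rather than manual copy-paste, the bracketed placeholders in the prompt can be filled programmatically. The following is a minimal sketch, not part of the prompt itself: `SIGNAL_FIELDS` and `render_signals` are hypothetical names, and the only thing taken from the prompt is the five labels and their placeholder text. A missing signal keeps its bracketed placeholder so the gap stays visible to the model and to the responder.

```python
# Hypothetical helper: the labels and placeholders mirror the prompt's signal list.
# Nothing here belongs to any real tool's API.
SIGNAL_FIELDS = [
    ("Alert and SLO context", "WHAT FIRED + THRESHOLD BREACHED"),
    ("Metrics", "ERROR RATE / LATENCY / SATURATION / TRAFFIC, WITH TIMESTAMPS"),
    ("Logs", "RELEVANT LOG LINES / ERROR MESSAGES"),
    ("Traces or topology", "SLOW SPANS / FAILING DEPENDENCY, IF ANY"),
    ("Recent changes", "DEPLOYS / CONFIG / FLAGS IN THE WINDOW"),
]


def render_signals(signals):
    """Return the bulleted signal block for the prompt.

    `signals` maps a label from SIGNAL_FIELDS to the text collected for it.
    Any label that is missing or blank keeps its bracketed placeholder.
    """
    lines = []
    for label, placeholder in SIGNAL_FIELDS:
        value = signals.get(label, "").strip()
        if not value:
            value = "[" + placeholder + "]"
        lines.append(f"- {label}: {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example value, for illustration only.
    print(render_signals({"Metrics": "5xx rate rose from 0.2% to 14% at 14:02 UTC"}))
```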

Why this prompt works

This is built for the diagnose phase, usually the largest single slice of MTTR. Diagnosis time inflates when responders either fixate on the first plausible cause or thrash across untestable theories. The antidote is a disciplined hypothesis-and-verify loop, and that discipline is exactly what an LLM can scaffold quickly from a pile of telemetry.

The prompt forces hypotheses to span different failure layers so the team doesn’t tunnel on the obvious deploy, and it pairs every hypothesis with a single decisive, read-only check ordered cheapest-first. That ordering is the real time-saver: it turns “what could this be” into “what is the fastest experiment that eliminates the most possibilities,” which is how experienced SREs actually shrink the search space.
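
One way to make the cheapest-first ordering concrete is to score each check by how much it is expected to tell you per second spent. The sketch below is an illustration under stated assumptions, not a method the prompt prescribes: the `Check` fields, the 0 to 1 scales, and the scoring formula (confidence times decisiveness, divided by cost) are all hypothetical choices a team would tune for itself.

```python
from dataclasses import dataclass


@dataclass
class Check:
    hypothesis: str
    confidence: float    # stated likelihood of the hypothesis, 0..1 (assumed scale)
    cost_seconds: float  # rough time to run the check and read the result
    decisiveness: float  # 0..1, how cleanly the result confirms or refutes


def order_checks(checks):
    """Sort checks so cheap, decisive checks on likely hypotheses come first."""

    def value_per_second(check):
        # Assumed heuristic: expected information gained per second spent.
        # Cost is floored at 1 second to avoid division by zero.
        return (check.confidence * check.decisiveness) / max(check.cost_seconds, 1.0)

    return sorted(checks, key=value_per_second, reverse=True)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    plan = order_checks([
        Check("bad deploy", 0.5, 30, 0.9),
        Check("dependency failure", 0.3, 5, 0.8),
        Check("resource saturation", 0.2, 120, 0.5),
    ])
    for check in plan:
        print(check.hypothesis)
```

Note that under this heuristic a less likely hypothesis can be tested before the likeliest one when its check is much cheaper, which is the trade-off the ordering rule is meant to surface.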

The kill-criteria step and the verify-first guardrail are deliberate defenses against anchoring. A ranked list is psychologically sticky, so the prompt requires the model to state what would disprove its top hypothesis and to never assert causation or take action. The human keeps control of every command and every conclusion, getting the speed of structured reasoning without surrendering the skepticism that prevents a confident wrong diagnosis from extending the incident.
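
The read-only rule can also be backed by a mechanical check before a suggested command is shown to a responder. The following is a rough sketch for `kubectl` commands only, assuming a deliberately small allowlist of subcommands that read state (`get`, `describe`, `logs`, `top`, `explain`); the function name and the allowlist are illustrative choices. It is a convenience filter, not a security boundary, and it does not replace the human reading the command before running it.

```python
import shlex

# Assumption: a deliberately small allowlist of kubectl subcommands that only read state.
READ_ONLY_KUBECTL_VERBS = {"get", "describe", "logs", "top", "explain"}

# Characters that allow chaining, piping, redirection or substitution in a shell.
SHELL_METACHARACTERS = ";|&><`$"


def is_read_only_kubectl(command):
    """Return True only for a plain kubectl call whose subcommand is allowlisted.

    The check is intentionally conservative: the subcommand must come directly
    after `kubectl`, so forms such as `kubectl -n prod get pods` are rejected
    rather than parsed, and anything containing shell metacharacters is rejected.
    """
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: treat as unsafe.
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    return tokens[1] in READ_ONLY_KUBECTL_VERBS


if __name__ == "__main__":
    # Hypothetical pod and namespace names, for illustration only.
    for cmd in ("kubectl describe pod api-7d9f -n prod", "kubectl delete pod api-7d9f"):
        print(cmd, "->", is_read_only_kubectl(cmd))
```

Rejecting anything the filter does not recognize, instead of trying to parse every valid form, keeps false approvals unlikely at the cost of some false rejections.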

Related prompts

Have We Seen This Before? Symptom-Match Prompt

Runbook and Next-Step Surfacer Prompt

