First-5-Minutes Triage Prompt

From the alert alone, decide severity, estimate blast radius, and route to the right owner in the opening minutes, so the incident lands with the people who can fix it instead of bouncing between teams. The aim is a shorter time-to-triage.

Target user

On-call SREs and incident commanders making the first call

Difficulty

Intermediate

Tools

Claude, ChatGPT, Cursor

You are a senior incident commander walking an on-call engineer through the first five minutes. Your goal is a fast, defensible triage call: how bad, how wide, and who owns it, without overreacting or under-reacting.

Paste what you have right now:
- The alert(s) that fired: [ALERT PAYLOAD / LABELS / METRICS]
- What customers/users would see: [USER-FACING SYMPTOM, IF KNOWN]
- Service map / ownership info: [SERVICE → TEAM MAPPING, DEPENDENCIES]
- Our severity rubric, if we have one: [SEV DEFINITIONS, OR "use a standard SEV1-4 scale"]

Work through this:
1. **Assign a provisional severity**: map the symptom to the rubric and state the severity with one sentence of justification. If signal is thin, give the most likely severity plus the next-most-likely, and say what evidence would bump it up or down.
2. **Estimate blast radius**: which services, regions, and user segments are plausibly affected based on the labels and dependency map. Distinguish "confirmed affected" from "downstream-at-risk". Give a rough size (single tenant / one region / global) and your confidence.
3. **Route to an owner**: name the team that owns the failing component and the most likely team needed for the fix. If ownership is ambiguous, list the candidates in priority order rather than guessing one.
4. **Decide the response shape**: does this warrant a declared incident and an IC, a single on-call quietly investigating, or a watch-and-wait? Recommend one and say why.
5. **State the next concrete action**: the single most useful thing to do in the next 90 seconds.

Output format: a "TRIAGE CARD" with fields SEVERITY (+confidence), BLAST RADIUS (+confidence), OWNER(S), RESPONSE SHAPE, NEXT ACTION. For the severity and blast-radius calls, attach one read-only verification command or query (e.g. error-rate query, `kubectl get pods -A`, status-page check) that would confirm or refute the estimate.

Rank your hypotheses with explicit confidence; propose and verify, but do not declare the incident, page anyone, or change production yourself. The human makes those calls.
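If you want to capture the model's answer in tooling rather than free text, the TRIAGE CARD fields can be modeled as a small data structure. The following is a minimal sketch and not part of the prompt itself: the class names `Estimate` and `TriageCard`, the three confidence labels, and the example team name are all assumptions made for illustration. Only the field names and the three response shapes come from the prompt above.

```python
from dataclasses import dataclass
from typing import List

# Assumed confidence vocabulary; the prompt only asks for "confidence".
CONFIDENCE_LEVELS = ("low", "medium", "high")

# The three response shapes named in step 4 of the prompt.
RESPONSE_SHAPES = ("declared incident", "single on-call", "watch-and-wait")


@dataclass
class Estimate:
    """A hedged call plus the read-only check that would confirm or refute it."""
    value: str
    confidence: str
    verify_with: str

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
        if not self.verify_with.strip():
            raise ValueError("every estimate needs a verification command or query")


@dataclass
class TriageCard:
    severity: Estimate
    blast_radius: Estimate
    owners: List[str]          # priority order when ownership is ambiguous
    response_shape: str
    next_action: str

    def __post_init__(self) -> None:
        if self.response_shape not in RESPONSE_SHAPES:
            raise ValueError(f"response_shape must be one of {RESPONSE_SHAPES}")
        if not self.owners:
            raise ValueError("at least one candidate owner is required")

    def render(self) -> str:
        """Format the card as plain text for a chat channel or ticket."""
        return "\n".join([
            "TRIAGE CARD",
            f"SEVERITY: {self.severity.value} (confidence: {self.severity.confidence})",
            f"  verify: {self.severity.verify_with}",
            f"BLAST RADIUS: {self.blast_radius.value} (confidence: {self.blast_radius.confidence})",
            f"  verify: {self.blast_radius.verify_with}",
            f"OWNER(S): {', '.join(self.owners)}",
            f"RESPONSE SHAPE: {self.response_shape}",
            f"NEXT ACTION: {self.next_action}",
        ])


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    card = TriageCard(
        severity=Estimate("SEV2", "medium", "kubectl get pods -A"),
        blast_radius=Estimate("one region", "low", "status-page check"),
        owners=["payments-team"],
        response_shape="declared incident",
        next_action="Run the error-rate query",
    )
    print(card.render())
```

Requiring `verify_with` on every estimate mirrors the prompt's rule that severity and blast-radius calls each carry a verification command, so a card without one fails at construction time.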

Why this prompt works

This targets the triage phase, the hinge between acknowledging a page and actually working it. The most expensive triage failures are mis-severity (treating a SEV1 as a SEV3, or vice versa) and mis-routing (the incident bouncing between teams while the clock runs). Both happen because the first responder has to make a call under uncertainty with incomplete signal.

The prompt forces three explicit, separable judgments — severity, blast radius, owner — and demands a confidence level on each so thin signal produces a hedged answer rather than false precision. By asking for the next-most-likely severity and the evidence that would change it, it keeps triage revisable instead of locking in an early wrong call that the team then defends.

The guardrail matters most here: triage is where it is most tempting to over-trust an LLM, because its output reads like an authoritative decision. Routing every severity and blast-radius claim through a verification query, and keeping the human as the one who declares the incident and pages people, lets the AI accelerate the decision without making it. The judgment call stays with the responder, which is what keeps MTTR honest.
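The "read-only" part of the guardrail can also be checked mechanically before a suggested command is shown to the responder. The sketch below is one conservative way to do that for `kubectl` commands only. The function name `is_read_only_kubectl` and the choice of allowlisted verbs are assumptions, and the check is a convenience filter, not a security boundary.

```python
import shlex

# Assumed allowlist: kubectl subcommands that only read cluster state.
READ_ONLY_KUBECTL_VERBS = {"get", "describe", "logs", "top"}

# Characters that could chain, redirect, or substitute commands in a shell.
SHELL_METACHARACTERS = set(";|&><`$\n")


def is_read_only_kubectl(command: str) -> bool:
    """Return True only for a single kubectl command with an allowlisted verb.

    Deliberately conservative: anything unusual (global flags before the
    verb, shell metacharacters, unparsable quoting) is rejected.
    """
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar parse errors.
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    return tokens[1] in READ_ONLY_KUBECTL_VERBS


if __name__ == "__main__":
    for cmd in ("kubectl get pods -A", "kubectl delete pod api-0"):
        print(cmd, "->", is_read_only_kubectl(cmd))
```

A rejected command is not necessarily unsafe; it only means a human should read it before running it, which matches the prompt's rule that the human makes the calls.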

Related prompts

Alert Enrichment: Context on the Page Prompt

Parallel Investigation Planner Prompt

