The First Five Minutes: AI-Assisted Incident Triage

The page wakes me at 2 a.m.: latency-slo-burn on payments. The next five minutes decide the whole incident. Do I declare a SEV1 and wake up four people, or is this a single degraded region I can babysit alone until morning? Who even owns the payments service this quarter after the last reorg? I’ve watched good engineers lose twenty minutes here — not fixing anything, just flailing to figure out how big this is and who should be in the room. That flailing is pure MTTR, and it’s avoidable.

Triage is three questions: how bad, how wide, and whose. None of them require a fix. All of them require reading and correlating a pile of scattered signals fast — which is exactly where AI earns its keep in the MTTR funnel.

Triage is a classification problem, not a fix

The mistake is treating the first five minutes as the start of debugging. It isn’t. It’s classification. You’re not asking “what’s broken and how do I fix it” yet — you’re asking “what’s the right response posture.” Get the posture wrong and everything downstream is wrong: under-call it and you’re solo on a SEV1, over-call it and you’ve burned your team’s trust waking them for a blip.

So the first five minutes have a tight, answerable scope:

Severity: how much customer pain, right now?
Blast radius: one pod, one region, one tenant, or everyone?
Ownership: who’s accountable for this service and who’s the right secondary?

Gather the blast-radius picture fast

Blast radius is the one humans get wrong most under stress, because it takes several queries to scope and we tend to assume the worst. Pull the actual shape of the impact before you decide. A few queries get you most of the way:

# Is the SLO burn global or scoped to a region/zone?
curl -s "http://prom:9090/api/v1/query?query=\
sum%20by(region)(rate(payment_errors_total[5m]))" \
  | jq -r '.data.result[] | "\(.metric.region): \(.value[1])"'

# What fraction of total payment traffic is failing right now?
sum(rate(payment_requests_total{status="error"}[5m]))
  /
sum(rate(payment_requests_total[5m]))

# Which pods are actually unhealthy vs the fleet?
kubectl get pods -n payments \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
  | grep -v Running

Now you have facts: one region, 12% of traffic, three crash-looping pods. That’s a very different incident than “everything is down,” and it’s the difference between a SEV1 and a SEV3.

Let AI assemble the triage card

The signals are scattered across Prometheus, kube state, your ownership registry, and the deploy log. A human assembles them serially and slowly. Hand the raw outputs to a model and ask for a structured triage card — its strength is turning a heap of facts into a clean classification in seconds.

You are triaging a production incident. Given the alert, per-region error breakdown, failing-traffic fraction, unhealthy pod list, the service ownership entry, and recent deploys, produce a triage card with exactly these fields: Severity (SEV1–4) with the one fact justifying it, Blast radius (scope + % of users affected), Primary owner and suggested secondary, and Open questions still unresolved. Classify only from the data given. Do not propose mitigations. If severity is ambiguous, say so and name the missing signal.

The result reads like this:

Severity: SEV2 — 12% of payment traffic failing in a single region, no global impact. Blast radius: us-east-1 only; ~12% of users in that region; other regions nominal. Primary owner: payments-platform (on-call: rotation B). Suggested secondary: networking (region-scoped failure pattern). Open questions: Is this correlated with the 01:52 payments deploy? Region-isolated or load-balancer routing?

That card is the thing I actually needed at 2 a.m. It didn’t fix payments and it didn’t tell me what to do — it told me how big this is, who to wake, and what I still don’t know. I can pressure-test it in fifteen seconds because every claim points at a query I can re-run.

Ownership is half the battle

The “whose” question quietly eats more first-five-minutes time than severity, especially after reorgs. Don’t make the model guess it — feed it your source of truth. Keep ownership in a file the triage step can read:

# Resolve the owning team and current on-call from your registry
yq '.services["payments"]' ownership.yaml

services:
  payments:
    team: payments-platform
    escalation: ["rotation-b", "networking"]
    slack: "#payments-incidents"
    runbooks: "https://wiki/runbooks/payments"

Now the triage card resolves ownership from data instead of from your fuzzy memory of who sat where. When the model names a secondary, it’s picking from your declared escalation list — a suggestion, not an autonomous page. You still decide who actually gets woken up.

The human stays the incident commander

Everything above accelerates the inputs to triage. It does not make the call. The model proposing “SEV2” is a ranked suggestion with its reasoning attached, and your job is to accept, overrule, or push for the missing signal. I’ve overruled the model up to a SEV1 because I knew a downstream batch job was about to amplify the blast radius in a way no metric showed yet. Context the model can’t see is exactly why a human commands the incident.

Three habits keep AI triage trustworthy:

Demand the justifying fact for severity. “SEV2 because 12% of traffic” is auditable. “SEV2” alone is a vibe.
Force open questions. A triage card with no unknowns is overconfident. The unresolved questions become your first investigation threads.
Re-triage as facts change. Severity is a snapshot. If blast radius grows, re-run the card; don’t anchor on the 2 a.m. classification at 2:20.

You can rehearse this on our free incident assistant: paste a real alert and the scoping queries you’d run, and watch it build the triage card. Grab a tuned starting prompt from the prompt library and adapt the severity rubric to your own SLOs.

The first five minutes don’t fix anything — but they decide whether the next fifty go smoothly or sideways. AI compresses the gathering so you spend those minutes deciding, not flailing. The classification is yours. The reading is the machine’s.

Triage is a classification problem, not a fix

Gather the blast-radius picture fast

Let AI assemble the triage card

Ownership is half the battle

The human stays the incident commander

Download the Free 500-Prompt DevOps AI Toolkit