Error-Budget-Aware Severity Calibration With AI

We once ran a full war room (bridge open, six engineers pulled off their work, an exec lurking) for an incident that turned out to be a cosmetic dashboard glitch affecting internal users only. The same quarter, we under-called a quiet latency creep that was burning the checkout SLO budget fast enough to exhaust the month's error budget in two hours, and it sat at SEV4 with one half-attentive responder for forty minutes. Both were severity-calibration failures, and both inflated MTTR: one by over-mobilizing, the other by under-mobilizing. Getting the response size right the first time is its own lever on resolution speed.

The problem is that severity often gets set by how loud the alerts are rather than by how much SLO damage is being done. Those aren’t the same thing, and a model can help translate one into the other — quantifying budget burn so the rubric, not the alarm volume, sets the level.

Severity sets the response size, and both directions hurt

Under-call and the right people don’t get paged, mitigations wait, and a budget-burning event drags. Over-call and you mobilize a crowd for a glitch, then pay for it later in alert fatigue and burnout that make the next page slower. The calibration that matters is grounding severity in SLO impact, which is the same evidence-over-instinct theme the MTTR funnel returns to again and again. Right-sizing the response early is what keeps the whole timeline proportionate.

A model handed your SLO, current burn, and remaining budget can do the math that turns “errors are up” into “the month’s budget is gone in ninety minutes” — which is the framing that makes a severity call defensible.

Ask for budget math mapped to your rubric

The framing keeps the AI advisory and anchored to written criteria.

You are calibrating incident severity from SLO impact, not from alarm volume, and this is advisory: the IC decides. Affected SLO: checkout availability, 99.9% monthly. Current burn: 2% error rate. Remaining budget this month: ~4%. Scope: ~15% of checkout traffic, a revenue path. Here’s our SEV rubric. Quantify the burn (how long until budget exhausts at this rate), map to the rubric with justification, list what would push it up or down a level, recommend a proportionate response, and state the metric change that should trigger re-calibration.

The output is a defensible call:

Burn: at 2% error on a 0.1% budget, you’re burning at ~20x the sustainable rate. At 20x, a full 30-day budget lasts 36 h, so the remaining ~4% exhausts in ~90 min at this rate.
Severity: SEV2 per rubric (revenue path + budget exhaustion < 4h). Not SEV1: no full outage, fallback partially holding.
Swing up: acceleration of burn, fallback failing, scope spreading past checkout.
Swing down: off-peak traffic drop, a fix deploying.
Response: page checkout on-call + DB on-call, open a bridge, no public declaration yet.
Re-evaluate trigger: error rate > 5% or budget-to-exhaustion < 30 min → reassess as SEV1.

That’s a severity you can justify to anyone, grounded in burn rate rather than the fact that a lot of alerts are flashing.
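The rubric mapping in that output can be written down as a small function, which helps keep the proposal tied to written criteria. This is a sketch only: the SEV2 rule (revenue path plus exhaustion under 4 h) and the 30-minute SEV1 threshold come from the example above, while the SEV3 and SEV4 bands and the function name propose_sev are placeholders to replace with your own rubric.

```shell
#!/usr/bin/env bash
# Sketch of a rubric lookup. Thresholds are illustrative, not a standard.
# propose_sev MINUTES_TO_EXHAUSTION REVENUE_PATH(yes|no) FULL_OUTAGE(yes|no)
# MINUTES_TO_EXHAUSTION must be a whole number of minutes.
propose_sev() {
  local minutes="$1" revenue="$2" outage="$3"
  if [ "$outage" = "yes" ] || [ "$minutes" -lt 30 ]; then
    echo "SEV1"   # full outage, or budget gone in under 30 min
  elif [ "$revenue" = "yes" ] && [ "$minutes" -lt 240 ]; then
    echo "SEV2"   # revenue path and budget gone in under 4 h
  elif [ "$minutes" -lt 1440 ]; then
    echo "SEV3"   # placeholder band: budget gone within a day
  else
    echo "SEV4"   # placeholder band: slow burn
  fi
}

# The example incident: ~90 min to exhaustion, revenue path, no full outage.
echo "proposed: $(propose_sev 90 yes no) (advisory; the IC decides)"
# prints: proposed: SEV2 (advisory; the IC decides)
```

The function only proposes a level from the numbers it is given; the swing factors (fallback health, scope spreading) still need a human read.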

Translate the symptom into budget terms

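# Current error ratio for checkout (read-only).
# Divide by the error budget (0.001 for a 99.9% SLO) to get the burn rate.
# -G with --data-urlencode sends the PromQL as an encoded query string, so
# curl does not try to glob-expand the {} and [] in the expression.
curl -sG "http://prom:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="checkout"}[5m]))' \
  | jq -r '.data.result[].value[1]'

# Remaining error budget for the window (if you track it as a recording rule)
curl -sG "http://prom:9090/api/v1/query" \
  --data-urlencode 'query=slo:error_budget_remaining:ratio{service="checkout"}' \
  | jq -r '.data.result[].value[1]'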

The burn-rate-to-exhaustion translation is the move that separates loud-but-harmless from quiet-but-dangerous. A 2% error rate sounds modest until you frame it as 20x budget burn with ninety minutes to exhaustion. Conversely, a flood of alerts on a path with a working fallback might warrant less urgency than the alert volume suggests at first glance.
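The translation itself comes down to two divisions. Here is a minimal sketch, assuming a 30-day window and steady traffic; the helper names burn_multiple and minutes_to_exhaustion are illustrative and not part of any tool mentioned here. Under those assumptions a 20x burn spends a full month's budget in 36 hours, so roughly 4% of budget remaining works out to about ninety minutes.

```shell
#!/usr/bin/env bash
# Sketch: error ratio + remaining budget -> burn multiple and time to exhaustion.
# Assumes a 30-day SLO window and constant traffic; adjust for your own SLO.

# burn_multiple ERROR_RATIO SLO_TARGET
# How many times faster than "sustainable" the budget is being spent.
burn_multiple() {
  LC_ALL=C awk -v err="$1" -v slo="$2" \
    'BEGIN { printf "%.1f\n", err / (1 - slo) }'
}

# minutes_to_exhaustion REMAINING_FRACTION BURN_MULTIPLE [WINDOW_DAYS]
# A 1x burn spends exactly the whole budget over the window, so the
# remaining share lasts (remaining * window) / burn.
minutes_to_exhaustion() {
  LC_ALL=C awk -v rem="$1" -v burn="$2" -v days="${3:-30}" 'BEGIN {
    if (burn + 0 <= 0) { print "inf"; exit }   # no errors, no exhaustion
    printf "%.0f\n", rem * days * 24 * 60 / burn
  }'
}

burn=$(burn_multiple 0.02 0.999)   # 2% errors against a 99.9% SLO
echo "burn multiple: ${burn}x"                                       # 20.0x
echo "minutes to exhaustion: $(minutes_to_exhaustion 0.04 "$burn")"  # 86
```

In practice the two inputs would be the values returned by the error-ratio and remaining-budget queries, checked for freshness before anyone acts on the result.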

Advisory only — the IC owns the call

Severity decides who gets woken at 3 a.m. and how hard the team mobilizes, so the model proposes and the IC decides. An AI that under-calls a budget-burner delays the response; one that over-calls trains the team to ignore pages. Neither outcome should come from a tool.

Rules I hold to:

Sanity-check the SLO inputs first. A stale remaining-budget number or an SLI that misses real user pain produces a confidently wrong level. Garbage in, wrong severity out.
Keep it proposing, not deciding. The burn math is input to the IC’s judgment about who to page, never a substitute for it.
Honor the re-evaluate trigger. Severity isn’t set once; the metric threshold that should bump it up or down belongs in the channel.
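The first and third rules are easy to turn into mechanical checks. This is a rough sketch under stated assumptions: the 5% error rate and 30-minute thresholds are the ones from the example call above, the function names valid_ratio and needs_recalibration are hypothetical, and the input check is deliberately conservative (plain decimals between 0 and 1 only).

```shell
#!/usr/bin/env bash
# Sketch: guard the inputs, then test the re-evaluate trigger.

# valid_ratio VALUE -> exit 0 if VALUE is a plain decimal in [0, 1].
# Rejects empty strings and "null", which is what a query with no
# matching series tends to leave you with.
valid_ratio() {
  LC_ALL=C awk -v v="$1" \
    'BEGIN { exit !(v ~ /^[0-9]*\.?[0-9]+$/ && v + 0 >= 0 && v + 0 <= 1) }'
}

# needs_recalibration ERROR_RATIO MINUTES_TO_EXHAUSTION
# Exit 0 when either trigger from the example call has fired.
needs_recalibration() {
  LC_ALL=C awk -v err="$1" -v mins="$2" \
    'BEGIN { exit !(err + 0 > 0.05 || mins + 0 < 30) }'
}

err=0.02; mins=86   # values from the worked example
if ! valid_ratio "$err"; then
  echo "error ratio looks wrong; do not calibrate from it"
elif needs_recalibration "$err" "$mins"; then
  echo "trigger fired: ask the IC to reassess severity"
else
  echo "hold current severity; keep watching"
fi
```

Posting the result in the incident channel, rather than acting on it automatically, keeps the decision with the IC.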

You can practice this on the free incident assistant — paste an SLO and current burn and ask for the budget-grounded severity calibration, then notice how the burn-to-exhaustion framing changes the call. The prompt library has a hardened severity-calibration prompt with the advisory-only guardrail built in.

Mis-set severity inflates MTTR by sizing the response wrong in either direction, and the fix is grounding the call in SLO impact instead of alarm volume. AI does the burn-rate math that makes severity defensible — and as long as it stays advisory and the IC owns who gets paged, the team mobilizes proportionately and resolves at the right pace from the first minute.