Incident Severity Classification: A Practical SEV1-to-SEV4 Guide
Severity levels decide who wakes up and how fast you move. Here's a clear, real-world rubric for SEV1-SEV4, common mistakes, and how AI helps classify under pressure.
Severity is the first decision in any incident, and it’s the one teams get wrong most often. Call it too low and the right people stay asleep while customers suffer. Call it too high and you burn out your on-call by waking ten people for a flaky dashboard. After 25 years of running on-call, I’ve learned that a clear, written severity rubric is one of the highest-leverage documents a team can own.
Severity is about impact, not about how scary it feels
The most common mistake is classifying by how alarming the symptom looks instead of by customer impact. A wall of red logs from a non-critical batch job is scary and low-severity. A quietly elevated payment error rate is boring-looking and a SEV1.
Anchor your rubric on two axes: how many users are affected and how badly. Everything else is secondary.
A rubric you can actually use
Adapt the thresholds to your business, but keep the shape:
SEV1 — Critical
Core functionality down or severely degraded for a large share of users; or any data loss, security breach, or financial-integrity issue. Examples: checkout failing, login down, customer data exposed. Response: all-hands, incident commander, wake whoever’s needed, customer comms within minutes.
SEV2 — Major
Significant degradation or a key feature down, but with a workaround or limited scope. Examples: search broken but browse works; one region degraded. Response: on-call plus the owning team engaged immediately, an incident commander (IC) if it drags on, proactive status-page update.
SEV3 — Minor
Limited impact, easy workaround, non-core feature. Examples: a secondary report failing, latency elevated but still within SLO. Response: handled in business hours by the owning team, tracked, no middle-of-the-night page.
SEV4 — Low
Cosmetic or near-zero customer impact. Examples: a typo in a UI string, a noisy non-actionable alert. Response: ticket, normal backlog.
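To make the shape of the rubric concrete, here is a minimal Python sketch that maps the two axes (how many users, how badly) to a level. The function name, parameter names, and both thresholds (`LARGE_SHARE`, `NEGLIGIBLE_SHARE`) are assumptions for illustration, not values from any standard; the rule for non-core features without a workaround is also a guess you should adjust to your own business.

```python
from enum import IntEnum

# Hypothetical thresholds; tune them to your own business.
LARGE_SHARE = 0.25        # what counts as "a large share of users"
NEGLIGIBLE_SHARE = 0.001  # what counts as "near-zero customer impact"


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def classify(
    share_of_users_affected: float,
    core_functionality: bool,
    workaround_available: bool,
    data_security_or_money_at_risk: bool = False,
) -> Severity:
    """Map impact (how many users, how badly) to a severity level."""
    # Data loss, security breach, or financial-integrity issue: always SEV1.
    if data_security_or_money_at_risk:
        return Severity.SEV1
    # Cosmetic or near-zero customer impact.
    if share_of_users_affected < NEGLIGIBLE_SHARE:
        return Severity.SEV4
    if core_functionality:
        # Core down for a large share of users with no workaround.
        if share_of_users_affected >= LARGE_SHARE and not workaround_available:
            return Severity.SEV1
        # Core degraded, but limited in scope or with a workaround.
        return Severity.SEV2
    # Non-core feature: minor if there is a workaround (assumed rule).
    return Severity.SEV3 if workaround_available else Severity.SEV2


if __name__ == "__main__":
    # Checkout failing for most users, no workaround.
    print(classify(0.6, core_functionality=True, workaround_available=False).name)
```

Note that nothing in the inputs describes how alarming the symptom looks; only impact goes in.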
Make the “wake someone up” line explicit
The most important thing a rubric does is encode when you page humans at night. Write it as a hard line: SEV1 and SEV2 page immediately, day or night. SEV3 and SEV4 wait for business hours. When that’s written down and agreed in advance, the 2 AM responder doesn’t have to make a judgment call alone — and nobody resents being woken because the rules were clear.
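The hard line is simple enough to write down as code. The sketch below is illustrative only: the `route` function and its return values (`"page"`, `"queue"`) are hypothetical names, not part of any paging tool. Time of day is deliberately not an input, because the rule is the same day or night.

```python
PAGE_NOW = frozenset({"SEV1", "SEV2"})        # page immediately, day or night
BUSINESS_HOURS = frozenset({"SEV3", "SEV4"})  # wait for business hours


def route(severity: str) -> str:
    """Return "page" or "queue" for a declared severity level."""
    if severity in PAGE_NOW:
        return "page"
    if severity in BUSINESS_HOURS:
        return "queue"
    # An unknown label should fail loudly rather than silently not page.
    raise ValueError(f"unknown severity: {severity!r}")
```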
Bias toward declaring high, then downgrade
Severity isn’t a one-time stamp. Declare based on current information and adjust as you learn. The healthy default: when genuinely unsure between two levels, pick the higher one and downgrade once you confirm the impact is smaller. It’s cheaper to stand down extra responders than to lose fifteen minutes because you under-called it.
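The "pick the higher one" default can be expressed in one line. This is a small sketch under the assumption that levels are plain strings; `declare_when_unsure` is a hypothetical helper name.

```python
SEVERITY_ORDER = ["SEV1", "SEV2", "SEV3", "SEV4"]  # most severe first


def declare_when_unsure(candidates: list[str]) -> str:
    """Given the levels you are torn between, return the most severe one."""
    # The lowest index in SEVERITY_ORDER is the most severe level.
    return min(candidates, key=SEVERITY_ORDER.index)


if __name__ == "__main__":
    print(declare_when_unsure(["SEV3", "SEV2"]))
```

Downgrading later is a separate, explicit step once the impact is confirmed to be smaller.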
Track both the initial and final severity in your records. A pattern of “declared SEV3, resolved as SEV1” tells you your detection or triage is missing signal.
Common failure modes
- Severity inflation. Everything becomes a SEV1 because nobody wants to be the one who under-called it. The fix is a clear rubric plus a blameless culture, so under-calling isn’t punished.
- Severity by seniority. A VP joins the channel and the SEV magically rises. Severity is set by impact, not by who’s watching.
- Sticky severity. It got declared SEV1 at the start and nobody ever downgrades, so the postmortem stats are useless. Re-evaluate as facts change.
- No owner for the decision. Decide in advance who can set and change severity — usually the incident commander or first responder.
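Two of these failure modes, sticky severity and a missing decision owner, are easier to avoid when every severity change is recorded along with who made it and why. The following is a rough sketch, not a reference implementation: the `Incident` and `SeverityChange` classes and the role names in `ALLOWED_ROLES` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

LEVELS = ("SEV1", "SEV2", "SEV3", "SEV4")
# Hypothetical role names: decide in advance who may set or change severity.
ALLOWED_ROLES = {"incident_commander", "first_responder"}


@dataclass
class SeverityChange:
    level: str
    changed_by: str
    role: str
    reason: str
    at: datetime


@dataclass
class Incident:
    title: str
    history: list[SeverityChange] = field(default_factory=list)

    def set_severity(self, level: str, changed_by: str, role: str, reason: str) -> None:
        """Record a severity declaration or change, with its owner and reason."""
        if role not in ALLOWED_ROLES:
            # Severity is set by impact, not by who is watching.
            raise PermissionError(f"role {role!r} may not set severity")
        if level not in LEVELS:
            raise ValueError(f"unknown severity: {level!r}")
        self.history.append(
            SeverityChange(level, changed_by, role, reason, datetime.now(timezone.utc))
        )

    @property
    def initial(self) -> Optional[str]:
        return self.history[0].level if self.history else None

    @property
    def current(self) -> Optional[str]:
        return self.history[-1].level if self.history else None
```

Requiring a `reason` on every change makes re-evaluation cheap to audit, and keeping the full history preserves both the initial and the final level for later review.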
Where AI helps classify faster
Under pressure, mapping a messy symptom to a severity level eats time. AI is a useful second opinion here — not to make the call, but to reason through it quickly.
Paste the firing alerts, affected components, and any user-impact signal and ask:
“Given these alerts and this user-impact data, which severity (SEV1-SEV4) does this map to under a rubric where SEV1 is core-down for many users and SEV2 is major-with-workaround? Explain the reasoning and list what additional signal would change the classification.”
That last clause is the valuable part — it tells you what to go measure to confirm. The human still owns the decision; the model just structures it.
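If you use this prompt often, it helps to assemble it the same way every time. The sketch below only builds the text; it does not call any model or API. The function name `build_triage_prompt` and the section titles are assumptions, and the question is the one quoted above.

```python
QUESTION = (
    "Given these alerts and this user-impact data, which severity (SEV1-SEV4) "
    "does this map to under a rubric where SEV1 is core-down for many users "
    "and SEV2 is major-with-workaround? Explain the reasoning and list what "
    "additional signal would change the classification."
)


def _section(title: str, items: list[str]) -> str:
    """Format one labeled bullet list, marking empty inputs explicitly."""
    lines = [f"- {item}" for item in items] or ["- (none provided)"]
    return "\n".join([f"{title}:", *lines])


def build_triage_prompt(
    alerts: list[str], components: list[str], impact_signals: list[str]
) -> str:
    """Combine pasted incident context with the classification question."""
    return "\n\n".join(
        [
            _section("Firing alerts", alerts),
            _section("Affected components", components),
            _section("User-impact signals", impact_signals),
            QUESTION,
        ]
    )


if __name__ == "__main__":
    print(build_triage_prompt(["payment-api 5xx > 2%"], ["payment-api"], []))
```

Marking an empty section as "(none provided)" is a deliberate choice: it makes a missing user-impact signal visible to both the model and the responder.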
We keep incident-response prompts for fast triage, and the Incident Response tool turns raw symptoms into a structured assessment you can use to set severity with confidence.
Tie severity to a defined response, not just a label
A severity level is only useful if it triggers a known response. The number itself does nothing; what matters is that “SEV1” automatically means these people page, this comms cadence starts, and this much authority is on the call. Write the response next to each level in your rubric so declaring a severity is the same act as kicking off the right machinery. Teams that treat severity as a label they argue about — rather than a switch that starts a defined process — lose the time the label was supposed to save.
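One way to make severity a switch rather than a label is to store the response next to each level, so declaring a level returns the plan to execute. This is a sketch under stated assumptions: `ResponsePlan`, `RESPONSES`, and `declare` are hypothetical names, and the plan contents simply restate the rubric earlier in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    page_immediately: bool
    who: tuple[str, ...]
    comms: str


RESPONSES = {
    "SEV1": ResponsePlan(
        True,
        ("on-call", "incident commander", "whoever is needed"),
        "customer comms within minutes",
    ),
    "SEV2": ResponsePlan(
        True, ("on-call", "owning team"), "proactive status-page update"
    ),
    "SEV3": ResponsePlan(
        False, ("owning team",), "tracked, handled in business hours"
    ),
    "SEV4": ResponsePlan(False, (), "ticket, normal backlog"),
}


def declare(severity: str) -> ResponsePlan:
    """Declaring a severity returns the response it triggers."""
    try:
        return RESPONSES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Because the table is data, it can be reviewed and versioned like the written rubric it mirrors.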
It also helps to define a few concrete example incidents per level from your own history. “Remember the checkout outage in March? That’s a SEV1” anchors the abstract rubric to something everyone lived through, and it settles the borderline cases far faster than re-reading the definitions at 2 AM.
Review severity calls in the postmortem
Every postmortem should briefly note whether the severity was called correctly and adjusted appropriately. Over time these notes reveal systemic bias: if you routinely start incidents one level too low and bump them up fifteen minutes in, your detection or triage is missing early signal, and that’s a fixable problem. If everything inflates to SEV1, your culture is punishing under-calls and you need to make the blameless contract real. Severity data, aggregated, is one of the cheapest health checks you have on the whole incident program.
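Aggregating those notes takes very little code. The sketch below assumes each postmortem yields an (initial, final) severity pair; the function name `calibration_summary` and the output keys are made up for illustration.

```python
from collections import Counter

RANK = {"SEV1": 1, "SEV2": 2, "SEV3": 3, "SEV4": 4}  # lower rank = more severe


def calibration_summary(records: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize (initial, final) severity pairs as shares of all incidents."""
    if not records:
        return {}
    counts: Counter = Counter()
    for initial, final in records:
        if RANK[final] < RANK[initial]:
            counts["under_called"] += 1  # ended more severe than declared
        elif RANK[final] > RANK[initial]:
            counts["over_called"] += 1   # ended less severe than declared
        else:
            counts["correct"] += 1
        if initial == "SEV1":
            counts["declared_sev1"] += 1  # watch this for inflation
    total = len(records)
    keys = ("under_called", "over_called", "correct", "declared_sev1")
    return {key: counts[key] / total for key in keys}


if __name__ == "__main__":
    history = [("SEV3", "SEV1"), ("SEV2", "SEV2"), ("SEV1", "SEV2"), ("SEV3", "SEV2")]
    print(calibration_summary(history))
```

A high `under_called` share points at detection or triage missing early signal; a high `declared_sev1` share combined with a high `over_called` share suggests inflation.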
Put it on one page
Your severity rubric should fit on a single page that any responder can open in seconds during an incident. Define each level by impact, state the paging rules, and name who owns the call. That one page prevents more 2 AM arguments than any tool ever will.
AI severity suggestions are advisory. The on-call engineer or incident commander owns the final classification.