Designing an Incident Severity Matrix: Impact vs Urgency

Most teams start with a flat severity list: SEV1 is “everything’s on fire,” SEV4 is “annoying but fine.” It works right up until two engineers look at the same incident and pick different numbers — one anchoring on how bad the impact is, the other on how fast it’s moving. Then you’re arguing about labels while the clock runs.

The fix isn’t a longer list. It’s a second axis. After a decade of being the person who gets paged when the severity is “unclear,” I’ve come to believe a severity matrix — impact crossed with urgency — is the single highest-leverage piece of incident process you can write down.

Why one axis isn’t enough

Severity is doing two jobs at once, and they pull in different directions.

Impact answers “how bad is the blast radius right now?” — how many customers, how much revenue, what data.
Urgency answers “how fast is it getting worse, and how soon must we act?”

A slow data-corruption bug affecting a handful of accounts is low urgency but catastrophic impact. A login outage that’s actively spreading is high urgency and high impact. A single flat number forces you to average those, and averaging loses the signal you actually need to decide who to wake up.

The two-axis grid

Define impact and urgency each on three levels, then map the 3x3 grid to your existing SEV labels.

	Urgency: Low	Urgency: Medium	Urgency: High
Impact: High	SEV2	SEV1	SEV1
Impact: Medium	SEV3	SEV2	SEV2
Impact: Low	SEV4	SEV3	SEV3

The grid does the arguing for you. Instead of “is this a SEV1 or SEV2?” the conversation becomes two smaller, more answerable questions: how much is broken, and how fast is it moving. Those have crisper answers.
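If incidents are declared through a bot or a form, the grid can live in code as a plain lookup table. The sketch below is one minimal way to do that, not a prescribed implementation. The names Level, SEVERITY_GRID, and lookup_severity are made up for illustration; the cell values are copied from the grid above.

```python
from enum import IntEnum


class Level(IntEnum):
    """Three-step scale shared by both axes (illustrative name)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# (impact, urgency) -> SEV label, copied cell by cell from the grid.
SEVERITY_GRID = {
    (Level.HIGH, Level.LOW): "SEV2",
    (Level.HIGH, Level.MEDIUM): "SEV1",
    (Level.HIGH, Level.HIGH): "SEV1",
    (Level.MEDIUM, Level.LOW): "SEV3",
    (Level.MEDIUM, Level.MEDIUM): "SEV2",
    (Level.MEDIUM, Level.HIGH): "SEV2",
    (Level.LOW, Level.LOW): "SEV4",
    (Level.LOW, Level.MEDIUM): "SEV3",
    (Level.LOW, Level.HIGH): "SEV3",
}


def lookup_severity(impact: Level, urgency: Level) -> str:
    """Return the SEV label for one grid cell."""
    return SEVERITY_GRID[(impact, urgency)]
```

Keeping the mapping as data rather than nested if-statements means a remapped cell is a one-line change that is easy to review.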

Writing impact levels that don’t require judgment

The most common failure is impact criteria that sound objective but aren’t. “Major customer impact” means nothing at 3am. Anchor each level to things you can actually observe.

High impact — a core user journey (checkout, login, the primary write path) is unavailable or degraded for a meaningful share of users; or any confirmed data loss, corruption, or security exposure.
Medium impact — a non-core feature is down, or a core journey is degraded for a small or single-region slice; clear customer-visible symptoms but a workaround exists.
Low impact — internal tooling, cosmetic issues, or degradation invisible to customers; elevated error rates without a user-facing symptom.

Tie these to real signals where you can: a specific SLO burning, a named dashboard panel, a count of affected tenants. The goal is that two tired engineers reading the same evidence land on the same row.
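One way to test whether your impact criteria really are judgment-free is to try writing them as a function of observable inputs. The sketch below does that for the three levels above. Everything in it is an assumption made for illustration: the field names, and especially the 5% cutoff standing in for “a meaningful share of users,” which you would replace with a number calibrated to your own customer base.

```python
from dataclasses import dataclass


@dataclass
class ImpactSignals:
    """Observable inputs; field names are hypothetical."""
    core_journey_affected: bool    # checkout, login, the primary write path
    affected_user_share: float     # 0.0 to 1.0, share of users seeing the problem
    data_or_security_event: bool   # confirmed loss, corruption, or exposure
    customer_visible: bool         # any user-facing symptom at all


# Hypothetical stand-in for "a meaningful share of users".
MEANINGFUL_SHARE = 0.05


def classify_impact(s: ImpactSignals) -> str:
    # Confirmed data loss, corruption, or exposure is High regardless of size.
    if s.data_or_security_event:
        return "High"
    # A core journey broken for a meaningful share of users is High.
    if s.core_journey_affected and s.affected_user_share >= MEANINGFUL_SHARE:
        return "High"
    # Anything else customers can see is Medium.
    if s.customer_visible:
        return "Medium"
    # Internal tooling, cosmetic issues, invisible degradation.
    return "Low"
```

If you can’t fill in a field from a dashboard or an alert, that’s a sign the criterion behind it still depends on judgment.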

Writing urgency levels

Urgency is about trajectory and time-to-harm, not current size.

High urgency — actively worsening, or a hard deadline is approaching (cert expiry, disk filling, a queue backing up toward a cliff). Minutes matter.
Medium urgency — stable but unresolved; will cause real harm within hours if untouched.
Low urgency — contained and not spreading; can wait for business hours.

The trap is treating high impact as automatically high urgency. A region that’s already fully down and stable is high impact, but it is no longer getting worse: you’ve absorbed the hit and now you’re recovering. That nuance changes who you page and how hard.
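Urgency can be sketched the same way, from two inputs: whether things are still getting worse, and a rough estimate of time until real harm. The function name, parameter names, and both hour cutoffs below are assumptions; the definitions above only say “minutes matter” and “within hours,” so pick numbers that fit your own systems.

```python
from typing import Optional

# Hypothetical cutoffs; the text gives only "minutes" and "hours".
HIGH_CUTOFF_HOURS = 1.0
MEDIUM_CUTOFF_HOURS = 12.0


def classify_urgency(worsening: bool, hours_to_harm: Optional[float]) -> str:
    """Classify urgency from trajectory and time-to-harm.

    worsening: are the key signals still trending the wrong way?
    hours_to_harm: estimated hours until real harm, or None if contained.
    """
    if worsening:
        return "High"
    if hours_to_harm is None:
        return "Low"  # contained and not spreading
    if hours_to_harm < HIGH_CUTOFF_HOURS:
        return "High"  # a hard deadline is close
    if hours_to_harm <= MEDIUM_CUTOFF_HOURS:
        return "Medium"  # stable, but harmful if left untouched
    return "Low"
```

Note that current size appears nowhere in the inputs, which is the point: impact is scored separately.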

Wiring the matrix to action

A severity label is useless unless it deterministically triggers behavior. For each SEV, predefine:

Response time — how fast someone must acknowledge (SEV1: page immediately, 5-minute ack; SEV3: acknowledge during the next business hours).
Roles activated — does this pull in an incident commander, a comms lead, an executive bridge?
Comms cadence — SEV1 might mean status-page updates every 30 minutes; SEV3 needs none.
Escalation path — who gets pulled in if the first responder is stuck.

Without this wiring, people lowball severity to avoid the heavyweight process — which is exactly backwards. Make the process proportional and people will classify honestly.
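To make “deterministically triggers behavior” concrete, here is a sketch of the wiring as a policy table. Only a few values come from this post: the SEV1 5-minute ack, the 30-minute SEV1 comms cadence, SEV3 waiting for business hours with no comms, and the SEV2 cadence and role used in the worked example later on. Everything else (the SEV2 ack time, the SEV4 row, the role and escalation names) is a placeholder I made up so the example runs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ResponsePolicy:
    ack_minutes: Optional[int]             # None means "during business hours"
    roles: Tuple[str, ...]                 # roles activated at declaration
    comms_interval_minutes: Optional[int]  # None means no scheduled updates
    escalate_to: str                       # who joins if the responder is stuck


POLICIES = {
    # Ack time and cadence from the text; role names are placeholders.
    "SEV1": ResponsePolicy(5, ("incident_commander", "comms_lead", "exec_bridge"),
                           30, "engineering_director"),
    # The 15-minute ack is an assumption.
    "SEV2": ResponsePolicy(15, ("incident_commander",), 30, "secondary_on_call"),
    "SEV3": ResponsePolicy(None, (), None, "team_lead"),
    # The whole SEV4 row is an assumption.
    "SEV4": ResponsePolicy(None, (), None, "team_lead"),
}


def policy_for(sev: str) -> ResponsePolicy:
    """Fail loudly on an unknown label so every declaration maps to a response."""
    if sev not in POLICIES:
        raise ValueError(f"No response policy defined for {sev!r}")
    return POLICIES[sev]
```

The lookup raises on an unknown label on purpose: a severity with no defined response is the gap you want to find in review, not at 3am.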

Handling the edge cases up front

Two rules prevent most matrix disputes:

Round up when uncertain. If you can’t decide between two cells, take the higher severity. You can always downgrade once you have more information, and downgrading is cheap. Underreacting to a real SEV1 is not.
Severity is a snapshot, not a verdict. Re-evaluate as the incident evolves. A SEV3 that starts spreading moves up the grid: Medium impact with rising urgency becomes a SEV2, and a SEV1 if the impact also grows to High. Announce the change explicitly so everyone re-syncs on the new cadence and roles.
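Both rules are small enough to encode directly. The sketch below is one possible shape, with made-up function names: round_up picks the more severe of two candidate labels, and reevaluate returns an announcement string whenever the label changes so the change can’t happen silently.

```python
from typing import Optional, Tuple

# Ordered from least to most severe.
SEV_ORDER = ["SEV4", "SEV3", "SEV2", "SEV1"]


def round_up(candidate_a: str, candidate_b: str) -> str:
    """When torn between two labels, take the more severe one."""
    return max(candidate_a, candidate_b, key=SEV_ORDER.index)


def reevaluate(current: str, new: str) -> Tuple[str, Optional[str]]:
    """Return (severity, announcement); announcement is None if unchanged."""
    if new == current:
        return current, None
    if SEV_ORDER.index(new) > SEV_ORDER.index(current):
        direction = "upgraded"
    else:
        direction = "downgraded"
    message = f"Severity {direction} from {current} to {new}: re-sync cadence and roles."
    return new, message
```

In a real bot the announcement would be posted to the incident channel; here it is just returned so the caller decides where it goes.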

A worked example

Checkout p99 latency doubles at 02:14, error rate climbs slowly, one region affected.

Impact: core journey degraded, single region, workaround (retry) exists → Medium.
Urgency: error rate is climbing, not stable → High.
Grid lookup: Medium impact x High urgency → SEV2.

That tells you exactly what to do next: activate the on-call IC, start a 30-minute comms cadence, and don’t wake the VP yet. No debate, no averaging — just two reads and a table.

Roll it out

Put the matrix where people declare incidents — the bot command, the runbook header, the incident template. Practice it in your next gameday so the grid lookup is muscle memory before it’s needed under pressure. And revisit the cell-to-SEV mapping quarterly; if every SEV2 keeps getting re-declared as SEV1, your thresholds are off.
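For the quarterly review, a rough signal is how often each initial label ends up being changed. The sketch below assumes you can export incidents as (initial severity, final severity) pairs; that export format and the function name are my assumptions, not something your tooling necessarily provides.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def redeclaration_rates(incidents: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Share of incidents per initial SEV that ended at a different SEV."""
    totals: Counter = Counter()
    changed: Counter = Counter()
    for initial, final in incidents:
        totals[initial] += 1
        if final != initial:
            changed[initial] += 1
    return {sev: changed[sev] / totals[sev] for sev in totals}
```

A persistently high rate for one label suggests the rows or columns feeding that cell need sharper criteria, or that the cell is mapped to the wrong SEV.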

We keep severity-matrix and classification prompts in our incident-response toolkit, and if you want a model to suggest a starting severity from your symptoms, that’s built into the AI Incident Response Assistant — though the final call always stays with the human in the room.

Severity suggestions are assistive, not authoritative. Calibrate any matrix against your own systems and customer base before relying on it during a live incident.