Incident Metrics That Matter: MTTA, MTTR, and MTBF

Every reliability dashboard I’ve inherited has had the same problem: a dozen incident metrics, beautifully charted, that nobody has ever used to make a decision. The metrics exist to be reported, not to be acted on. They go up, they go down, somebody screenshots them for a slide, and the team’s actual reliability is unchanged.

Incident metrics are only worth measuring if they change what you do. After years of building these programs, here’s the short list that earns its place — what each one actually tells you, how to measure it without lying to yourself, and the traps that make the numbers meaningless.

The core trio

Three metrics form the backbone, and each maps to a distinct phase of an incident’s life:

MTTA — Mean Time To Acknowledge. How long from the alert firing to a human owning it. This measures your paging and on-call health, nothing else.
MTTR — Mean Time To Recover/Resolve. How long from detection to service restored. This measures your response capability.
MTBF — Mean Time Between Failures. How long, on average, between incidents. This measures your underlying reliability.

They answer different questions, so don’t average them into one health score. A team with great MTTR and terrible MTBF is firefighting heroically while the house keeps catching fire — and a single blended number would hide that.
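To make the three definitions concrete, here is a minimal Python sketch that computes each one from a list of incident records. The Incident fields (alerted_at, acked_at, resolved_at) are hypothetical names rather than any tool's schema. MTBF is taken here as the mean gap between one incident's recovery and the next incident's alert, which is one common convention among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; map them to whatever your incident tool exports.
    alerted_at: datetime   # alert fired (treated here as detection)
    acked_at: datetime     # a human acknowledged the page
    resolved_at: datetime  # service restored


def mtta_minutes(incidents):
    """Mean minutes from alert firing to acknowledgement."""
    return mean((i.acked_at - i.alerted_at).total_seconds() for i in incidents) / 60


def mttr_minutes(incidents):
    """Mean minutes from alert firing to service restored."""
    return mean((i.resolved_at - i.alerted_at).total_seconds() for i in incidents) / 60


def mtbf_hours(incidents):
    """Mean hours between one incident's recovery and the next incident's alert.

    Assumption: incidents do not overlap. Needs at least two incidents.
    """
    ordered = sorted(incidents, key=lambda i: i.alerted_at)
    gaps = [
        (nxt.alerted_at - prev.resolved_at).total_seconds()
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return mean(gaps) / 3600


t0 = datetime(2024, 1, 1)
incidents = [
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=30)),
    Incident(t0 + timedelta(hours=10), t0 + timedelta(hours=10, minutes=4),
             t0 + timedelta(hours=11)),
    Incident(t0 + timedelta(hours=30), t0 + timedelta(hours=30, minutes=6),
             t0 + timedelta(hours=30, minutes=30)),
]

print(mtta_minutes(incidents))  # 4.0
print(mttr_minutes(incidents))  # 40.0
print(mtbf_hours(incidents))    # 14.25
```

The three functions stay separate on purpose: each returns its own number in its own unit, so there is nothing to blend into a single score.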

MTTA: are pages reaching awake humans?

MTTA is the cleanest metric because it has the fewest confounders. A high or volatile MTTA almost always points at a paging problem, not a people problem:

Escalation timeouts that are too long, so the second responder only gets the page well after the first has slept through it.
Schedule gaps routing pages into the void.
Alert fatigue, where the on-call has muted a noisy channel and misses the real one.

If MTTA is bad, fix the pipes — escalation policies, schedules, notification channels — before you ask people to “try harder.” Target single-digit minutes for high-severity pages.

MTTR: the metric everyone games

MTTR is the headline number and the easiest to fool yourself with. The biggest trap: MTTR is an average, and incident durations are wildly skewed. One six-hour outage among fifty blips of four minutes each pulls the mean to about 11 minutes, nearly triple the typical incident. A mean is the wrong summary for skewed data.

Report distributions, not just the mean:

Median (p50): your typical incident.
p90: your bad-but-not-worst incident, often the most actionable number.
Split by severity: report MTTR and incident counts separately for SEV1 and SEV3; mixing them is meaningless.
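A short Python sketch of that kind of report, using only the standard library. The record shape (severity label, duration in minutes) and the function names are assumptions for illustration, and the sample data is made up: fifty four-minute blips plus one six-hour outage.

```python
from collections import defaultdict
from statistics import mean, median, quantiles


def summarize(durations):
    """Summarize a list of incident durations (minutes)."""
    # quantiles(n=10) returns the 9 decile cut points; the last one is p90.
    # Older Python versions require at least two data points, hence the guard.
    if len(durations) >= 2:
        p90 = quantiles(durations, n=10, method="inclusive")[-1]
    else:
        p90 = durations[0]
    return {
        "count": len(durations),
        "mean": mean(durations),
        "p50": median(durations),
        "p90": p90,
    }


def summarize_by_severity(records):
    """records: iterable of (severity, minutes) pairs."""
    by_sev = defaultdict(list)
    for severity, minutes in records:
        by_sev[severity].append(minutes)
    return {sev: summarize(durs) for sev, durs in sorted(by_sev.items())}


# Illustrative data: fifty 4-minute SEV3 blips and one 360-minute SEV1 outage.
records = [("SEV3", 4)] * 50 + [("SEV1", 360)]

mixed = summarize([minutes for _, minutes in records])
print(round(mixed["mean"], 2), mixed["p50"], mixed["p90"])  # 10.98 4 4.0

per_sev = summarize_by_severity(records)
print(per_sev["SEV1"]["p50"], per_sev["SEV3"]["p50"])  # 360 4
```

In the mixed summary the mean sits near 11 minutes, a duration no incident in the data actually had, while the median and p90 stay at 4 and the six-hour outage disappears. Splitting by severity is what keeps that outage visible.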

The deeper trick is to decompose MTTR into stages: time to detect (if your clock starts at customer impact rather than at the alert), time to acknowledge, time to diagnose, time to fix, time to verify. The time is rarely spread evenly across them; usually one stage dominates. If diagnosis eats 70% of every incident, no amount of “deploy faster” tooling helps; you need better observability. The aggregate MTTR hides that; the breakdown reveals it.
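One way to see which stage dominates is to sum the minutes spent in each stage across incidents and look at each stage's share. This is a rough sketch: the stage names and the dict-per-incident shape are assumptions, and the numbers are invented to mirror the 70% diagnosis case.

```python
STAGES = ["detect", "acknowledge", "diagnose", "fix", "verify"]


def stage_shares(incidents):
    """incidents: list of dicts mapping stage name -> minutes spent in it.

    Returns each stage's share of total incident time, as a fraction of 1.
    """
    totals = {s: sum(i.get(s, 0) for i in incidents) for s in STAGES}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {s: 0.0 for s in STAGES}
    return {s: totals[s] / grand_total for s in STAGES}


# Illustrative data only.
incidents = [
    {"detect": 5, "acknowledge": 3, "diagnose": 70, "fix": 15, "verify": 7},
    {"detect": 2, "acknowledge": 2, "diagnose": 35, "fix": 8, "verify": 3},
]

shares = stage_shares(incidents)
dominant = max(shares, key=shares.get)
print(dominant, round(shares[dominant], 2))  # diagnose 0.7
```

Summing minutes weights long incidents more heavily than short ones. If you would rather treat every incident equally, average the per-incident shares instead.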

MTBF: the one that actually means reliability

MTBF is the metric leadership should care most about and usually ignores, because it moves slowly and isn’t dramatic. But it’s the truest signal: if incidents are getting less frequent, your reliability work is paying off. If MTTR is improving but MTBF is flat, you’ve gotten better at mopping while the leak continues.

The catch: MTBF requires a consistent definition of “a failure.” If your incident-counting changes — you start declaring more SEV3s, or stop logging the small ones — MTBF moves for reasons that have nothing to do with reliability. Lock the definition down and keep it stable, or the trend is noise.
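A small sketch of how much the definition alone can move the number. It uses the simple window form of MTBF (hours in the period divided by the number of counted failures). The severity scale (1 is worst), the threshold parameter, and the sample quarter are all assumptions.

```python
def mtbf_window_hours(severities, window_hours, max_counted_sev):
    """MTBF as window length / number of incidents that count as failures.

    severities: list of ints, 1 = most severe.
    max_counted_sev: incidents with severity <= this value count as failures.
    Returns None when nothing in the window counts as a failure.
    """
    failures = sum(1 for s in severities if s <= max_counted_sev)
    if failures == 0:
        return None
    return window_hours / failures


# One illustrative 90-day quarter: 1 SEV1, 2 SEV2, 6 SEV3.
quarter = [1, 2, 2, 3, 3, 3, 3, 3, 3]
hours = 90 * 24

print(mtbf_window_hours(quarter, hours, max_counted_sev=2))  # 720.0
print(mtbf_window_hours(quarter, hours, max_counted_sev=3))  # 240.0
```

The same quarter and the same system give an MTBF that differs by a factor of three, purely from where the line is drawn. That is the movement a stable definition protects the trend against.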

Measuring honestly: the definitions that make or break the numbers

Most metric programs fail not at the math but at the definitions. Pin these down before you chart anything:

When does the clock start? Detection time, or when the customer first felt it? They can differ by a lot, and “time to detect” is itself a metric worth tracking.
When does it stop? First mitigation, or full resolution? Be explicit — a mitigated-but-degraded service is not recovered.
What counts as an incident? Draw the line at a severity threshold and hold it. Inconsistent inclusion corrupts every trend.
One incident or several? A cascading failure that trips five alerts is one incident, not five. Dedup before you count.

Write these definitions down and apply them mechanically. The goal isn’t impressive numbers; it’s comparable numbers across quarters.
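As an example of applying a definition mechanically, the dedup rule can be sketched as a grouping pass over alert timestamps. The 15-minute gap is an arbitrary assumption, and grouping purely by time is a simplification; a real rule would probably also consider which service or dependency chain the alerts belong to.

```python
from datetime import datetime, timedelta


def group_alerts(alert_times, gap=timedelta(minutes=15)):
    """Group alert timestamps into incidents.

    An alert that fires within `gap` of the previous alert joins the same
    incident; otherwise it starts a new one. Returns a list of lists.
    """
    incidents = []
    for t in sorted(alert_times):
        if incidents and t - incidents[-1][-1] <= gap:
            incidents[-1].append(t)
        else:
            incidents.append([t])
    return incidents


# Illustrative data: a cascade of five alerts, then an unrelated one six hours later.
start = datetime(2024, 1, 1, 3, 0)
alerts = [start + timedelta(minutes=m) for m in (0, 1, 2, 5, 9)]
alerts.append(start + timedelta(hours=6))

grouped = group_alerts(alerts)
print(len(grouped), [len(g) for g in grouped])  # 2 [5, 1]
```

Whatever rule you choose, the point is that it lives in code or a written policy and runs the same way every quarter, so counts stay comparable.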

Metrics that look useful but aren’t

A few that tend to mislead:

Number of incidents, alone. Without severity weighting, fifty cosmetic blips look worse than two catastrophic outages. Weight by impact.
MTTR as a target to beat. Make MTTR a KPI people are graded on, and they’ll close incidents early, downgrade severities, and stop logging the hard ones. You’ll “improve” MTTR by corrupting the data.
Vanity uptime. “99.9% uptime” measured by a healthcheck that doesn’t exercise the broken path is theater. Measure what customers experience.
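For the first item, a weighted count can be as simple as the sketch below. The weights are hypothetical placeholders; in practice they might come from customer-minutes affected or revenue impact rather than a fixed table.

```python
# Hypothetical weights: tune these to your own severity model.
SEVERITY_WEIGHTS = {"SEV1": 100, "SEV2": 10, "SEV3": 1}


def weighted_incident_score(severities, weights=SEVERITY_WEIGHTS):
    """Sum of per-incident weights; higher means more total impact."""
    return sum(weights[s] for s in severities)


cosmetic_quarter = ["SEV3"] * 50      # fifty cosmetic blips
catastrophic_quarter = ["SEV1"] * 2   # two catastrophic outages

print(len(cosmetic_quarter), weighted_incident_score(cosmetic_quarter))          # 50 50
print(len(catastrophic_quarter), weighted_incident_score(catastrophic_quarter))  # 2 200
```

By raw count the first quarter looks 25 times worse. By weighted score the second is four times worse, which matches what customers would have felt.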

Turning metrics into action

A metric earns its dashboard slot only if a specific response is wired to it:

MTTA trending up → audit escalation policies and schedules this week.
Diagnosis stage dominating MTTR → invest in observability and runbooks.
MTBF flat or worsening → your postmortem action items aren’t landing; check follow-through.
One service dominating incident count → that’s your next reliability investment, not a mystery.
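The mapping above can be written down as explicit rules so that a monthly review starts from a list of actions rather than a chart. This is only a sketch: the snapshot keys and the 50% thresholds are assumptions, not recommended values.

```python
def wired_responses(snapshot, dominance_threshold=0.5):
    """Turn a metrics snapshot into a list of follow-up actions.

    snapshot keys (all hypothetical): mtta_min, prev_mtta_min,
    diagnose_share, mtbf_hours, prev_mtbf_hours, incidents_by_service.
    """
    actions = []
    if snapshot["mtta_min"] > snapshot["prev_mtta_min"]:
        actions.append("audit escalation policies and schedules")
    if snapshot["diagnose_share"] > dominance_threshold:
        actions.append("invest in observability and runbooks")
    if snapshot["mtbf_hours"] <= snapshot["prev_mtbf_hours"]:
        actions.append("check postmortem action-item follow-through")

    by_service = snapshot["incidents_by_service"]
    total = sum(by_service.values())
    if total:
        service, count = max(by_service.items(), key=lambda kv: kv[1])
        if count / total > dominance_threshold:
            actions.append(f"prioritize reliability work on {service}")
    return actions


# Illustrative snapshot only.
snapshot = {
    "mtta_min": 9, "prev_mtta_min": 6,
    "diagnose_share": 0.7,
    "mtbf_hours": 200, "prev_mtbf_hours": 240,
    "incidents_by_service": {"checkout": 8, "search": 2, "auth": 1},
}

for action in wired_responses(snapshot):
    print(action)
```

With this snapshot all four rules fire. A healthy month returns an empty list, which is a useful result in its own right.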

Review these monthly with the people who can act on them, not quarterly with a slide deck. The cadence is what turns measurement into improvement.

Keep the set small

The temptation is always to add more metrics. Resist it. A focused set — MTTA, a decomposed MTTR distribution, and MTBF with stable definitions — that the team actually looks at beats a sprawling dashboard nobody trusts. Reliability improves when metrics drive decisions, and decisions come from a handful of numbers you understand deeply.

We keep templates for incident metrics and postmortem follow-through in our incident-response toolkit — because the metrics only matter if the action items they generate actually get done.

Metric definitions and targets here are starting points. Calibrate thresholds and targets against your own systems, severity model, and customer expectations.