Detection-Gap Analysis: Finding Where the Incident Stayed Silent

Every incident has a silent period: the stretch where the system is already broken and nobody knows yet. It starts when the fault begins doing damage and ends when a human or an alert finally notices. On a well-instrumented service it’s seconds. On a poorly instrumented one it’s the forty minutes a customer spends filing a support ticket before your dashboards say anything is wrong. That silent period is, almost always, the cheapest part of the incident to fix — because you’re not preventing the failure, you’re just shrinking the window before you find out.

And yet most postmortems dispose of detection in a single line: “alerting was delayed.” That sentence hides the most fixable part of the whole incident. Detection-gap analysis is the discipline of pulling that line apart and turning it into specific, scoped, prioritized signals — and it’s a structured timeline walk, which is exactly the kind of methodical work AI is good at and tired humans skip.

Three timestamps, not one

The single most useful move in detection analysis is to split “when did this happen” into three distinct moments:

Fault onset — when the underlying problem began.
Impact onset — when it started affecting users or SLIs.
Detection — when anyone actually knew.

The gap between impact onset and detection is your silent period, and naming it as a number changes the conversation. “We detected it slowly” is a shrug. “Customers were getting 500s for eleven minutes before any alert fired” is a finding with a price tag. Most postmortem timelines blur these three into one fuzzy “the incident started around 2:15,” which makes the silent period invisible and therefore unfixable.
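As a minimal sketch of that split, the three moments can be held in one small record so the silent period falls out as a number. The class name, field names, and timestamps here are illustrative assumptions, chosen only to reproduce the eleven-minute example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DetectionTimeline:
    fault_onset: datetime    # underlying problem began
    impact_onset: datetime   # users or SLIs started being affected
    detection: datetime      # a human or an alert first knew

    def __post_init__(self) -> None:
        # Impact cannot start before the fault that causes it.
        if self.impact_onset < self.fault_onset:
            raise ValueError("impact_onset must not precede fault_onset")

    @property
    def silent_period(self) -> timedelta:
        """Time users were affected before anyone knew (never negative)."""
        return max(self.detection - self.impact_onset, timedelta(0))

    @property
    def time_to_detect(self) -> timedelta:
        """Time from the fault beginning to detection."""
        return self.detection - self.fault_onset


# Hypothetical timestamps, not taken from a real incident.
timeline = DetectionTimeline(
    fault_onset=datetime(2024, 1, 1, 2, 10),
    impact_onset=datetime(2024, 1, 1, 2, 15),
    detection=datetime(2024, 1, 1, 2, 26),
)
print(f"silent period: {timeline.silent_period}")    # 0:11:00
print(f"time to detect: {timeline.time_to_detect}")  # 0:16:00
```

Keeping the two durations as separate properties makes it hard to report one fuzzy "incident start" by accident.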

Why detection failed matters more than that it failed

The second move is to classify the type of detection failure, because the fix is completely different for each:

No signal existed. Nothing was watching this. The fix is to add a signal.
A signal existed but didn’t fire. Threshold too loose, scope too narrow, wrong window. The fix is to tune what you have — often nearly free.
A signal fired but to the wrong place, or drowned in noise. The fix is routing or noise reduction, not another alert.
A human or customer caught it before automation did. The fix depends on which of the above applies to the signal that should have beaten them to it.
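The categories above can be sketched as a small decision function. This is one possible encoding, not a standard taxonomy: the enum, the `classify` helper, and its three boolean inputs are hypothetical names introduced for illustration. The "human caught it first" case is not a separate branch, because it resolves to whichever of the other three applies to the signal that should have fired.

```python
from enum import Enum


class DetectionFailure(Enum):
    NO_SIGNAL = "no signal existed"
    DID_NOT_FIRE = "signal existed but didn't fire"
    MISROUTED_OR_NOISY = "signal fired but went to the wrong place or drowned in noise"


# Each failure type maps to a different kind of fix.
FIX_FOR = {
    DetectionFailure.NO_SIGNAL: "add a signal",
    DetectionFailure.DID_NOT_FIRE: "tune threshold, scope, or window",
    DetectionFailure.MISROUTED_OR_NOISY: "fix routing or reduce noise",
}


def classify(signal_existed: bool, signal_fired: bool, reached_responder: bool) -> DetectionFailure:
    """Classify why automation did not detect the incident in time."""
    if not signal_existed:
        return DetectionFailure.NO_SIGNAL
    if not signal_fired:
        return DetectionFailure.DID_NOT_FIRE
    if not reached_responder:
        return DetectionFailure.MISROUTED_OR_NOISY
    raise ValueError("signal fired and reached a responder: no detection gap to classify")


# Example: an alert was configured but its threshold was never crossed.
failure = classify(signal_existed=True, signal_fired=False, reached_responder=False)
print(f"{failure.value} -> {FIX_FOR[failure]}")
```

Note that "add another alert" appears as the fix for only one of the three outcomes.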

Conflate these and you make things worse. The classic mistake is responding to any missed detection by adding more alerts — which, if the real problem was noise, just builds the next ignored alert. I have a model walk the timeline and force the classification.

Analyze the detection gap in this incident.

1. Mark the detection timeline: fault-onset, impact-onset,
   actual-detection. State time-to-detect and how much was
   "silent" (impacting before anyone knew).
2. Classify the failure: no signal existed / signal existed but
   didn't fire (threshold/scope/window) / signal fired but went
   to the wrong place or was lost in noise / caught by a
   human/customer instead of automation.
3. For each gap, propose the most specific, lowest-cost signal
   that would have detected this earlier — name the metric, a
   starting threshold, and where it lives. Prefer burn-rate /
   SLO-based alerts over static thresholds where appropriate.
4. Flag any alert that fired but was ignored — is the problem the
   alert or the noise around it?
5. Estimate how much earlier each proposed signal would detect.

Rules: Frame gaps as signal/tooling deficits, never "someone
should have been watching." Mark timing estimates as [ESTIMATE].

Timeline & existing signals: <paste>

The most valuable row is the alert that already existed

After a human verifies the real alert configs, the output looks like this:

Detection analysis — silent period: 11 minutes

Gap	Failure type	Proposed signal	Est. earlier detection
500s on checkout went unalerted	Signal existed, didn’t fire	Burn-rate alert at 2% error-budget/min on checkout SLO (current alert is a static 5% error rate, which this 1.5% spike never tripped)	~9 min
DB connection saturation	No signal existed	Gauge alert on pool utilization >85% for 2 min	~6 min
The page that finally caught it	Caught by customer first	n/a (the burn-rate alert above would have beaten the customer report)	n/a

Highest-leverage fix: replace the static error-rate alert with a burn-rate alert. The alert already exists, so this is a configuration change rather than new instrumentation, and it would have collapsed most of the silent period.

That top row is gold and it’s the one a tired human misses: an alert that was configured and didn’t help because of a number. A 5% static threshold that a real 1.5% failure never reached is a near-free fix — you’re changing a config, not building observability — and it’s shockingly common. AI surfaced it from the timeline; a human confirmed the real threshold before it went in the doc.
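A short sketch of why the static threshold stayed quiet while a burn-rate check would not have. The 5% threshold and the 1.5% spike come from the example above; the 99.9% SLO target and the 14.4x burn-rate threshold are assumptions for illustration (14.4x corresponds to spending 2% of a 30-day error budget in one hour), not values from the incident or the table.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


STATIC_THRESHOLD = 0.05      # the existing 5% static alert from the example
OBSERVED_ERROR_RATE = 0.015  # the 1.5% spike from the example
SLO_TARGET = 0.999           # ASSUMPTION: hypothetical 99.9% checkout SLO
BURN_RATE_THRESHOLD = 14.4   # ASSUMPTION: 2% of a 30-day budget spent in 1 hour

static_fires = OBSERVED_ERROR_RATE >= STATIC_THRESHOLD
rate = burn_rate(OBSERVED_ERROR_RATE, SLO_TARGET)
burn_fires = rate >= BURN_RATE_THRESHOLD

print(f"static alert fires: {static_fires}")                            # False
print(f"burn rate: {rate:.1f}x, burn-rate alert fires: {burn_fires}")   # 15.0x, True
```

Under these assumed numbers, a failure well below the static threshold is still spending the error budget about fifteen times faster than the SLO allows, which is the condition a burn-rate alert is built to catch.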

Don’t punish the responder for the silent period

The blameless framing is specific here. It’s tempting to read a long silent period as “nobody was watching the dashboards,” which quietly blames the on-call engineer for not staring at a screen at 2 a.m. That’s not the finding. The finding is that the system should have paged them, and didn’t, because of a missing or mis-tuned signal. Detection is a property of your instrumentation, not your responder’s vigilance — and a postmortem that frames it otherwise both misses the real fix and erodes the blameless culture you need for honest writeups. The prompt keeps every gap framed as a signal deficit.

The estimates are inferred, so mark them

“This would have detected nine minutes earlier” is inferred from the timeline, not measured. It’s a useful number for prioritizing, but it’s a model’s reasoning about counterfactual timing, and if it ends up quoted as fact in the postmortem it’s a fabrication wearing a precise figure. Marking these as estimates keeps them in their lane: good enough to rank fixes, not solid enough to assert.
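One way to keep the label from getting lost is to make it the default, so a timing claim renders as an estimate unless someone explicitly records that it was measured. A minimal sketch, with a hypothetical `TimingClaim` record and signal names borrowed from the example table:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingClaim:
    signal: str
    minutes_earlier: float
    measured: bool = False  # set True only when backed by a replay or real alert history

    def render(self) -> str:
        # Inferred numbers always carry the tag; measured ones do not.
        tag = "" if self.measured else " [ESTIMATE]"
        return f"{self.signal}: ~{self.minutes_earlier:g} min earlier{tag}"


claims = [
    TimingClaim("checkout burn-rate alert", 9),
    TimingClaim("DB pool utilization alert", 6),
]
for claim in claims:
    print(claim.render())
```

Because `measured` defaults to `False`, a number copied into the postmortem without verification keeps its tag.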

The human picks what to build

The model turns “we should have caught it sooner” into a scoped, classified, prioritized menu. A person still decides which signals are worth the alerting budget — because every new alert is also a future page, and a postmortem that recommends ten new alerts has recommended a future noise problem. Pick the highest-leverage one or two, usually starting with the alerts that already exist and just need tuning.
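The ordering a reviewer might start from can be sketched as a sort: tuning changes to existing alerts first, then larger estimated gains, capped at two. The `Proposal` record, the `shortlist` helper, and the third "queue depth" proposal are hypothetical, and the output is a starting order for a person to review, not a decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Proposal:
    name: str
    est_minutes_earlier: float  # a model estimate, not a measurement
    tunes_existing_alert: bool  # True means a config change, no new page source


def shortlist(proposals: List[Proposal], limit: int = 2) -> List[Proposal]:
    """Order proposals for human review: tuning first, then biggest estimated gain."""
    ranked = sorted(
        proposals,
        key=lambda p: (not p.tunes_existing_alert, -p.est_minutes_earlier),
    )
    return ranked[:limit]


proposals = [
    Proposal("DB pool utilization alert", 6, tunes_existing_alert=False),
    Proposal("queue depth alert (hypothetical)", 3, tunes_existing_alert=False),
    Proposal("checkout burn-rate alert", 9, tunes_existing_alert=True),
]
for p in shortlist(proposals):
    print(p.name)
```

The cap is the point of the sketch: everything below the cut is a future page that nobody has agreed to carry yet.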

The detection-gap prompt lives in the prompts library, and it pairs naturally with counterfactual analysis for the broader “what would have caught this” question. For the document it feeds into, the blameless postmortem guide has the surrounding template.

Find the silent period, give it a number, and close it. It’s almost always cheaper than the root cause and worth more than you’d guess.
