Post Mortems with AI · By James Joyner IV · 11 min read

Counterfactual Analysis in Postmortems: What Would Have Caught This Sooner

The best postmortem question is 'what would have caught this sooner?' Here's how to run counterfactual analysis with AI to turn incidents into real detection wins.

  • #postmortems
  • #postmortem
  • #ai
  • #detection
  • #observability

The most useful sentence I’ve ever heard in an incident review came from a grumpy network engineer who interrupted the root-cause debate to ask: “Forget why it happened—what would have told us forty minutes earlier?” The room went quiet, because the honest answer was “a synthetic check on the third-party DNS provider that we’d talked about adding and never did.” That one counterfactual produced a better action item than the entire root-cause chain we’d been arguing about. The incident was going to recur in some form regardless; the detection gap was the thing we could actually close.

Most postmortems answer “why did this happen?” Fewer answer the question that prevents the next one: at each point in the timeline, what control—if it had existed—would have caught this sooner or stopped it entirely? That’s counterfactual analysis, and it’s where AI earns its place, because it’s a structured, slightly mechanical reasoning task that humans skip when they’re tired.

Counterfactuals turn timelines into roadmaps

A timeline is a record of what happened. A counterfactual analysis walks the same timeline and asks, at each gap, “what would have changed this?” There are two flavors and you want both:

Detection counterfactuals ask how you could have known sooner. The gap between “it broke” and “we noticed” is almost always your cheapest, highest-leverage fix. An alert that fires four minutes earlier is worth more than a heroic responder.

Prevention counterfactuals ask what would have stopped it entirely—a guardrail, a test, a staged rollout, a stricter validation. These are more expensive and you can’t add one for every incident, so you have to prioritize. But naming them keeps the option visible.

The trap to avoid: hindsight bias. It’s easy to write counterfactuals that are really just “the engineer should have known,” which is blame wearing a lab coat. A good counterfactual proposes a system change that would have helped anyone in that seat, not a smarter human.
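The "system change, not a smarter human" rule can be partly checked mechanically. Below is a minimal sketch in Python, assuming a hand-picked phrase list; `BLAME_PATTERNS` and `flags_blame` are hypothetical names for illustration, not part of any existing tool. It only catches obvious wording, so a human still has to read every proposal.

```python
import re

# Hypothetical phrase list: wording that usually means the proposed
# "control" is really a smarter human. Extend it from your own reviews.
BLAME_PATTERNS = [
    r"\bshould have (known|noticed|caught|checked)\b",
    r"\bmore careful\b",
    r"\b(engineer|reviewer|operator|on-call) would have\b",
]


def flags_blame(proposal: str) -> bool:
    """Return True if the proposal text matches any blame pattern."""
    text = proposal.lower()
    return any(re.search(pattern, text) for pattern in BLAME_PATTERNS)


if __name__ == "__main__":
    proposals = [
        "The engineer should have known the DNS provider was degraded",
        "Synthetic check on the third-party DNS provider",
    ]
    for proposal in proposals:
        label = "REWRITE AS A CONTROL" if flags_blame(proposal) else "ok"
        print(f"{label}: {proposal}")
```

A flagged proposal is not discarded. It is sent back to be rewritten as a control that would have helped anyone in that seat.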

A prompt that walks the timeline for you

I feed the model the verified timeline and ask it to propose counterfactuals at each meaningful step. The crucial constraint is that every proposal must be a concrete, buildable control—not “be more careful.”

You are doing counterfactual analysis on an incident timeline.
For each timeline event, ask two questions:

(A) DETECTION: What specific, buildable signal (alert, synthetic
    check, dashboard, SLO burn-rate alarm, log-based metric) would
    have surfaced this problem at or before this point?
(B) PREVENTION: What specific guardrail (test, validation, staged
    rollout, circuit breaker, schema check, access control) would
    have stopped the chain here?

Rules:
- Propose only system/tooling changes. Never "the engineer should
  have noticed" or "more careful review." If your answer requires a
  human to be smarter, rewrite it as a control.
- Be concrete enough to file as a ticket (name the signal, the
  threshold, the system it lives in).
- Mark each as DETECTION or PREVENTION and estimate effort: S/M/L.
- If a control already existed but failed, say why it didn't fire.

Timeline:
<paste verified timeline>

That last rule—“if a control existed but failed, say why”—catches the most embarrassing and most valuable finding: the alert that was configured but had a threshold so loose it never tripped, or the test that existed but ran against a mock. Those are nearly free to fix and shockingly common.
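If the verified timeline already lives in structured form, filling the prompt can be scripted. This is a minimal sketch, assuming the prompt text is stored with its `<paste verified timeline>` line intact; `TimelineEvent`, `render_timeline`, and `fill_prompt` are hypothetical names. Sending the result to a model is left out on purpose, since that depends on the provider you use.

```python
from dataclasses import dataclass

# The placeholder line exactly as it appears in the prompt text.
PLACEHOLDER = "<paste verified timeline>"


@dataclass(frozen=True)
class TimelineEvent:
    time: str         # wall-clock time as written in the timeline, e.g. "02:14"
    description: str  # what happened, already verified by a human


def render_timeline(events: list[TimelineEvent]) -> str:
    """Render one event per line, in the order given."""
    return "\n".join(f"{e.time} {e.description}" for e in events)


def fill_prompt(template: str, events: list[TimelineEvent]) -> str:
    """Replace the placeholder in the prompt template with the timeline."""
    if PLACEHOLDER not in template:
        raise ValueError("template is missing the timeline placeholder")
    if not events:
        raise ValueError("refusing to build a prompt from an empty timeline")
    return template.replace(PLACEHOLDER, render_timeline(events))


if __name__ == "__main__":
    # Shortened template for the demo; in practice use the full prompt text.
    template = "Timeline:\n" + PLACEHOLDER
    events = [
        TimelineEvent("02:14", "failover stalled"),
        TimelineEvent("02:51", "human noticed"),
        TimelineEvent("02:55", "retry storm"),
    ]
    print(fill_prompt(template, events))
```

Failing loudly on an empty timeline or a missing placeholder is deliberate: a prompt with no timeline invites the model to invent one.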

What the output looks like after you prune it

The model returns more candidates than you’ll act on, which is correct—you want the menu, then you choose. After I prune to what’s worth funding, the section reads like this:

Counterfactual analysis

| When | What would have caught it | Type | Effort |
| --- | --- | --- | --- |
| 02:14 (failover stalled) | Synthetic check on replica promotion status, alerting if a primary isn’t promoted within 60s of failover | Detection | M |
| 02:14 (failover stalled) | Failover automation should verify promotion and roll back if it stalls, rather than reporting success on initiation | Prevention | L |
| 02:51 (human noticed) | Burn-rate alert on the API SLO at 2% budget consumed per minute would have paged at ~02:19 instead of relying on a manual catch | Detection | S |
| 02:55 (retry storm) | Existing rate limiter was configured but its threshold (10k rps) was above peak retry load; tightening to 4k would have engaged | Prevention | S |

That last row is the gold: a control that existed and didn’t help because of a number. The “S” effort and the named threshold mean it ships this sprint. AI surfaced it from the timeline; a human confirmed the rate limiter’s real config before it went in the doc.
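The detection rows also put a number on the gap. Here is a small worked example using the timestamps from the table, assuming impact began at 02:14 when the failover stalled and treating the ~02:19 page time as exact; `minutes_between` is a hypothetical helper that only handles times within the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both "HH:MM" on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


onset = "02:14"       # failover stalled (assumed start of impact)
noticed = "02:51"     # a human noticed
would_page = "02:19"  # approximate page time for the burn-rate alert

actual_gap = minutes_between(onset, noticed)         # 37 minutes undetected
gap_with_alert = minutes_between(onset, would_page)  # 5 minutes with the alert
minutes_saved = actual_gap - gap_with_alert          # 32 minutes

print(f"Detected after {actual_gap} min; the alert would have cut that "
      f"to {gap_with_alert} min, saving {minutes_saved} min.")
```

Putting the saved minutes next to the "S" effort estimate makes the case for funding the alert in one line.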

The human owns the judgment, always

Two things stay firmly with a person. First, pruning: the model will happily propose a synthetic check for every conceivable failure, and a postmortem that recommends fifteen new alerts is one that recommends nothing, because nobody will build all of them. Pick the two or three with the best leverage-to-effort ratio. Second, the honesty check: make sure no counterfactual smuggled in blame. “A reviewer would have caught the bad migration” is not a control; “a migration linter that blocks irreversible DDL in CI” is. If the proposal needs a human to have been smarter, it’s not done yet.
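The leverage-to-effort ranking can be made explicit once a person has scored each candidate. This is a rough sketch, not a formula from this guide: the `leverage` score (1 to 5) is a human judgment, and the `EFFORT_COST` weights for S/M/L are hypothetical numbers you would tune for your own team. The leverage scores in the demo are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical relative costs for the S/M/L effort estimates.
EFFORT_COST = {"S": 1, "M": 3, "L": 8}


@dataclass(frozen=True)
class Candidate:
    title: str
    kind: str      # "DETECTION" or "PREVENTION"
    effort: str    # "S", "M", or "L"
    leverage: int  # 1 (minor) to 5 (major), assigned by a human reviewer


def shortlist(candidates: list[Candidate], keep: int = 3) -> list[Candidate]:
    """Return the top `keep` candidates by leverage divided by effort cost."""
    ranked = sorted(
        candidates,
        key=lambda c: c.leverage / EFFORT_COST[c.effort],
        reverse=True,
    )
    return ranked[:keep]


if __name__ == "__main__":
    candidates = [
        Candidate("Synthetic check on replica promotion", "DETECTION", "M", 4),
        Candidate("Failover automation verifies promotion", "PREVENTION", "L", 5),
        Candidate("Burn-rate alert on the API SLO", "DETECTION", "S", 4),
        Candidate("Tighten rate limiter threshold", "PREVENTION", "S", 3),
    ]
    for c in shortlist(candidates, keep=2):
        print(f"{c.kind} [{c.effort}] {c.title}")
```

The score only orders the menu. Whether the high-leverage, high-effort item still deserves a slot is a call for the people in the review.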

Run cleanly, this section is what converts a postmortem from a history lesson into a detection roadmap. The AI does the methodical timeline walk you’d skip at the end of a long day; you decide which gaps are worth closing.

The counterfactual prompt lives with my other incident snippets in the prompts library, and it pairs naturally with the action-item and impact work covered across the postmortems category. If you want the surrounding template, the blameless postmortem guide has it.

Ask what would have caught it sooner. The answer is usually cheaper than you think and worth more than the root cause.
