Five-Whys vs Causal Graphs: When Each Postmortem Method Fits

I once watched a postmortem reach “the engineer pushed a bad config” as its root cause via a confident, four-step five-whys chain. Everyone nodded, and the action item was “add a config review step.” Three weeks later the same class of incident happened with a different engineer and a different config. The actual problem, that nothing validated a config between “config merged” and “config live in production,” had been three whys further down the chain, and nobody walked there. The five-whys didn’t lie. It just stopped at the first answer that felt like a person.

That’s the core tension in causal analysis. Five-whys is fast, teachable, and produces a satisfying single answer. Causal graphs are slower, messier, and produce a web of contributing factors. Most teams use five-whys for everything because it’s easy, and most of the time that’s fine — until the incident is genuinely multi-causal and the linear method quietly erases everything except one link.

What five-whys is good at, and where it fails

Five-whys works beautifully on incidents that really are linear: A caused B caused C, and there’s one chain to walk. A cron job filled a disk, which crashed a service, which failed health checks, which dropped traffic. Walking back from the symptom to “we have no disk-usage alerting and no log rotation on that host” is honest and fast. For a large share of incidents, that’s all you need, and reaching for a heavier method would be ceremony.

It fails in two specific ways. The first is premature termination on a human: the chain hits “the engineer made a mistake” and stops, because that feels like an answer, when it’s actually a signal that you should keep going to the system that let the mistake reach production. The second is single-path tunnel vision: the method’s structure assumes one chain, so when an incident is the intersection of three independent conditions that were each individually survivable, five-whys picks whichever chain you started down and presents it as the cause. The other two factors vanish from the writeup.
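The first failure mode is easy to check mechanically once each answer in the chain has been labeled. Here is a minimal sketch, assuming you tag every "why" by hand with what it lands on. The `Why` class, the `kind` labels, and the first step of the config chain are my own illustrative choices, not part of any standard five-whys tooling.

```python
from dataclasses import dataclass


@dataclass
class Why:
    answer: str
    kind: str  # hypothetical label: "system", "process", "signal", or "person"


def stops_on_person(chain: list[Why]) -> bool:
    """True when the final answer in the chain is a person, not a system condition."""
    return bool(chain) and chain[-1].kind == "person"


# The disk incident: a genuinely linear chain that ends on a system condition.
disk_chain = [
    Why("traffic was dropped", "signal"),
    Why("health checks failed", "signal"),
    Why("the service crashed", "system"),
    Why("a cron job filled the disk", "system"),
    Why("no disk-usage alerting and no log rotation on that host", "system"),
]

# The config incident as first written up: the chain ends on a person.
# The first step is an illustrative placeholder for the earlier whys.
config_chain = [
    Why("a bad config was live in production", "system"),
    Why("the engineer pushed a bad config", "person"),
]

for name, chain in [("disk", disk_chain), ("config", config_chain)]:
    verdict = "keep asking why" if stops_on_person(chain) else "ends on a system condition"
    print(f"{name}: {verdict}")
```

The check only catches premature termination. It says nothing about the second failure mode, because a chain that ends on a system condition can still be one of several chains.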

What a causal graph captures that a chain can’t

A causal graph drops the assumption of a single line. You list the distinct contributing factors (the trigger, the latent conditions that were fine until they weren’t, the missing guardrails, the detection gap) and you map how they combined. The key thing it represents that a chain cannot is the necessary-but-not-sufficient relationship. The bad config was necessary but not sufficient: it only caused an outage because there was no validation gate, the canary was disabled that week, and the alert that should have caught the rollout had a threshold set too high. Remove any one of those and there’s no incident. A five-whys forces you to pick one of them as “the” cause. A graph shows you that the real finding is the combination, and that you have multiple independent places to intervene.
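The necessary-but-not-sufficient idea can be made concrete with a small counterfactual check. This is a sketch under a strong simplifying assumption: the outage is modeled as a plain AND of the four conditions from the config incident. Real incidents are rarely that clean, and the condition names and function names here are ones I made up for illustration.

```python
# Conditions from the config incident; True means the condition was present.
incident = {
    "invalid_config_merged": True,
    "no_validation_gate": True,
    "canary_disabled": True,
    "alert_threshold_too_high": True,
}


def outage(conditions: dict[str, bool]) -> bool:
    """Assumed model: the outage happens only when every condition is present."""
    return all(conditions.values())


def necessary_factors(conditions: dict[str, bool]) -> list[str]:
    """Factors whose removal, on its own, would have prevented the outage."""
    if not outage(conditions):
        return []
    return [
        name
        for name in conditions
        if not outage({**conditions, name: False})
    ]


def sufficient_alone(conditions: dict[str, bool]) -> list[str]:
    """Factors that would cause the outage with every other condition absent."""
    return [
        name
        for name in conditions
        if outage({other: other == name for other in conditions})
    ]


print("necessary:", necessary_factors(incident))
print("sufficient alone:", sufficient_alone(incident))
```

Under this model every factor shows up as necessary and none as sufficient alone, which is the shape a single chain cannot express.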

The cost is real: a causal graph takes longer, it’s harder to communicate, and on a genuinely simple incident it’s over-engineering that makes a trivial outage look profound. So the skill isn’t “always use graphs.” It’s knowing which incident you’re holding.

Running both with AI and comparing them

Here’s where AI is genuinely useful, because running both methods by hand is exactly the tedious, structured work people skip when they’re tired at the end of a review. I feed the model the verified facts and have it produce both representations, then — crucially — compare them.

Run this incident through two methods and compare them.

1. FIVE-WHYS: Start from the customer-facing failure. Ask "why"
   repeatedly until you reach a systemic condition, not a person
   or a single bug. Show every step.
2. CAUSAL GRAPH: List the distinct contributing factors (trigger,
   latent conditions, missing guardrails, detection gap). Note
   every place two factors were each NECESSARY but neither
   SUFFICIENT alone.
3. COMPARE: Where does the linear chain hide a factor the graph
   exposes? Where is the graph over-engineering a simple incident?
4. RECOMMEND which method fits this incident, in one sentence.

Rules: Every "why" must land on a system/process/signal, never on
a person's competence. Mark anything not in the verified facts as
[UNVERIFIED].

Verified facts: <paste>

The comparison step is the deliverable, not either method on its own. It’s what tells you whether you’re holding a simple incident that five-whys handles fine, or a multi-causal one where the chain is about to flatten three findings into one.
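The bookkeeping around the prompt can be scripted without committing to any particular model API. Here is a minimal sketch, assuming the prompt is stored as a template containing the `<paste>` placeholder and that the model marks unsupported claims with `[UNVERIFIED]` as the rules ask. The function names and the sample output are hypothetical, and the model call itself is deliberately left out.

```python
UNVERIFIED_TAG = "[UNVERIFIED]"


def build_prompt(template: str, verified_facts: list[str]) -> str:
    """Replace the <paste> placeholder with a bulleted list of verified facts."""
    if "<paste>" not in template:
        raise ValueError("template is missing the <paste> placeholder")
    bullets = "\n".join(f"- {fact}" for fact in verified_facts)
    return template.replace("<paste>", "\n" + bullets)


def unverified_claims(model_output: str) -> list[str]:
    """Collect every output line the model flagged as outside the verified facts."""
    return [
        line.strip()
        for line in model_output.splitlines()
        if UNVERIFIED_TAG in line
    ]


# Stand-in for the tail of the full prompt text.
template_tail = "Verified facts: <paste>"
facts = [
    "a config change with an invalid value reached production",
    "there is no schema validation gate between merge and live config",
]
prompt = build_prompt(template_tail, facts)

# Hypothetical model output, used only to show the filtering step.
sample_output = """\
Factor: no schema validation gate between merge and live config
Factor: canary was disabled for the release window [UNVERIFIED]
Factor: rollout alert threshold was above the failure's signal [UNVERIFIED]
"""
for claim in unverified_claims(sample_output):
    print("needs human verification:", claim)
```

The list of flagged lines is the human's to-do list: nothing tagged `[UNVERIFIED]` should reach the postmortem until someone has confirmed it.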

What the comparison surfaces

On the config incident, the two outputs diverge in an instructive way. The five-whys lands on “no validation between merge and production.” The causal graph lands on the same gap plus “canary was disabled for the release window” plus “the rollout alert threshold was set above the failure’s signal.” After a human verifies those, the postmortem reads:

Contributing factors (not a single root cause):

Trigger: a config change with an invalid value reached production.

Latent condition 1: no schema validation gate between merge and live config — the value was never checked.

Latent condition 2: the canary (progressive rollout) was disabled that week, so the change hit 100% of traffic at once.

Detection gap: the rollout health alert’s error-rate threshold was above the level this failure produced.

Each of the three conditions above was individually survivable. The incident required the trigger plus all three. There are therefore three independent places to intervene, and we should not file a single “add config review” item as though one fix closes this.

That last sentence is the difference between a postmortem that prevents recurrence and one that prevents the exact recurrence while leaving two other doors open. The AI produced both analyses; a human confirmed the canary and alert facts and decided the graph was the honest model for this incident.
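The divergence between the two outputs is just a set difference once the findings are written down. Here is a minimal sketch, assuming each finding has been normalized to a short string by hand. The function name is hypothetical, and the strings paraphrase the findings from the config incident.

```python
def hidden_by_chain(chain_findings: set[str], graph_factors: set[str]) -> list[str]:
    """Factors the causal graph exposes that the linear chain never reached."""
    return sorted(graph_factors - chain_findings)


chain_findings = {"no validation between merge and production"}
graph_factors = {
    "no validation between merge and production",
    "canary disabled for the release window",
    "rollout alert threshold above the failure's signal",
}

hidden = hidden_by_chain(chain_findings, graph_factors)
print(f"{len(graph_factors)} places to intervene, {len(hidden)} hidden by the chain:")
for factor in hidden:
    print(" -", factor)
```

An empty result suggests the chain already covers what the graph found and the incident may be genuinely linear. A non-empty one is a prompt for the human to decide whether the extra factors are real structure, not a verdict.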

The human still owns the judgment call

Two things stay with a person. First, the simple-vs-complex call. The model will offer a recommendation, but you decide whether the graph is real structure or false depth — some incidents are genuinely one chain and dressing them up wastes everyone’s time. Second, the blameless check, which both methods need: any “why” or any factor that reduces to “a smarter person wouldn’t have done that” is blame, not analysis, and it has to be rewritten into the system condition that made the human action matter.

If you want the prompt that runs both methods side by side, it’s in the prompts library, and it pairs naturally with the counterfactual analysis work for deciding what to actually fix once you’ve found the factors. For the surrounding document, the blameless postmortem guide covers the template the analysis drops into.

The method is a tool, not a religion. Use the chain when the incident is a chain, reach for the graph when it isn’t, and let the comparison tell you which you’re holding.