Keeping an Incident Decision Log With AI Support

Three days after a messy SEV1, we sat in the postmortem trying to answer a simple question: why did we wait forty minutes to fail over to the secondary region? Nobody could remember. There’d been a reason at the time — a good one, something about a replication lag concern — but it had lived only in the IC’s head during the incident, and by the postmortem it was gone. We could see what happened from the timeline. We had no record of why we’d chosen it. And without the why, we couldn’t tell whether the forty-minute wait was a mistake to fix or a sound call to repeat.

This is the gap a decision log fills. Most teams capture a timeline — a record of events. Far fewer capture the decisions: the choices the responders made, the reasoning behind them, and the alternatives they rejected. Yet the decisions are where the real learning lives. The timeline tells you the sequence of the world; the decision log tells you the sequence of human judgment, and human judgment is the thing you’re actually trying to improve.

Timeline versus decision log

A timeline entry says “10:42 — failed over to secondary region.” A decision log entry says “10:42 — IC chose to fail over to secondary despite replication lag concern, because primary recovery ETA was unknown and customer impact was growing; rejected waiting for primary because the wait was unbounded.” See the difference. The timeline records the action; the decision log records the reasoning and the discarded alternative. When you review the incident later, the second one is what lets you actually evaluate the call.
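One way to make the distinction concrete is to write the two entry types as data structures. The sketch below is illustrative only: the class and field names (`TimelineEntry`, `DecisionEntry`, `RejectedAlternative`, `is_reviewable`) are assumptions made for this example, not a schema from any particular tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TimelineEntry:
    """What happened, and when."""
    time: str    # time as written in the channel, e.g. "10:42"
    action: str


@dataclass(frozen=True)
class RejectedAlternative:
    """An option that was considered and consciously turned down."""
    option: str
    why_rejected: str


@dataclass(frozen=True)
class DecisionEntry:
    """What was chosen, by whom, why, and what was chosen against."""
    time: str
    decided_by: str
    action: str
    reasoning: str
    known_risks: Tuple[str, ...] = ()
    rejected: Tuple[RejectedAlternative, ...] = ()

    def is_reviewable(self) -> bool:
        # A reviewer needs both the reasoning and at least one rejected option.
        return bool(self.reasoning.strip()) and len(self.rejected) > 0


# The 10:42 failover, recorded both ways.
timeline = TimelineEntry(time="10:42", action="failed over to secondary region")

decision = DecisionEntry(
    time="10:42",
    decided_by="IC",
    action="fail over to secondary region",
    reasoning="primary recovery ETA was unknown and customer impact was growing",
    known_risks=("replication lag",),
    rejected=(RejectedAlternative("wait for primary", "the wait was unbounded"),),
)
```

The timeline entry has nowhere to put the reasoning, which is the point: if the structure has no field for it, nobody records it.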

This matters because most incident decisions are made with incomplete information under time pressure, and the only fair way to evaluate them afterward is against what the responder knew at the time, not against what you know now with the answer in hand. Hindsight makes every decision look obvious. The decision log preserves the actual epistemic state — what was known, what was uncertain, why the call made sense then — which is the only honest basis for a blameless review.

Why decision logging usually doesn’t happen

In principle everyone agrees decision logs are valuable. In practice almost nobody keeps one, for a simple reason: during an incident, the person making the decisions is far too busy making them to also write down their reasoning. Asking the IC to pause mid-crisis and document why they chose the failover is asking them to slow down the response to feed the paperwork. They won’t, and they shouldn’t. So the reasoning evaporates, and we’re back in the postmortem unable to remember why.

The classic fix is a dedicated scribe, and it helps. But a human scribe trying to capture decisions in real time is overloaded too, and decision reasoning is harder to capture than events because it’s often unspoken: the IC made the call in their head and announced only the action. The reasoning needs to be drawn out, which a busy scribe rarely has bandwidth for.

How AI makes decision logging feasible

Run an AI scribe over the incident channel and have it watch specifically for decisions, not just events. When the channel shows a choice being made — a failover, an escalation, a decision to wait — the model can flag it and draft a decision-log entry, prompting for the reasoning if it’s not stated. The AI Incident Response Assistant can maintain this decision log alongside the timeline, surfacing “a decision appears to have been made here — what was the reasoning?” so the record gets captured in the moment instead of reconstructed from faded memory three days later.
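The flag-and-prompt loop can be sketched in a few lines. In the sketch below a keyword heuristic stands in for the model, purely so the example runs on its own; a real scribe would use a language model to spot decisions. The cue lists and the `follow_up_for` function are hypothetical.

```python
import re
from typing import Optional

# Hypothetical cue lists. A real scribe would rely on a language model,
# not keyword matching; these exist only to keep the sketch runnable.
DECISION_CUES = re.compile(
    r"\b(failing over|fail over|going to|escalating|rolling back|"
    r"roll back|holding off|decided to)\b",
    re.IGNORECASE,
)
REASON_CUES = re.compile(r"\b(because|since|so that|but|due to)\b", re.IGNORECASE)

FOLLOW_UP = "A decision appears to have been made here. What was the reasoning?"


def follow_up_for(message: str) -> Optional[str]:
    """Return a follow-up question if the message looks like a decision
    announced without reasoning; return None otherwise."""
    if not DECISION_CUES.search(message):
        return None  # not a decision: nothing to ask
    if REASON_CUES.search(message):
        return None  # reasoning already stated: record it, don't interrupt
    return FOLLOW_UP


for msg in (
    "p99 latency is back under 200ms",
    "failing over to secondary now",
    "going to secondary, replication lag is a risk but primary ETA is unknown",
):
    print(repr(msg), "->", follow_up_for(msg))
```

The design choice worth keeping from the sketch is the second check: the scribe asks only when the reasoning is missing, so an IC who already explained the call is never interrupted.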

The key is that the model lowers the cost of capture to almost nothing. The IC doesn’t stop to write a paragraph; they say one line into the channel — “going to secondary, replication lag is a risk but primary ETA is unknown” — and the model turns that into a structured decision-log entry with the action, the reasoning, and the rejected alternative. One sentence from a human, captured permanently, evaluable forever. That’s a trade even a busy IC will make.

Pro Tip: Prompt the model to explicitly capture the alternative that was rejected, not just the choice that was made. The rejected option is half the value of a decision log — in review, “why didn’t we just wait for the primary?” is answerable only if someone recorded that waiting was considered and consciously rejected. A decision log that only captures what you did, not what you chose against, is missing the most useful half.
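A minimal sketch of that tip, assuming the model is asked to reply in JSON: the prompt names the rejected alternative as a required field, and a small check reports which fields the responder still needs to supply. The prompt wording, the field names, and the `missing_fields` helper are all illustrative assumptions; tune them for whichever model and log format you use.

```python
import json
from typing import List

# Hypothetical prompt wording, not taken from any specific product.
SCRIBE_PROMPT = """\
You are an incident scribe. You record decisions; you never suggest them.
From the message below, extract a decision-log entry as JSON with exactly
these keys: "time", "action", "reasoning", "rejected_alternative".
Use null for anything the responder did not state. Do not guess.

Message: {message}
"""

REQUIRED_KEYS = ("time", "action", "reasoning", "rejected_alternative")


def missing_fields(model_reply: str) -> List[str]:
    """Return the required keys that are absent or empty in the model's reply."""
    entry = json.loads(model_reply)
    if not isinstance(entry, dict):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not entry.get(key)]


# Example reply in which the responder never said what was rejected.
reply = json.dumps({
    "time": "10:42",
    "action": "fail over to secondary region",
    "reasoning": "primary recovery ETA unknown; customer impact growing",
    "rejected_alternative": None,
})
print(missing_fields(reply))  # ['rejected_alternative']
```

When `rejected_alternative` comes back empty, the scribe can ask one short follow-up in the channel rather than letting the model invent an alternative nobody considered.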

The reasoning the log preserved

After we started running an AI decision scribe, the next time we faced a similar failover question, the log captured it cleanly: “11:15 — IC chose to wait 10 minutes before failover; reasoning: replication lag was actively shrinking and a clean failover was worth a short wait; rejected immediate failover because it risked data inconsistency the team would have to reconcile later.” In the postmortem, that entry let us evaluate the decision properly. We could see that the IC understood the tradeoff and made a defensible call, and that the ten-minute wait was sound given what was known.

Compare that to the earlier incident where the reasoning was lost and we spent the postmortem guessing. The decision log turned an unanswerable “why did we do that?” into a clear “here’s the reasoning, was it sound?” — which is exactly the conversation a blameless review should have. The AI captured the record; the humans made the calls and, later, evaluated them. The model never decided anything; it just made sure the human’s reasoning didn’t vanish.

The decisions stay human, always

This is worth stating directly because decision logging sits so close to decision-making: the AI records decisions, it does not make them, suggest them, or rank them. Every entry in the log is a human choice, captured by the model for the record. The temptation to let the scribe start offering “you might consider failing over now” must be resisted — the moment the model is suggesting decisions, you’ve crossed from synthesis into the responder’s job, and a model has no business making operational calls under uncertainty with real consequences. It writes down what the humans decide. It never decides.
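One cheap backstop for this rule is to check each drafted entry for advisory phrasing before it is posted. The sketch below is only that, a backstop: the phrase list and the `is_record_only` function are hypothetical, a text filter will miss things, and the primary controls remain the scribe's instructions and its lack of any ability to act.

```python
import re

# Hypothetical list of phrasings that read as advice rather than record.
# Past-tense reports of what a human said ("the DBA suggested...") are
# deliberately not matched, since those are legitimate log content.
ADVISORY = re.compile(
    r"\b(you (might|should|could|may want to)|we should|"
    r"i (recommend|suggest)|consider)\b",
    re.IGNORECASE,
)


def is_record_only(draft: str) -> bool:
    """True if the draft contains none of the advisory phrasings above."""
    return ADVISORY.search(draft) is None


drafts = (
    "10:42 IC chose to fail over to secondary; rejected waiting for primary "
    "because the wait was unbounded.",
    "You might consider failing over now.",
)
for draft in drafts:
    print(is_record_only(draft), "-", draft)
```

A draft that fails the check is better dropped than reworded automatically, since rewording a suggestion still leaves the model's opinion in the record.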

And as with every tool in the incident loop, the scribe takes no production action whatsoever. It reads the channel and writes a document. The cleanest mental model: humans make and own the decisions, the AI is the historian that ensures the reasoning behind those decisions survives. Synthesis and record-keeping for the machine; judgment and action for the people.

Pro Tip: Review your decision log in the postmortem before the timeline, not after. Leading with decisions keeps the review focused on judgment — what we chose and why — rather than drifting into a blow-by-blow of events. The timeline is context for the decisions, not the main event, and reading them in that order keeps the learning where it belongs.

From record to better judgment

A decision log turns incident review from “what happened” into “how did we decide, and how can we decide better next time” — which is the only kind of review that actually improves your responders’ judgment. The reasoning behind a failover, an escalation, a choice to wait: capture it in the moment, and your postmortems stop being archaeology and start being genuine learning about human decision-making under pressure.

The reason teams skip decision logs is the cost of capture, and that’s exactly the cost AI collapses. Let the model be the historian so your responders can stay decision-makers, and so the reasoning behind every hard call survives to teach the next one.