Writing Up Near-Misses With AI: The Incident That Almost Happened
A near-miss is a free incident — the lesson without the outage. Here's how to use AI to write them up fast, blamelessly, and catch the latent risk early.
- #postmortems
- #postmortem
- #ai
Someone on your team had a moment last week that went something like: “Wait — did I just run that against prod? … No. Okay. Phew.” And then they moved on, and nothing was written down, and the thing that made it possible to almost run a destructive command against prod is still sitting there, waiting for the next person who doesn’t catch it in time. That’s a near-miss, and it’s the cheapest incident your organization will ever have access to — you got the entire lesson without the outage, the customer impact, or the 3 a.m. page.
The problem is structural: near-misses don’t trigger your incident process, because nothing broke. There’s no Sev to declare, no postmortem to schedule, no pressure to capture anything. So the close call evaporates and the latent risk survives. Building a habit of writing near-misses up — fast, lightly, and blamelessly — is one of the highest-leverage reliability moves a team can make, and AI is well-suited to it precisely because the writeup needs to be nearly free or nobody will do it.
The one question that matters most
Most “phew, that was close” reactions skip the single most important judgment a near-miss demands: was the save robust or fragile? Did a designed guardrail do exactly its job, or did you get lucky?
These look identical in the moment — nothing broke either way — but they are opposite situations. A near-miss averted because a confirmation prompt made the engineer pause, or because a permission boundary blocked the destructive command, is reassurance: your controls worked. A near-miss averted because someone happened to glance at the right terminal at the right second, or because the timing coincidentally lined up, is a near-certain future incident: nothing actually stopped it, you just rolled the dice and won this time. Treating those the same is the core mistake, and naming which one you’re holding is the whole point of writing it down.
Write up this near-miss briefly and blamelessly.
1. One-paragraph summary: the close call, the impact that was
avoided, and what saved you.
2. SAFETY MARGIN: was the save ROBUST (a designed guardrail did
its job) or FRAGILE (luck, someone happened to look, a timing
coincidence)? This distinction is the point.
3. LATENT RISK: the underlying condition that made the near-miss
possible and is still present.
4. 1-3 lightweight follow-ups, proportional to a near-miss —
usually close the latent gap or harden a fragile save into a
reliable one. Don't over-engineer a response to something that
didn't break.
5. Note if this resembles any pattern worth watching if it recurs.
Rules: Celebrate the catch, never imply the near-miss was someone's
fault. Keep the response proportional. Mark unverified specifics
[UNVERIFIED].
What almost happened / why it didn't / how it was noticed: <paste>
What a good near-miss writeup looks like
It’s short by design. After a human confirms the specifics:
Near-miss: almost dropped a prod table during a migration test
Summary: An engineer running a migration dry-run in what they believed was staging was actually pointed at a prod replica due to a stale shell environment. They caught it when the row count in the output looked too large, and stopped before the destructive step. No data was lost.
Safety margin: FRAGILE. Nothing prevented this. The save depended entirely on the engineer noticing an anomalous row count and connecting it to “I might be on the wrong database.” A less experienced engineer, or the same one while more tired, would not have caught it. We got lucky.
Latent risk: there is no visual or enforced distinction between a staging and a prod database connection in our tooling, and destructive migration commands have no environment confirmation. The stale-environment failure mode is fully reproducible.
Follow-ups (proportional):
- Add a prod-environment confirmation prompt to the migration tool’s destructive operations. (Hardens the fragile save into a real guardrail.)
- Make the shell prompt show the active environment in a distinct color for prod.
Notice the writeup didn’t spawn a full postmortem’s worth of process. Two targeted follow-ups, both aimed at converting that fragile, luck-based save into something that actually stops the next person. That’s the right size.
Keep it proportional or it dies
The fastest way to kill near-miss reporting is to make it expensive. The instant that writing up a close call means scheduling a review, filling a six-section template, and assigning five action items, people stop reporting them — they’ll just quietly think “phew” and move on, and you’re back to the latent risk surviving in silence. The prompt is deliberately instructed to keep the response proportional: a near-miss does not warrant a real incident’s machinery, and over-engineering a response to something that didn’t actually break both wastes effort and teaches everyone that reporting a close call is a bureaucratic punishment. The whole value of near-misses is that they’re cheap; a heavy process throws that away.
Blameless matters even more here
In a normal postmortem, blameless framing protects honest analysis. In a near-miss, it protects the reporting itself — and that’s even more fragile, because there’s no incident forcing the writeup. If reporting “I almost dropped a prod table” feels like confessing a mistake that goes on your record, nobody will ever do it, and you lose the cheapest learning channel you have. A near-miss is a catch to celebrate, not a near-failure to pin on whoever almost caused it. The engineer who noticed the row count and stopped did exactly the right thing; the writeup should read as “look what our system let us nearly do,” not “look what this person nearly did.” The prompt keeps that framing, because the moment it slips, the reports stop.
The human decides what to log
The model produces a fast, proportional, blameless draft. A person decides whether it’s worth logging at all, confirms the specifics, and makes the fragile-vs-robust call the model proposed — because that judgment occasionally needs context the model doesn’t have. The point isn’t to formalize every “phew.” It’s to make capturing the ones that matter nearly free, so the latent risks get caught while they’re still free lessons instead of future outages.
The near-miss prompt is in the prompts library, and it pairs naturally with pre-mortem failure-mode brainstorming — pre-mortems imagine the failures before they happen, near-miss writeups capture the ones that almost did. Both are about learning from incidents that never made it to the incident channel.
The cheapest incident is the one that almost happened. Write it down before luck runs out.
Download the Free 500-Prompt DevOps AI Toolkit
500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.
- 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
- Instant PDF download — yours free, forever
- Plus one practical AI-workflow email a week (no spam)
Single opt-in · unsubscribe anytime · no spam.