Learning From Near-Misses Before They Become Outages

Nearly every outage is preceded by a string of near-misses: the disk that hit 94% just before a cleanup job happened to run, the bad config that was caught in code review by luck, the cascading failure that stopped one hop short of customer impact because a timeout happened to be set right. Most teams file these under “we got away with it” and move on. That’s a mistake. A near-miss is an outage that gave you the lesson for free, without the customer-impact bill.

Industries where mistakes kill people (aviation, medicine, nuclear power) built much of their safety record on systematically collecting and analyzing near-misses. Software has been slow to copy them, mostly because we tend to investigate only the things that actually broke. Here’s how to mine the failures that almost happened, which are far more numerous and just as instructive.

Why near-misses are the better teacher

A near-miss has a property a real incident doesn’t: nobody got hurt, so nobody’s defensive. The pressure, blame, and post-outage exhaustion that make real retrospectives hard are absent. You can investigate calmly, learn cleanly, and fix the latent problem before it costs anything.

Near-misses are also more frequent than incidents — often by an order of magnitude. If outages are the visible tip, near-misses are the mass of ice beneath. A team that only learns from outages is learning from the smallest, most expensive sample available. The near-misses are a bigger, cheaper dataset sitting right there.

What actually counts as a near-miss

Define it broadly so people report generously. A near-miss is any event where harm was narrowly avoided, especially by luck rather than design:

A resource (disk, memory, connection pool, queue) that came dangerously close to exhaustion and recovered.
A bad change caught late — in staging, in review, by a canary — that would have caused an incident in production.
A dependency failure that didn’t propagate only because of a fallback nobody knew was load-bearing.
A page that fired, was investigated, and turned out to be one config value away from a real outage.
“I had a bad feeling and checked, and it was about to break.”

The unifying thread: it didn’t become an incident, but the conditions for one were present. If the only thing standing between you and an outage was luck, that’s a near-miss worth capturing.
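For the resource case, the definition can be turned into a rough check over utilization samples. This is a minimal sketch, assuming utilization is recorded as a fraction between 0 and 1; the function name `is_resource_near_miss` and the 0.90 danger threshold are hypothetical and should be tuned per resource.

```python
from typing import Sequence


def is_resource_near_miss(
    utilization: Sequence[float],
    danger: float = 0.90,     # assumed "dangerously close" threshold
    exhausted: float = 1.0,   # at or above this, it was an incident, not a near-miss
) -> bool:
    """Return True if the resource got dangerously full, never ran out, and recovered."""
    if not utilization:
        return False
    peak = max(utilization)
    # "Recovered" here means the latest sample is back under the danger threshold.
    recovered = utilization[-1] < danger
    return danger <= peak < exhausted and recovered


# The disk that hit 94% and was then cleaned up:
disk_samples = [0.71, 0.88, 0.94, 0.62]
print(is_resource_near_miss(disk_samples))  # True
```

A check like this only finds candidates. Whether luck or design caused the recovery still takes a human to answer.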

The hard part: making reporting safe and easy

You cannot harvest near-misses if people don’t report them, and people won’t report them if it’s a hassle or if it makes them look careless. Two barriers to remove:

The blame barrier. If reporting “I almost took down prod” gets someone scrutinized, nobody reports. Near-miss reporting has to be explicitly blameless and, ideally, celebrated — the person who surfaces a latent failure did the team a favor. Praise it out loud.
The friction barrier. A near-miss report should take two minutes, not thirty. A simple form, a Slack command, a quick channel post. If it requires a formal document, you’ll get nothing. Capture first; investigate the ones that matter later.

A lightweight capture template:

What almost happened: [the outcome you narrowly avoided]
What saved us: [luck? a fallback? someone catching it? what exactly?]
How close was it: [minutes? one config value? one retry?]
Would it have been customer-impacting: [yes/no/maybe + why]
Latent issue exposed: [the underlying gap]
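If reports arrive through a form or a chat command, the template maps onto a small record type. The sketch below assumes Python; the class `NearMissReport`, its field names, and the `to_post` helper are hypothetical and simply mirror the five template fields.

```python
from dataclasses import dataclass
from enum import Enum


class CustomerImpact(Enum):
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


@dataclass
class NearMissReport:
    what_almost_happened: str
    what_saved_us: str
    how_close: str
    customer_impact: CustomerImpact
    impact_reason: str
    latent_issue: str

    def to_post(self) -> str:
        """Render the report in the same shape as the capture template."""
        return "\n".join([
            f"What almost happened: {self.what_almost_happened}",
            f"What saved us: {self.what_saved_us}",
            f"How close was it: {self.how_close}",
            "Would it have been customer-impacting: "
            f"{self.customer_impact.value} ({self.impact_reason})",
            f"Latent issue exposed: {self.latent_issue}",
        ])


# Illustrative report; the details are invented for the example.
report = NearMissReport(
    what_almost_happened="Log disk filled up on the primary database host",
    what_saved_us="A cleanup job happened to run at 94% usage",
    how_close="Roughly minutes of headroom",
    customer_impact=CustomerImpact.YES,
    impact_reason="writes would have failed",
    latent_issue="No alert on disk growth rate",
)
print(report.to_post())
```

Keeping the fields as free text preserves the two-minute goal. Structure can be added later for the reports that get investigated.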

Triage: not every near-miss deserves a full investigation

You’ll collect more near-misses than you can investigate, and that’s fine. Triage with two questions: how likely is this to recur, and how bad would it have been if it had become an incident? Near-misses that score high on both get a real look, often a lightweight version of an incident retro. The rest get logged so you can spot patterns.
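The two triage questions can be approximated with a simple score. This is a sketch under stated assumptions, not a standard method: the 1 to 3 scales, the `TriageInput` type, the `triage` function, and the cutoff of 6 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageInput:
    title: str
    recurrence_likelihood: int  # 1 = unlikely to recur, 3 = likely to recur
    potential_impact: int       # 1 = minor, 3 = severe


def triage(item: TriageInput, investigate_at: int = 6) -> str:
    """Return "investigate" for high-likelihood, high-impact items, else "log"."""
    for value in (item.recurrence_likelihood, item.potential_impact):
        if value not in (1, 2, 3):
            raise ValueError("scores must be 1, 2, or 3")
    score = item.recurrence_likelihood * item.potential_impact
    return "investigate" if score >= investigate_at else "log"


items = [
    TriageInput("Connection pool at 98% during deploy", 3, 3),
    TriageInput("Typo caught in review of a dashboard label", 2, 1),
]
for item in items:
    print(f"{triage(item):12} {item.title}")
```

With a cutoff of 6, an item needs a 3 on one question and at least a 2 on the other to be investigated. Everything else is logged for pattern-spotting.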

Patterns are where the real value hides. Three near-misses about connection-pool exhaustion in three different services aren’t three small notes; they’re one systemic capacity problem shouting at you before it becomes a SEV1. The aggregate signal is often louder than any single near-miss.
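Spotting that kind of pattern can be as simple as grouping logged near-misses by theme and counting the distinct services involved. A minimal sketch, assuming each logged near-miss carries a service name and a free-form theme tag; the function `recurring_themes` and the `min_services` cutoff are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def recurring_themes(
    reports: Iterable[Tuple[str, str]],  # (service, theme) pairs
    min_services: int = 2,
) -> Dict[str, List[str]]:
    """Return themes reported by at least `min_services` distinct services."""
    services_by_theme = defaultdict(set)
    for service, theme in reports:
        services_by_theme[theme].add(service)
    return {
        theme: sorted(services)
        for theme, services in services_by_theme.items()
        if len(services) >= min_services
    }


# Invented log entries for illustration.
logged = [
    ("checkout", "connection-pool exhaustion"),
    ("search", "connection-pool exhaustion"),
    ("billing", "connection-pool exhaustion"),
    ("search", "disk nearly full"),
]
print(recurring_themes(logged))
# {'connection-pool exhaustion': ['billing', 'checkout', 'search']}
```

The grouping is only as good as the tags, so a short, shared list of themes tends to work better than free text.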

Wiring near-misses into your existing process

You don’t need a separate program; you need a few hooks into what you already do:

In retrospectives, ask not just “what happened” but “what near-misses preceded this that we ignored?” Most outages have a paper trail of dismissed warnings.
In your metrics review, track near-miss volume and themes alongside incident metrics. A rising near-miss count in one area can be an early warning that an incident is coming, though it can also mean reporting has improved, so read the count together with the themes.
In gamedays, treat a near-miss as a finding. If a chaos experiment almost cascaded, that’s the same free lesson — capture it.
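For the metrics-review hook, a period-over-period comparison of near-miss counts per area is enough to start. The sketch below assumes each near-miss is tagged with one area name; `rising_areas` and the `min_increase` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def rising_areas(
    previous: Iterable[str],
    current: Iterable[str],
    min_increase: int = 2,
) -> Dict[str, Tuple[int, int]]:
    """Map each area whose count rose by at least `min_increase` to (before, after)."""
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    return {
        area: (prev_counts[area], count)  # Counter returns 0 for unseen areas
        for area, count in curr_counts.items()
        if count - prev_counts[area] >= min_increase
    }


# Area tags from two consecutive review periods (invented data).
last_month = ["database", "queue"]
this_month = ["database", "database", "database", "queue"]
print(rising_areas(last_month, this_month))  # {'database': (1, 3)}
```

An area that shows up here is a prompt for a conversation, not a verdict. The count alone does not say whether risk rose or reporting got easier.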

The cultural payoff

There’s a second-order benefit beyond the specific fixes. A team that openly reports and discusses near-misses is a team where people feel safe surfacing problems early — which is exactly the culture that catches the next problem before it lands. The act of harvesting near-misses builds the psychological safety that prevents incidents in the first place. It’s a flywheel: report freely, fix latent issues, fewer outages, more trust, report more freely.

The teams I’ve seen with the best reliability records weren’t the ones that happened to have the fewest incidents. They were the ones that treated every “phew, that was close” as a gift and spent it before it expired.

Start this week

You don’t need a platform. Open a channel, post the capture template, and tell the team that “I almost broke prod and here’s what saved us” is a story you want to hear, not one to hide. Investigate the scary ones, log the rest, and watch for patterns. The lessons are already happening around you — near-miss harvesting is just the discipline of writing them down before the customer pays for them.

We keep near-miss capture and pattern-tracking templates in our incident-response toolkit — because the cheapest incident to fix is the one that hasn’t happened yet.

Near-miss triage and prioritization are judgment calls. Calibrate what’s worth investigating against your own systems, risk tolerance, and history.