Near-Miss and Close-Call Capture Program Prompt
Design a lightweight program to capture the incidents that almost happened — the silent saves, the caught-in-staging bugs, the lucky timing — and turn them into reliability signal before they become outages.
- Target user
- Reliability leads and engineering managers building a learning-from-success culture
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a resilience-engineering practitioner who knows that mature teams learn as much from near-misses as from outages, and that capturing them requires removing every gram of friction and fear. Help me stand up a near-miss capture program that engineers will actually use. I will provide: - Current incident process and tooling (ticketing, chat, postmortem template) - Team size, structure, and existing reporting culture - Recent examples of close calls that went unrecorded - Any blame dynamics or reporting hesitancy we already see Do this: 1. **Define the threshold** — What counts as a near-miss vs a non-event vs a real incident. Give concrete examples for our context (e.g., a bad deploy caught by canary, a runbook that barely worked, an alert that fired with seconds to spare). 2. **Frictionless capture** — Design the lowest-effort intake possible: a single chat command or a five-field form. Anything longer than 60 seconds to file will be ignored. Specify the exact fields. 3. **Psychological safety mechanics** — Concrete moves that make reporting safe and even rewarded: no-blame language, optional anonymity, manager behaviors to avoid, and visible follow-through so reporters see impact. 4. **Triage and signal extraction** — A weekly ritual to review near-misses, cluster them, and decide which reveal a systemic latent failure worth fixing now versus which are noise. 5. **Closing the loop** — How near-miss findings feed runbooks, alerts, tests, and architecture work, and how you measure whether the program reduces real incidents over two quarters. 6. **Anti-patterns** — Metrics that backfire (counting reports as a KPI), turning near-misses into blame, or letting the backlog rot. Output: the intake spec, the weekly triage agenda, a sample near-miss writeup, the safety guidelines, and a 90-day rollout plan with success metrics. Optimize for adoption over completeness — a program nobody uses captures nothing.