AI for Incident Response

Near-Miss and Close-Call Capture Program Prompt

Design a lightweight program to capture the incidents that almost happened — the silent saves, the caught-in-staging bugs, the lucky timing — and turn them into reliability signal before they become outages.

Target user: Reliability leads and engineering managers building a learning-from-success culture
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a resilience-engineering practitioner who knows that mature teams learn as much from near-misses as from outages, and that capturing them requires removing every gram of friction and fear. Help me stand up a near-miss capture program that engineers will actually use.

I will provide:
- Current incident process and tooling (ticketing, chat, postmortem template)
- Team size, structure, and existing reporting culture
- Recent examples of close calls that went unrecorded
- Any blame dynamics or reporting hesitancy we already see

Do this:

1. **Define the threshold** — What counts as a near-miss vs a non-event vs a real incident. Give concrete examples for our context (e.g., a bad deploy caught by canary, a runbook that barely worked, an alert that fired with seconds to spare).

2. **Frictionless capture** — Design the lowest-effort intake possible: a single chat command or a five-field form. Any report that takes longer than 60 seconds to file will not get filed. Specify the exact fields.

3. **Psychological safety mechanics** — Concrete moves that make reporting safe and even rewarded: no-blame language, optional anonymity, manager behaviors to avoid, and visible follow-through so reporters see impact.

4. **Triage and signal extraction** — A weekly ritual to review near-misses, cluster them, and decide which reveal a systemic latent failure worth fixing now versus which are noise.

5. **Closing the loop** — How near-miss findings feed runbooks, alerts, tests, and architecture work, and how you measure whether the program reduces real incidents over two quarters.

6. **Anti-patterns** — Name the failure modes to avoid: metrics that backfire (such as counting reports as a KPI), near-misses turned into blame, and a backlog left to rot.

Output: the intake spec, the weekly triage agenda, a sample near-miss writeup, the safety guidelines, and a 90-day rollout plan with success metrics.

Optimize for adoption over completeness — a program nobody uses captures nothing.
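
To make the intake and triage steps more concrete, here is a minimal Python sketch of what a five-field near-miss record and a weekly "which systems keep showing up" check might look like. It is not part of the prompt and not the output the model will necessarily produce. The field names (`what_happened`, `system`, `caught_by`, `potential_impact`, `reporter`), the `recurring_systems` helper, and the sample reports are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NearMiss:
    """Hypothetical five-field intake record; field names are assumptions."""
    what_happened: str               # one or two sentences, free text
    system: str                      # service or component involved
    caught_by: str                   # e.g. "canary", "on-call engineer", "luck"
    potential_impact: str            # the outage that did not happen
    reporter: Optional[str] = None   # None means the report is anonymous


def recurring_systems(
    reports: List[NearMiss], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (system, count) pairs for systems with at least min_count reports.

    System names are normalized (trimmed, lowercased) so that "Payments-API"
    and "payments-api" land in the same cluster.
    """
    counts = Counter(r.system.strip().lower() for r in reports)
    return [(system, n) for system, n in counts.most_common() if n >= min_count]


# Illustrative sample data, not real reports.
reports = [
    NearMiss("Bad deploy rolled back by canary", "payments-api",
             "canary", "checkout outage"),
    NearMiss("Failover runbook step was out of date", "Payments-API",
             "on-call engineer", "extended downtime", reporter="sam"),
    NearMiss("Disk alert fired with seconds to spare", "log-store",
             "alert", "lost logs"),
]

print(recurring_systems(reports))  # [('payments-api', 2)]
```

Clustering by system is only one possible cut. The same counting approach could group by `caught_by` to show which safeguards are doing the saving. Note that the count is used here to find clusters worth discussing in triage, not as a reporting KPI, which the prompt lists as an anti-pattern.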
