Building Incident Runbooks Engineers Actually Trust at 3 AM
Most runbooks rot or get ignored mid-incident. Here's how to write runbooks that hold up under pressure, keep them current, and use AI to draft and audit them.
- #incident-response
- #runbooks
- #sre
- #on-call
- #automation
- #documentation
A runbook is a promise you make to your future self at 3 AM, when you’re half-awake and the part of your brain that knows this system is offline. Most runbooks break that promise. They’re either out of date, written for someone who already knows the answer, or buried in a wiki nobody can find while the pager is screaming.
After 25 years of being the person reading runbooks at 3 AM — and writing the ones other people cursed at — here’s what actually makes them hold up.
Write for the tired stranger, not the author
The single most common runbook failure: it’s written by the person who built the system, for the person who built the system. It says “restart the ingest service” and assumes you know which of the four ingest services, on which host, and how to tell if it worked.
Write every step for someone who has never seen this system, is exhausted, and is under pressure. That means:
- Exact commands, copy-pasteable, with the real service and host names.
- Expected output, so they know whether it worked.
- What to do if it didn’t work — the branch, not a dead end.
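One way to hold yourself to those three bullets is to make each step a small record that cannot be rendered without all three parts. This is a minimal sketch, not a standard format: the `Step` fields, the `ingest-orders.service` unit name, and the escalation wording are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One runbook step written for someone who has never seen the system."""
    command: str  # exact, copy-pasteable command
    expect: str   # what the output looks like when it worked
    if_not: str   # the branch to take when it did not work


def missing_fields(step: Step) -> list[str]:
    """Return the names of the fields the author left blank."""
    return [name for name in ("command", "expect", "if_not")
            if not getattr(step, name).strip()]


def render(step: Step) -> str:
    """Format a step as plain text for a runbook page."""
    return (
        f"Run:     {step.command}\n"
        f"Expect:  {step.expect}\n"
        f"If not:  {step.if_not}"
    )


# Hypothetical values for illustration only.
step = Step(
    command="systemctl status ingest-orders.service",
    expect="a line containing 'Active: active (running)'",
    if_not="open the 'Ingest service will not start' runbook and page the ingest on-call",
)
print(render(step))
```

A step that comes back from `missing_fields` with anything in the list is a step the tired stranger will get stuck on.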
The structure that survives pressure
Each runbook should answer five questions fast:
1. When does this apply?
Lead with the trigger: the exact alert name or symptom. “Use this when CheckoutLatencyHigh fires” lets someone match their alert to the right page in two seconds.
2. How bad is it / how do I confirm?
The read-only diagnostic commands that confirm you’re in the right runbook. Confirmation before action — pulling the wrong lever is how a SEV3 becomes a SEV1.
3. How do I mitigate?
The fastest safe action to stop customer pain, clearly separated from root-cause fixing. At 3 AM you mitigate first and diagnose later.
4. How do I fix it?
The actual remediation steps, ordered, with verification after each.
5. Who do I escalate to?
Name the role, and how to reach them, for when the runbook runs out. A runbook that ends with “if this doesn’t work, good luck” is incomplete.
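The five questions are easy to check mechanically. The sketch below assumes a runbook is stored as a mapping from section name to text; the section keys, the `checkout` namespace, and the example contents are my own placeholders, not a fixed schema.

```python
# Section keys are an assumed naming scheme, one per question above.
REQUIRED_SECTIONS = ("trigger", "confirm", "mitigate", "fix", "escalate")


def missing_sections(runbook: dict[str, str]) -> list[str]:
    """Return the required sections that are absent or blank, in reading order."""
    return [name for name in REQUIRED_SECTIONS
            if not runbook.get(name, "").strip()]


# Hypothetical draft: the author stopped after the mitigation.
draft = {
    "trigger": "Use this when CheckoutLatencyHigh fires",
    "confirm": "kubectl get pods -n checkout",
    "mitigate": "kubectl rollout undo deployment/checkout -n checkout",
    "fix": "",
}
print(missing_sections(draft))  # prints ['fix', 'escalate']
```

Running a check like this in review catches the runbook that quietly ends at "good luck".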
Keep diagnostics and destructive actions visually separate
Mark every command’s blast radius. I literally label them in the runbook:
- [SAFE] read-only: kubectl get, journalctl, promtool query
- [CAUTION] small change or shell-in: kubectl exec, config edits
- [DESTRUCTIVE] restarts, deletes, scaling, failovers, migrations
The tired brain reaches for the nearest command. If the destructive ones are visually flagged and placed after the diagnostics, you build in a pause exactly where you need it.
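The labeling can be sketched as a small prefix table. Treat the table as an assumption: it follows the labels above, it only looks at how a command starts, and a real one needs per-flag care (some journalctl options, for example, do modify the journal). Anything the table does not recognize defaults to DESTRUCTIVE so it gets a second look.

```python
from enum import IntEnum


class Risk(IntEnum):
    SAFE = 0
    CAUTION = 1
    DESTRUCTIVE = 2


# Illustrative prefix table, not an exhaustive or authoritative list.
RULES = (
    ("kubectl get", Risk.SAFE),
    ("journalctl", Risk.SAFE),
    ("promtool query", Risk.SAFE),
    ("kubectl exec", Risk.CAUTION),
    ("kubectl delete", Risk.DESTRUCTIVE),
    ("kubectl scale", Risk.DESTRUCTIVE),
    ("kubectl rollout restart", Risk.DESTRUCTIVE),
)


def classify(command: str) -> Risk:
    """Match on whole leading words; unknown commands are treated as DESTRUCTIVE."""
    normalized = " ".join(command.split())
    for prefix, risk in RULES:
        if normalized == prefix or normalized.startswith(prefix + " "):
            return risk
    return Risk.DESTRUCTIVE


def label(command: str) -> str:
    """Prefix a command with its blast-radius tag."""
    return f"[{classify(command).name}] {command}"


def safest_first(commands: list[str]) -> list[str]:
    """Order commands by risk; the sort is stable, so ties keep their order."""
    return sorted(commands, key=classify)


# Hypothetical pod and namespace names.
for cmd in safest_first([
    "kubectl delete pod checkout-7f9c -n checkout",
    "kubectl get pods -n checkout",
    "kubectl exec checkout-7f9c -n checkout -- ls /tmp",
]):
    print(label(cmd))
```

Defaulting the unknown case to the most dangerous label is deliberate: a false alarm costs a few seconds, a missed flag can cost the incident.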
Fight rot with ownership and execution
Runbooks rot the moment the system changes. Two habits keep them honest:
Assign an owner per runbook. Not a team — a name. Ownership rotates, but at any moment one person is accountable for it being correct.
Execute them on purpose. During gamedays, or even a quiet on-call shift, actually run the runbook step by step. Every stale command, renamed service, and missing permission surfaces immediately. A runbook that’s never executed in calm conditions will fail in a crisis.
I also add a “last verified” date to the top of every runbook. If it’s older than a quarter, treat it as suspect.
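The date check is simple enough to automate. This sketch assumes a header line of the form `Last verified: YYYY-MM-DD` and approximates "a quarter" as 90 days; both are my choices for illustration, so adjust them to whatever your runbooks actually use.

```python
from datetime import date, timedelta

# Assumption: "a quarter" approximated as 90 days.
MAX_AGE = timedelta(days=90)


def parse_last_verified(header_line: str) -> date:
    """Parse a header such as 'Last verified: 2024-01-15' (hypothetical format)."""
    _, _, value = header_line.partition(":")
    return date.fromisoformat(value.strip())


def is_suspect(last_verified: date, today: date) -> bool:
    """True when the runbook has gone longer than MAX_AGE without verification."""
    return today - last_verified > MAX_AGE


verified = parse_last_verified("Last verified: 2024-01-15")
print(is_suspect(verified, today=date(2024, 6, 1)))  # prints True
```

Passing `today` in explicitly keeps the check easy to test; in a scheduled job you would pass `date.today()`.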
Where AI speeds this up
Writing runbooks is tedious, which is why they don’t get written. AI removes most of that friction.
Drafting from a resolved incident. Right after an incident, paste the command history into a model and ask:
“Turn this incident’s diagnostic and remediation commands into a runbook. Group into Confirm, Mitigate, and Fix sections. Label each command SAFE, CAUTION, or DESTRUCTIVE. Note expected output for the diagnostic commands.”
You capture the runbook while the knowledge is fresh, instead of promising to write it later and never doing it.
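If you do this after every incident, it helps to assemble the prompt the same way each time. Here is a minimal sketch that only builds the text; it calls no model and no API, and the example history is hypothetical.

```python
PROMPT = (
    "Turn this incident's diagnostic and remediation commands into a runbook. "
    "Group into Confirm, Mitigate, and Fix sections. "
    "Label each command SAFE, CAUTION, or DESTRUCTIVE. "
    "Note expected output for the diagnostic commands."
)


def build_prompt(history: list[str]) -> str:
    """Append the non-blank commands, numbered in the order they were run."""
    commands = [line.strip() for line in history if line.strip()]
    numbered = "\n".join(f"{n}. {cmd}" for n, cmd in enumerate(commands, start=1))
    return f"{PROMPT}\n\nCommand history, in the order it was run:\n{numbered}"


# Hypothetical shell history from a resolved incident.
print(build_prompt([
    "kubectl get pods -n checkout",
    "",
    "kubectl rollout restart deployment/checkout -n checkout",
]))
```

Read the history before you paste it anywhere: shell history tends to pick up tokens and hostnames you may not want to share.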
Auditing for gaps. Paste an existing runbook and ask the model to find missing rollback steps, commands without verification, or destructive actions that aren’t flagged. It’s a fast second set of eyes.
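Some of those gaps are mechanical enough to check with a script alongside the model's review. The sketch below assumes each step is already structured with a label, a verification note, and a rollback note; the `RunbookStep` shape and the example steps are hypothetical.

```python
from dataclasses import dataclass

LABELS = {"SAFE", "CAUTION", "DESTRUCTIVE"}


@dataclass
class RunbookStep:
    command: str
    label: str = ""     # SAFE, CAUTION, or DESTRUCTIVE
    verify: str = ""    # how to tell the step worked
    rollback: str = ""  # how to undo it, if it can be undone


def audit(steps: list[RunbookStep]) -> list[str]:
    """Return one finding per gap, numbered by step position."""
    findings = []
    for n, step in enumerate(steps, start=1):
        if step.label not in LABELS:
            findings.append(f"step {n}: missing blast-radius label")
        if not step.verify.strip():
            findings.append(f"step {n}: no verification")
        if step.label == "DESTRUCTIVE" and not step.rollback.strip():
            findings.append(f"step {n}: destructive action has no rollback")
    return findings


# Hypothetical steps for illustration.
steps = [
    RunbookStep("kubectl get pods -n checkout", "SAFE", verify="pods are listed"),
    RunbookStep("kubectl delete pod checkout-7f9c -n checkout", "DESTRUCTIVE",
                verify="a replacement pod reaches Running"),
    RunbookStep("kubectl exec checkout-7f9c -n checkout -- ls /tmp"),
]
for finding in audit(steps):
    print(finding)
```

A script like this only catches what is structurally missing. Whether the rollback is the right one is still a question for a person or a careful model review.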
Filling the verification gaps. Ask it what the expected output of each diagnostic command should look like, then sanity-check against your real system.
One firm rule: AI drafts, humans verify against the real system. A model will confidently invent a flag or a metric name that doesn’t exist in your environment. Run every generated command in a safe context before it goes in the book.
We keep incident-response prompts for runbook drafting, and the Incident Response tool produces risk-classified, safest-first command plans you can lift straight into a runbook.
The minimum that beats nothing
You don’t need a polished runbook library to start. For your top three alerts, write down: the trigger, two confirmation commands, one mitigation, and who to escalate to. That single page, findable and current, will save a future you who is too tired to think clearly. Build from there.
Generated runbook steps are assistive. Verify every command in your own environment before relying on it during an incident.