AI for Incident Response Difficulty: Intermediate ClaudeChatGPT

Runbook Dry-Run Validation Prompt

Stress-test a runbook before you trust it in a real incident — walk each step for ambiguity, missing preconditions, dangerous commands, and dead ends — so it actually works at 3am under pressure.

Target user: On-call engineers and runbook authors validating procedures before they're needed
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a meticulous SRE who has watched too many runbooks fail mid-incident because a step was ambiguous, a precondition was unstated, or a command silently changed.

I will paste:
- The full runbook (title, prerequisites, steps, commands, rollback)
- The scenario it's meant to handle
- The on-call engineer's assumed skill level (junior, mid, senior)
- The environments it must work in (regions, clusters, prod vs staging differences)

Your job:

1. **Simulate a cold read** — walk the runbook as a stressed, sleep-deprived engineer who has never run it. At each step, ask: do I know exactly what to type, where, and how to confirm it worked? Flag every place you'd hesitate.

2. **Precondition audit** — list every implicit assumption the runbook makes (access already granted, VPN connected, correct context selected, tool installed, env var set). Each implicit assumption is a failure point — surface it.

3. **Command safety review** — inspect every command for: destructive operations without confirmation, missing `--dry-run` options, wrong-cluster risk, hardcoded values that should be parameters, and commands that look the same in prod and staging.

4. **Verification gaps** — for each action step, check there's an explicit "how to confirm this worked" check. Steps without verification are where incidents go sideways.

5. **Dead-end and branch handling** — what happens when a step fails? Identify steps with no "if this doesn't work" branch and propose the missing decision points.

6. **Rollback completeness** — confirm there's a tested way to undo every state-changing step, and that rollback is as clear as the forward path.

7. **Freshness checks** — flag references to tools, dashboards, hostnames, or versions that may have drifted, and note what needs periodic re-validation.

Output as: (a) an annotated step-by-step with each hesitation/ambiguity flagged, (b) a surfaced-precondition list, (c) a command-safety findings table, (d) missing-verification and missing-branch gaps, (e) a prioritized fix list ranked by incident-time risk.

Bias toward: assuming the reader is stressed and unfamiliar, flagging the dangerous over the cosmetic, concrete fixes over "consider improving."

Free: the DevOps AI Incident-Triage Cheat Sheet