
Runbook-to-Automation Toil Reduction Prompt

Turn a manual on-call runbook into safe, progressively-automated remediation — identifying which steps to auto-run, which to keep human-gated, and how to ship self-healing without building a system that confidently breaks production.

Target user: SREs reducing on-call toil through automated remediation
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE who automates toil without building auto-remediation that turns a small incident into a large one. Help me convert a manual runbook into graded automation.

I will provide:
- The runbook (the manual steps on-call follows today)
- How often this is triggered and how much time it costs
- The blast radius of each step if it goes wrong
- Our tooling for running actions (CI, operators, scripts, remediation platform)

Your job:

1. **Decompose the runbook into atomic steps** — for each: what it does, what it reads to decide, the action it takes, and the worst-case outcome if the action is wrong.

2. **Score each step for automation readiness** — on two axes: how deterministic the decision is, and how reversible and low-blast the action is. Only steps that score well on both axes (deterministic decision, low-blast action) are safe to fully automate; say so explicitly per step.

3. **Pick the automation tier per step** — (a) auto-run silently, (b) auto-run but notify, (c) propose-and-require-human-approval, (d) keep fully manual. When unsure, default to the more conservative (less automated) tier; over-automation is how you get 3am auto-remediation loops.

4. **Design the guardrails** — every automated action needs: preconditions/health checks before acting, a blast-radius limit (rate limit, max-N, one-AZ-at-a-time), an automatic rollback/abort, and a circuit breaker that stops the automation after repeated failures.

5. **Make it observable and auditable** — log what the automation decided, why, and what it did; emit a record into the incident timeline; alert a human when automation acts or gives up.

6. **Plan the rollout** — start in "propose only / dry-run" mode, measure how often its recommendations match what on-call humans actually did, then graduate only the safe steps to auto-run.

Output: (a) a per-step decomposition table with worst-case outcomes, (b) an automation-readiness score per step, (c) the chosen tier per step with rationale, (d) the guardrail spec (preconditions, limits, rollback, circuit breaker), (e) a dry-run-first rollout plan.

Bias toward: human-gating anything irreversible, dry-run before auto-run, and circuit breakers so failing automation stops instead of looping.
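
To sanity-check what the model returns, it can help to have a rough picture of the tiering and circuit-breaker logic the prompt asks for. The following is a minimal Python sketch, not a reference implementation: the 1-to-5 scoring scale, the thresholds, and every name (`Step`, `Tier`, `pick_tier`, `CircuitBreaker`, `run_step`) are assumptions made for illustration, and "repeated failures" is interpreted here as consecutive failures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Tier(Enum):
    AUTO_SILENT = "a"   # auto-run silently
    AUTO_NOTIFY = "b"   # auto-run but notify
    PROPOSE = "c"       # propose and require human approval
    MANUAL = "d"        # keep fully manual


@dataclass
class Step:
    name: str
    deterministic: int  # hypothetical scale: 1 = judgment call, 5 = fully rule-based
    reversible: int     # hypothetical scale: 1 = irreversible/wide blast, 5 = trivially undone


def pick_tier(step: Step) -> Tier:
    """Tier is driven by the weaker axis, so one low score blocks automation."""
    weakest = min(step.deterministic, step.reversible)
    if weakest >= 5:
        return Tier.AUTO_SILENT
    if weakest >= 4:
        return Tier.AUTO_NOTIFY
    if weakest >= 2:
        return Tier.PROPOSE
    return Tier.MANUAL


class CircuitBreaker:
    """Opens after max_failures consecutive failures (an assumed policy)."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    @property
    def open(self) -> bool:
        return self.consecutive_failures >= self.max_failures

    def record(self, success: bool) -> None:
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


def run_step(step: Step, action: Callable[[], bool],
             breaker: CircuitBreaker, dry_run: bool = True) -> str:
    """Run an action only if its tier allows it and the breaker is closed."""
    tier = pick_tier(step)
    if tier in (Tier.PROPOSE, Tier.MANUAL):
        return f"needs human: {step.name} (tier {tier.value})"
    if breaker.open:
        return f"circuit open: {step.name} skipped, page a human"
    if dry_run:
        # Dry-run never calls the action; it only reports what would happen.
        return f"dry-run: would run {step.name} (tier {tier.value})"
    try:
        ok = action()
    except Exception:
        ok = False  # an exception counts as a failed remediation
    breaker.record(ok)
    return f"{'ok' if ok else 'failed'}: {step.name}"
```

Taking the minimum of the two scores is one way to encode the prompt's rule that only steps that are both deterministic and low-blast qualify for auto-run. Defaulting `dry_run` to `True` mirrors the dry-run-first rollout. A real system would also need the preconditions, blast-radius limits, rollback, and audit logging that the prompt lists, all of which this sketch omits.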
