Guarded Runbook Execution: AI Drafts, Humans Approve

We had the runbook open. We knew the fix — scale the read replicas, fail traffic to the healthy region, drain the bad pool. And we still lost six minutes, because the runbook said “scale the replica group” and the actual command needed the right resource name, the right region, and a confidence none of us quite had at 3 a.m. that we were about to run it against the correct cluster. Remediation time isn’t only “do we know the fix.” It’s the keystroke-and-recall tax of composing exact, correct, safe commands under pressure — and that tax is recoverable.

The tempting answer is auto-remediation: let the system run the runbook. The right answer for most teams is narrower and safer: let AI draft each command, perfectly filled in, behind a human approval gate. You get the speed of not composing from memory and you never hand the model the keys.

The danger isn’t the runbook, it’s the assumptions

Fully automated runbook execution fails for a specific reason: the runbook encodes assumptions about the world, and the world may not match. The resource was already deleted by a previous attempt. The deploy already rolled back. The namespace is different in this region. A runbook that blindly runs “delete and recreate” against a world that’s already half-fixed can turn one incident into two. This is the same propose-then-verify discipline that the rest of the MTTR category leans on — speed with a check, not speed instead of a check.

So the unit of safety isn’t the plan, it’s the step — and every mutating step needs a read-only pre-flight that confirms the assumption still holds before you act.

Ask for one guarded step at a time

The framing forces the gate. You don’t ask for the whole runbook translated; you ask for the next step, with a check in front of it.

You are pairing with me on remediation. For each runbook step, in order, give me: (1) a read-only pre-flight command that confirms the system is in the state this step expects, and what output means stop; (2) the exact mutating command with my namespace/region/resource filled in, plus its blast radius; (3) the verify command; (4) the rollback command. End every step with “confirm the pre-flight passed and tell me to proceed” and do not output the next step until I do. Flag any step whose assumption looks stale against the current state I gave you.

A step comes back like this:

Step 2 — drain the unhealthy connection pool Pre-flight (read-only): kubectl get pods -n payments -l pool=primary -o wide — proceed only if the primary pods show Running and the secondary pool exists; stop if secondary is missing. Action: kubectl scale deploy/payments-secondary -n payments --replicas=6 — adds 6 pods to the secondary pool; blast radius: payments namespace only. Verify: kubectl rollout status deploy/payments-secondary -n payments Rollback: kubectl scale deploy/payments-secondary -n payments --replicas=2 Confirm the pre-flight passed and tell me to proceed.

You run the pre-flight. If it says stop, you’ve just avoided scaling a pool that doesn’t exist. If it says go, you run the action knowing the rollback is already in your buffer.

The pre-flight is the whole point

# Pre-flight: is the world actually in the state the step assumes?
kubectl get pods -n payments -l pool=primary -o wide

# Only after a clean pre-flight: the mutating action
kubectl scale deploy/payments-secondary -n payments --replicas=6

# Pre-staged rollback, ready to paste if verify fails
kubectl scale deploy/payments-secondary -n payments --replicas=2

The pre-flight is what separates this from “the AI ran the runbook.” It catches the two things that make automation dangerous: a placeholder the model filled with a plausible-but-wrong value, and a runbook step whose assumption no longer fits the system. Neither costs you anything but a read-only command, and either one would otherwise be the mistake that extends the incident.

AI proposes the command, the human owns the keystroke

The line I hold: the model never claims a step succeeded, never chains mutating actions, and never advances without my word. Its output is a draft of the command, not an action. That distinction is what makes this usable in production where genuine auto-remediation would be reckless.

Rules that keep it safe:

Never skip the pre-flight, especially under pressure. The temptation to paste the action command directly is exactly when the stale-assumption mistake happens.
Keep the rollback in your buffer before you run the action. If verify fails, you want the undo to be one paste away, not something you compose while the outage grows.
Let the model flag stale runbooks. If the current state you pasted contradicts the runbook’s assumption, you want to hear “this step may not apply” before you run it.

You can rehearse this loop on the free incident assistant by pasting a runbook and current state and asking for the guarded, one-step-at-a-time plan. The prompt library has a hardened version with the pre-flight and stop-and-ask gates baked in so the structure does the enforcing.

Remediation gets faster when you stop composing commands from memory and start reviewing commands the AI drafts — but only if every mutating step waits behind a check you run and an approval you give. That’s the trade that actually cuts MTTR: the model’s speed at drafting, the human’s judgment on every keystroke that changes the system.

The danger isn’t the runbook, it’s the assumptions

Ask for one guarded step at a time

The pre-flight is the whole point

AI proposes the command, the human owns the keystroke

Download the Free 500-Prompt DevOps AI Toolkit