Knowing When to Roll Back Your Automation

The scariest outage I’ve been part of wasn’t caused by a bad deploy. It was caused by an auto-remediation script that was supposed to fix things. A node went unhealthy, the script cordoned and drained it — correct — then a flapping health check told it the replacement node was also unhealthy, so it drained that one too, and the one after that. By the time a human noticed, the automation had cheerfully evicted half the cluster, one confident decision at a time. Nobody had built it a way to say “stop.” That night I started treating automation the way I treat any other service in production: it has an SLO, it can fail, and it needs a way to be turned off.

The hard part of automation isn’t writing it. It’s knowing when the thing you wrote has started doing more harm than good — and having the discipline to disable it before it finishes the job. Automation is a fast junior engineer that never gets tired and never second-guesses itself, which is exactly why it needs a leash a human can yank.

Give your automation its own SLOs

You measure your services. Measure your automation the same way. The metrics that matter aren’t “did it run” — they’re about whether running it was a good idea. I track three:

False-fire rate — how often it acted when no action was warranted.
Recurrence rate — how often the same problem comes back after it “fixed” it.
Verified-success rate — not “the command exited 0,” but “the underlying condition actually cleared and stayed clear for N minutes.”

That last one is the one everybody skips. A script that restarts a pod and exits 0 reports success. Whether the pod is actually serving traffic ninety seconds later is a different question, and it’s the only one that counts.

import time

def verified_success(remediation_id: str, check, *, settle_seconds: int = 120) -> bool:
    """Success = the condition cleared AND stayed clear after settling.

    `check` is a zero-argument callable that returns True when the target is healthy.
    `record_outcome` is whatever you use to emit the outcome metric.
    """
    if not check():                       # cleared immediately?
        record_outcome(remediation_id, "no_immediate_effect")
        return False
    time.sleep(settle_seconds)
    if check():                           # still healthy after settling?
        record_outcome(remediation_id, "verified_success")
        return True
    record_outcome(remediation_id, "regressed")   # fixed, then broke again
    return False

The gap between exit-code success and verified success is where bad automation hides. Track both and the divergence tells you when to stop trusting a script. This is the same idea explored in confidence-gated auto-remediation: the bar for “it worked” has to be the outcome, not the exit code.
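To make the three rates and the divergence concrete, here is a minimal sketch of how they could be computed from per-run records. The `Run` record and its field names are my own assumptions for illustration, not an existing schema; in particular, `warranted` assumes someone (or something) labels after the fact whether action was actually needed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    """One remediation run (hypothetical record shape)."""
    exit_ok: bool     # the command exited 0
    verified: bool    # the condition cleared and stayed clear after settling
    warranted: bool   # review judged that action was actually needed
    recurred: bool    # the same problem came back after this run

def automation_slis(runs: list[Run]) -> dict[str, float]:
    """Return the three rates plus the exit-code vs. verified divergence."""
    if not runs:
        return {}
    n = len(runs)
    exit_rate = sum(r.exit_ok for r in runs) / n
    verified_rate = sum(r.verified for r in runs) / n
    return {
        "false_fire_rate": sum(not r.warranted for r in runs) / n,
        "recurrence_rate": sum(r.recurred for r in runs) / n,
        "exit_success_rate": exit_rate,
        "verified_success_rate": verified_rate,
        # A large positive gap means the script "succeeds" without fixing anything.
        "divergence": exit_rate - verified_rate,
    }
```

A divergence near zero means the exit code is an honest signal. A divergence that grows week over week is the early warning that the script has stopped fixing what it claims to fix.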

Build a kill switch you can hit from anywhere

Before any clever circuit breaker, you need the dumbest possible off switch: a flag that any human can flip, from a phone, that the automation checks on every single run. No deploy, no PR, no restart.

import logging

import requests

log = logging.getLogger(__name__)

def automation_enabled(name: str) -> bool:
    # A flag in a config store the on-call can flip in seconds.
    try:
        r = requests.get(f"http://config/flags/{name}", timeout=2)
        r.raise_for_status()                       # an error response is not a "yes"
        return r.json().get("enabled", False) is True   # fail CLOSED on anything but true
    except Exception:
        return False   # if we can't read the flag, do NOT act

def remediate(node: str):
    if not automation_enabled("node-auto-drain"):
        log.info("node-auto-drain disabled by kill switch; skipping")
        return
    drain(node)

Two things make this a real kill switch and not a placebo. It’s checked on every run, so flipping it stops the next action immediately. And it fails closed — if the automation can’t even read the flag, it does nothing. The default state of a confused automation should always be “sit still,” never “keep going.” That single inversion would have saved my half-drained cluster.
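If you have more than one remediation, the same check can be wrapped once and reused. The sketch below is one possible shape, assuming a `read_flag` callable that you supply (for example, the HTTP lookup above); the decorator name and the in-memory flag store are hypothetical and only there to make the example self-contained.

```python
import functools
import logging
from typing import Callable

log = logging.getLogger("automation")

def kill_switch(name: str, read_flag: Callable[[str], bool]):
    """Skip the wrapped action unless read_flag(name) returns exactly True."""
    def decorator(action):
        @functools.wraps(action)
        def wrapper(*args, **kwargs):
            try:
                enabled = read_flag(name) is True
            except Exception:
                enabled = False   # fail closed: an unreadable flag means "sit still"
            if not enabled:
                log.info("%s disabled by kill switch; skipping", name)
                return None
            return action(*args, **kwargs)
        return wrapper
    return decorator

# Example wiring with a hypothetical in-memory flag store.
flags = {"node-auto-drain": True}
drained: list[str] = []

@kill_switch("node-auto-drain", lambda name: flags[name])
def remediate(node: str) -> None:
    drained.append(node)   # stand-in for the real drain(node)
```

The flag is read inside the wrapper, not at decoration time, so every call re-checks it. A missing flag raises inside `read_flag` and is treated the same as "disabled".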

Pro Tip: Put the kill switch in the runbook’s first line, not its appendix. When automation is actively making things worse, “how do I turn this off” needs to be the fastest thing to find, not a thing you reverse-engineer from source under pressure.

Add circuit breakers so it stops itself

A human kill switch covers the case where someone’s watching. A circuit breaker covers the case where nobody is. If automation fails, or simply fires, too many times in a window, it trips itself open, pages a human, and refuses to act until a cooldown expires.

import time
from collections import deque

class CircuitBreaker:
    def __init__(self, *, max_fires: int, window_s: int, name: str):
        self.max_fires, self.window_s, self.name = max_fires, window_s, name
        self.events: deque[float] = deque()
        self.open_until: float = 0.0

    def allow(self) -> bool:
        now = time.time()
        if now < self.open_until:
            return False                       # tripped — refuse to act
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.max_fires:
            self.open_until = now + 3600        # trip for an hour, page a human
            page_oncall(f"circuit breaker {self.name} tripped: "
                        f"{len(self.events)} fires in {self.window_s}s")
            return False
        return True

    def record_fire(self):
        self.events.append(time.time())

# A remediation that fires more than 3 times in 10 minutes is a runaway.
breaker = CircuitBreaker(max_fires=3, window_s=600, name="node-auto-drain")

def remediate(node: str):
    if not breaker.allow():
        return
    breaker.record_fire()
    drain(node)

The threshold encodes a judgment: “if I’m draining nodes more than three times in ten minutes, something is wrong with my inputs, not the nodes.” A healthy cluster doesn’t need three drains in ten minutes. When the breaker trips, it doesn’t quietly retry — it pages a human and waits. The recurring fire is itself the signal that the automation has lost the plot.
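The breaker above reopens on its own after an hour. If you would rather it stay open until a person explicitly clears it, a latching variant is a small change. This is a sketch under my own assumptions: the class name, the `on_trip` callback (where you would page), and the injectable `clock` are all hypothetical, and `allow()` here records the fire itself instead of using a separate `record_fire()`.

```python
import time
from collections import deque
from typing import Callable, Optional

class LatchingBreaker:
    """Trips after max_fires within window_s and stays open until reset()."""

    def __init__(self, *, max_fires: int, window_s: float,
                 on_trip: Callable[[], None] = lambda: None,
                 clock: Callable[[], float] = time.monotonic):
        self.max_fires = max_fires
        self.window_s = window_s
        self.on_trip = on_trip            # e.g. page the on-call
        self.clock = clock                # injectable so tests don't sleep
        self.events: deque = deque()
        self.tripped = False
        self.reset_by: Optional[str] = None

    def allow(self) -> bool:
        if self.tripped:
            return False                  # latched: time alone never reopens it
        now = self.clock()
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()         # forget fires outside the window
        if len(self.events) >= self.max_fires:
            self.tripped = True
            self.on_trip()
            return False
        self.events.append(now)           # count this fire
        return True

    def reset(self, operator: str) -> None:
        """Reopen the breaker; requires a named human for the audit trail."""
        if not operator.strip():
            raise ValueError("a named operator is required to reset the breaker")
        self.events.clear()
        self.tripped = False
        self.reset_by = operator
```

The trade-off is availability of the automation versus safety: a latched breaker will sit idle all weekend if nobody answers the page, which is usually the right failure mode for anything that drains nodes.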

Treat recurrence as “you’re masking the root cause”

Here’s the uncomfortable one. If your automation keeps successfully fixing the same problem, the automation isn’t the hero — it’s the thing hiding a real bug. A script that restarts a leaking service every four hours is buying you uptime and burning down your incentive to fix the leak. Recurrence should feel like a slow alarm, not a quiet win.

I alert on it directly. A remediation that fires for the same target on a regular cadence is a rollback candidate by definition — not because it’s broken, but because its success is letting a deeper problem fester.

groups:
  - name: automation-health
    rules:
      # Same remediation firing repeatedly = masking a root cause
      - alert: RemediationRecurrence
        expr: |
          sum by (remediation, target) (
            increase(remediation_fires_total[6h])
          ) > 4
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.remediation }} fired >4x in 6h on {{ $labels.target }}"
          description: "Automation is masking a recurring failure. Investigate root cause; consider disabling auto-remediation for this target."

      # Verified-success rate collapsing = automation no longer working
      - alert: RemediationVerifiedSuccessLow
        expr: |
          sum(rate(remediation_outcome_total{outcome="verified_success"}[1h]))
          /
          sum(rate(remediation_outcome_total[1h])) < 0.7
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Verified-success rate below 70% — automation is unreliable"

The second rule is the rollback trigger. When verified success drops below 70%, the automation is failing to fix the problem roughly one time in three or worse, and the correct move is to disable it and go back to manual until you understand why. Self-healing infrastructure built with AI is a great goal, but it only works if “heal” means resolving the cause, not papering over the symptom on a loop.
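The count-based alert catches volume, but not rhythm. If you also want to flag a remediation that fires on a regular cadence (the leak-restart-every-four-hours pattern), one rough approach is to look at how evenly spaced the fires are. The function name and both thresholds below are assumptions I picked for illustration, not tuned values.

```python
from statistics import mean, pstdev

def looks_periodic(fire_times: list[float], *, min_fires: int = 4,
                   max_cv: float = 0.2) -> bool:
    """True when fires for one target are evenly spaced enough to suggest
    a recurring root cause. fire_times are Unix timestamps in seconds."""
    if len(fire_times) < min_fires:
        return False                      # too few points to call it a rhythm
    ts = sorted(fire_times)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    avg_gap = mean(gaps)
    if avg_gap == 0:
        return False                      # simultaneous burst: a runaway, not a cadence
    # Coefficient of variation: small means the gaps are nearly identical.
    return pstdev(gaps) / avg_gap <= max_cv
```

Run it per (remediation, target) pair over a few days of fire timestamps. A positive result is a prompt to go find the leak, not a reason to disable anything automatically.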

Use AI to read the logs and nominate rollback candidates

Reviewing automation logs by hand is exactly the kind of toil worth handing off — see identifying and eliminating toil with AI for the broader pattern. The model is good at the pattern-spotting a human glazes over: this remediation always fires right after that deploy, this target recurs on a daily rhythm, this script’s verified-success rate has been quietly sliding for a week.

# Pull a structured digest, hand it to the model to triage. The model
# NOMINATES candidates — it never disables anything itself.
digest = summarize_automation_logs(window="7d")  # per-remediation outcome counts

prompt = f"""You are reviewing automation health logs. For each remediation,
flag any that look like rollback candidates and say why. Signals to weigh:
verified-success rate below 70%, recurrence on a fixed cadence (masking a root
cause), or a rising trend in regressions. Output a ranked list of candidates
with one-line reasoning each. Do NOT recommend changes to anything you have no
data for.

{digest}"""

candidates = model.review(prompt)        # returns a ranked list, nothing more
post_to_review_channel(candidates)       # a human reads it and decides

The model’s output is a nomination, not a decision. It posts a ranked list of “you might want to roll these back, here’s why” to a channel where a human reads it over coffee. The model never flips a flag, never trips a breaker, never disables a script. It’s a fast junior triaging the queue so the senior spends their attention where it matters. A human owns every rollback, and the audit trail shows who pulled which plug and when. If you want a place to draft and refine these triage prompts, the prompt workspace is built for exactly that iteration.
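For the audit trail, it helps to make "a human owns every rollback" something the code enforces and not just a convention. A minimal sketch, assuming an append-only list of JSON lines as the log; the record fields and function name are hypothetical, and a real system would write to durable storage.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class RollbackRecord:
    automation: str            # which automation was disabled
    decided_by: str            # the human who made the call
    reason: str
    nominated_by_model: bool   # True if the model's list prompted the review
    decided_at: str            # ISO 8601 timestamp, UTC

def record_rollback(audit_log: list[str], *, automation: str, decided_by: str,
                    reason: str, nominated_by_model: bool = False,
                    now: Optional[datetime] = None) -> RollbackRecord:
    """Append one rollback decision to the log; refuses anonymous decisions."""
    if not decided_by.strip():
        raise ValueError("a rollback needs a named human owner")
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    record = RollbackRecord(automation, decided_by, reason, nominated_by_model, stamp)
    audit_log.append(json.dumps(asdict(record), sort_keys=True))
    return record
```

The `nominated_by_model` field is there so you can later check how often the model's nominations turned into real rollbacks, which is its own verified-success rate.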

The whole point of rollback discipline is humility about your own automation. You wrote it; that doesn’t mean it’s right today. Give it an SLO, a kill switch, a circuit breaker, and a standing question — is this still helping? — and you’ll catch the runaway before it drains the second node, not after it’s drained the tenth.
