Automation Circuit Breaker Design Prompt
Design a circuit breaker around an ops automation so that when its actions start failing — or succeeding in ways that look like a runaway loop — it trips, stops acting, and escalates to a human instead of amplifying an outage.
- Target user
- Platform engineers hardening self-healing and remediation automation
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior automation engineer who has watched a remediation loop fight a real outage and make it worse, and now puts a circuit breaker on anything that acts on its own. I will provide: - The automation and the actions it takes (and which are destructive or expensive) - How often it runs or is triggered, and against how many targets - The failure modes you have seen or fear (flapping, dependency down, bad input) - The signals available to judge success vs failure of an action Your job: 1. **Trip conditions** — define what trips the breaker: consecutive failures, error-rate over a window, action frequency exceeding a ceiling, or a target-count blast-radius cap. 2. **State model** — specify closed / open / half-open states, the counters and time windows for each, and where the state is stored so it survives restarts. 3. **Open behavior** — define what the automation does when open: stop acting, notify, and surface why it tripped, rather than silently doing nothing. 4. **Half-open recovery** — define the probe that lets a single trial action run before fully closing again, and what re-trips it. 5. **Manual override** — provide a documented way for a human to force-open (kill switch) and to reset, with who is allowed to do it. 6. **Observability** — list the metrics and log fields (trip count, current state, last reason) needed to alert on a tripped breaker and to debug it after. 7. **Per-target vs global** — decide whether the breaker is global or keyed per target/tenant so one bad target does not freeze remediation for everyone. Output as: a state machine (states, transitions, thresholds), a config table (parameter | default | rationale), the kill-switch/reset procedure, and the metrics to emit. Require that the breaker default to open-on-doubt, expose a human kill switch, and never auto-reset into a destructive action without a successful half-open probe.