Single-Failure Self-Healing Guardrail Scoping Prompt
Scope the guardrails for one specific recurring failure you want to auto-remediate — the exact trigger, the confirming check, the bounded action, the verification, and the stop conditions — so a single self-healing loop is provably safe before it ever touches production.
- Target user: Reliability engineers hardening one auto-remediation at a time
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a reliability engineer who hardens self-healing one failure at a time. Rather than design a whole platform, you take a single recurring failure and define every guardrail tightly enough that you'd trust this one loop to run unattended at 3am.

I will provide:
- The specific recurring failure (symptom, how often, current manual fix)
- The signals available to detect and confirm it
- The remediation action and its blast radius / reversibility
- The SLO it protects and our tolerance for a wrong action

Your tasks:
1. **Trigger definition**: the precise signal and dwell time that fires the loop, tuned to avoid flapping. State why a single noisy sample is not enough.
2. **Confirming check**: the cheap, read-only second check that must agree before any action runs, so the loop never acts on a false positive.
3. **Bounded action**: the exact remediation, scoped as narrowly as possible (one pod, one node, never the whole fleet), with a dry-run/preview where feasible.
4. **Rate limit and breaker**: max actions per window for this loop, and the circuit breaker that disables it after N heals that didn't stick, because repeated healing means a real bug, not a blip.
5. **Verification and rollback**: the read-only check that the symptom actually cleared, and what happens if it didn't: escalate, not retry forever. Define the undo if the action is reversible.
6. **Escalation and kill switch**: what gets paged when the loop gives up or trips, with what context, and how a human disables the loop instantly.

Output as:
(a) the full loop spec (trigger → confirm → act → verify → escalate),
(b) the rate-limit and breaker config with concrete numbers,
(c) the escalation and kill-switch design,
(d) a shadow-mode rollout plan that logs would-be actions before enabling.

Reject any loop that acts on one metric, has no rate limit, retries indefinitely, or lacks a kill switch.
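To give a feel for the kind of loop spec the prompt asks for, here is a minimal Python sketch of the trigger → confirm → act → verify → escalate flow. It is an illustration, not a reference implementation: the `HealingLoop` class, its field names, and every default number are hypothetical, and the breaker is simplified to count consecutive failed verifications as "heals that didn't stick".

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple


@dataclass
class HealingLoop:
    """One guarded self-healing loop. All names and numbers are illustrative."""
    confirm: Callable[[], bool]       # cheap, read-only second check
    act: Callable[[], None]           # narrowly scoped remediation (one pod, one node)
    verify: Callable[[], bool]        # read-only check that the symptom cleared
    escalate: Callable[[str], None]   # page a human with context
    dwell_samples: int = 3            # consecutive bad samples required to fire
    max_actions: int = 2              # rate limit: actions allowed per window
    window_s: float = 3600.0          # rate-limit window in seconds
    breaker_threshold: int = 2        # consecutive failed heals before disabling
    enabled: bool = True              # kill switch: set to False to stop the loop
    shadow: bool = True               # shadow mode: log would-be actions only
    _bad_streak: int = 0
    _failed_heals: int = 0
    _actions: Deque[float] = field(default_factory=deque)
    log: List[Tuple[str, float]] = field(default_factory=list)

    def observe(self, symptom_present: bool, now: float) -> str:
        """Feed one sample of the trigger signal; returns what the loop did."""
        if not self.enabled:
            return "disabled"
        # Trigger: require a run of bad samples so one noisy reading cannot fire.
        self._bad_streak = self._bad_streak + 1 if symptom_present else 0
        if self._bad_streak < self.dwell_samples:
            return "waiting"
        self._bad_streak = 0
        # Confirm: a second, independent signal must agree before acting.
        if not self.confirm():
            return "unconfirmed"
        # Rate limit: drop actions that fell out of the window, then check the cap.
        while self._actions and now - self._actions[0] >= self.window_s:
            self._actions.popleft()
        if len(self._actions) >= self.max_actions:
            self.escalate("rate limit reached")
            return "rate_limited"
        if self.shadow:
            self.log.append(("would_act", now))
            return "shadow"
        self._actions.append(now)
        self.act()
        # Verify: escalate on failure instead of retrying forever.
        if self.verify():
            self._failed_heals = 0
            return "healed"
        self._failed_heals += 1
        if self._failed_heals >= self.breaker_threshold:
            self.enabled = False  # breaker trips; a human must re-enable
            self.escalate("breaker tripped: heals are not sticking")
            return "tripped"
        self.escalate("verification failed")
        return "escalated"
```

In this sketch, leaving `shadow=True` corresponds to the shadow-mode rollout in output (d), and flipping `enabled` to `False` is the kill switch. A real loop would likely persist the breaker and rate-limit state outside the process so a restart does not reset them.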
Related prompts
- Auto-Remediation Safety Scoring and Dry-Run Prompt: Build a safety-scoring framework that classifies each auto-remediation action by blast radius and reversibility, and routes risky actions through dry-run or human approval.
- Self-Healing Infrastructure Design Prompt: Design a self-healing control loop that detects, diagnoses, and auto-recovers from common failure classes (stuck pods, leaked disk, dead workers) with bounded blast radius, circuit breakers, and a clear line between safe-to-automate and human-only actions.