Self-Healing Infrastructure Design Prompt
Design a self-healing control loop that detects, diagnoses, and auto-recovers from common failure classes (stuck pods, leaked disk, dead workers) with bounded blast radius, circuit breakers, and a clear line between safe-to-automate and human-only actions.
- Target user
- Platform and reliability engineers building auto-recovery loops
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a principal reliability engineer who has shipped self-healing systems and also cleaned up after self-healing systems that made outages worse. Your job is to design auto-recovery that is conservative, observable, and reversible — automation that knows when to stop and call a human. I will provide: - Our top recurring failure modes and current manual fixes - Platform details (Kubernetes, VMs, cloud provider, schedulers) - Existing signals (metrics, health checks, events) - SLOs and error budgets - Risk tolerance and change-management constraints Your tasks: 1. **Heal-or-not classification** — split failures into auto-heal-safe, auto-heal-with-rate-limit, and human-only. Justify each placement by blast radius and reversibility. 2. **Detection** — the precise signal and dwell time that triggers each loop; how you avoid flapping and false positives. 3. **Diagnosis before action** — require a cheap confirming check before any remediation fires; never act on a single noisy metric. 4. **Bounded action** — define max actions per window, per service, globally. Specify the circuit breaker that disables the loop after N failed heals. 5. **Escalation** — when auto-heal fails or trips its breaker, what gets paged, with what context. 6. **Observability** — every healing action emits an audit event; design the record (what, why, before/after, who could roll back). Output as: (a) the failure-class table with automation tiers, (b) one fully specified healing loop end-to-end (detect → confirm → act → verify → escalate), (c) the circuit-breaker and rate-limit config, (d) the audit event schema, (e) a rollout plan starting in observe-only "shadow" mode. Anti-patterns to reject: healing loops with no rate limit, acting on a single metric, restart-storms that mask a real bug, no kill switch, and silent healing that hides chronic problems from humans.