Category: AI for Automation

Self-Healing Infrastructure Design Prompt

Design a self-healing control loop that detects, diagnoses, and automatically recovers from common failure classes (stuck pods, disk-space leaks, dead workers), with a bounded blast radius, circuit breakers, and a clear line between actions that are safe to automate and actions reserved for humans.

Target user: Platform and reliability engineers building auto-recovery loops
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a principal reliability engineer who has shipped self-healing systems and also cleaned up after self-healing systems that made outages worse. Your job is to design auto-recovery that is conservative, observable, and reversible — automation that knows when to stop and call a human.

I will provide:
- Our top recurring failure modes and current manual fixes
- Platform details (Kubernetes, VMs, cloud provider, schedulers)
- Existing signals (metrics, health checks, events)
- SLOs and error budgets
- Risk tolerance and change-management constraints

Your tasks:

1. **Heal-or-not classification** — split failures into auto-heal-safe, auto-heal-with-rate-limit, and human-only. Justify each placement by blast radius and reversibility.

2. **Detection** — the precise signal and dwell time that triggers each loop; how you avoid flapping and false positives.

3. **Diagnosis before action** — require a cheap confirming check before any remediation fires; never act on a single noisy metric.

4. **Bounded action** — define max actions per window, per service, globally. Specify the circuit breaker that disables the loop after N failed heals.

5. **Escalation** — when auto-heal fails or trips its breaker, what gets paged, with what context.

6. **Observability** — every healing action emits an audit event; design the record (what, why, before/after, who could roll back).
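
As an illustration of the kind of answer task 4 is looking for, the following is a minimal sketch of a sliding-window rate limit combined with a circuit breaker that opens after N consecutive failed heals. The class name `HealGuard`, its field names, and all default thresholds are hypothetical and chosen only for the example; real values should come from your SLOs and risk tolerance.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HealGuard:
    """Sliding-window rate limit plus a failed-heal circuit breaker (illustrative)."""
    max_actions: int = 3        # assumed: heals allowed per window
    window_s: float = 600.0     # assumed: window length in seconds
    max_failed_heals: int = 2   # assumed: consecutive failures that trip the breaker
    tripped: bool = False
    _actions: deque = field(default_factory=deque)
    _failed: int = 0

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and record the action if a heal may fire right now."""
        now = time.monotonic() if now is None else now
        if self.tripped:
            return False
        # Drop timestamps that have aged out of the window.
        while self._actions and now - self._actions[0] >= self.window_s:
            self._actions.popleft()
        if len(self._actions) >= self.max_actions:
            return False
        self._actions.append(now)
        return True

    def record_result(self, healed: bool) -> None:
        """Track consecutive failed heals and open the breaker at the limit."""
        if healed:
            self._failed = 0
            return
        self._failed += 1
        if self._failed >= self.max_failed_heals:
            self.tripped = True  # stays open until a human calls reset()

    def reset(self) -> None:
        """Manual re-arm after a human has looked at the failures."""
        self._failed = 0
        self.tripped = False
```

One way to cover the per-service and global limits is to keep one `HealGuard` per service plus a single shared instance, and to fire a heal only when both `allow()` calls return True. The breaker is deliberately reset by hand here, so a tripped loop always ends in front of a person.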

Output as: (a) the failure-class table with automation tiers, (b) one fully specified healing loop end-to-end (detect → confirm → act → verify → escalate), (c) the circuit-breaker and rate-limit config, (d) the audit event schema, (e) a rollout plan starting in observe-only "shadow" mode.
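
For a sense of what output (d) might look like, here is a minimal sketch of an audit event record that covers the fields named in task 6 (what, why, before/after, who could roll back). The class `HealAuditEvent`, the helper `new_event`, and every field name are assumptions made for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class HealAuditEvent:
    """One record per healing action (hypothetical field names)."""
    loop: str             # which healing loop acted
    action: str           # what was done
    target: str           # what it was done to
    reason: str           # why: the signal and the confirming check
    state_before: dict    # snapshot taken before the action
    state_after: dict     # snapshot taken after verification
    rollback_owner: str   # who could roll back
    timestamp: str        # UTC, ISO 8601


def new_event(loop: str, action: str, target: str, reason: str,
              state_before: dict, state_after: dict,
              rollback_owner: str) -> HealAuditEvent:
    """Build an event stamped with the current UTC time."""
    return HealAuditEvent(
        loop=loop, action=action, target=target, reason=reason,
        state_before=state_before, state_after=state_after,
        rollback_owner=rollback_owner,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_json(event: HealAuditEvent) -> str:
    """Serialize with sorted keys so records are easy to diff and index."""
    return json.dumps(asdict(event), sort_keys=True)
```

A frozen dataclass keeps the record immutable once emitted, which suits an audit trail; the before and after snapshots are plain dicts so each loop can attach whatever state it verified.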

Anti-patterns to reject: healing loops with no rate limit, acting on a single metric, restart-storms that mask a real bug, no kill switch, and silent healing that hides chronic problems from humans.
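
To make the "acting on a single metric" anti-pattern concrete, the sketch below shows one possible shape for detection with a dwell time followed by an independent confirming check. `DwellDetector`, `should_heal`, and the `confirm` callback are hypothetical names; the confirming check stands in for whatever cheap second signal your platform offers.

```python
from typing import Callable, Optional


class DwellDetector:
    """Fires only after a signal has stayed unhealthy for a full dwell time."""

    def __init__(self, dwell_s: float) -> None:
        self.dwell_s = dwell_s  # assumed: seconds the signal must persist
        self._unhealthy_since: Optional[float] = None

    def observe(self, unhealthy: bool, now: float) -> bool:
        """Feed one sample; return True once the dwell time has elapsed."""
        if not unhealthy:
            # Any healthy sample resets the timer, which damps flapping.
            self._unhealthy_since = None
            return False
        if self._unhealthy_since is None:
            self._unhealthy_since = now
        return now - self._unhealthy_since >= self.dwell_s


def should_heal(detector: DwellDetector, unhealthy: bool, now: float,
                confirm: Callable[[], bool]) -> bool:
    """Require a sustained signal AND an independent confirming check."""
    # confirm() runs only after the dwell time has passed, so the
    # check is not invoked on every noisy sample.
    return detector.observe(unhealthy, now) and confirm()
```

Resetting on a single healthy sample is the conservative choice: it trades slower detection for fewer false positives, and the dwell time can be tuned per failure class.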
