Confidence-Gated Auto-Remediation Prompt
Design an auto-remediation system that acts only when diagnostic confidence clears a tier-specific threshold — auto-fixing high-confidence low-risk issues, proposing fixes for medium confidence, and paging a human for everything else, with full dry-run and rollback.
- Target user: SREs building safe automated remediation pipelines
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a reliability engineer designing auto-remediation where the central question is not "can we fix this?" but "are we sure enough, and is it safe enough, to fix it without a human?" Build a system that gates every action on both confidence and blast radius.

I will provide:
- The failure types we want to remediate and their current manual fixes
- Diagnostic signals available per failure type
- Blast radius / impact of each remediation
- Our environment and existing automation tooling
- Risk tolerance and audit requirements

Your tasks:
1. **Confidence scoring**: define how diagnostic confidence is computed per failure (corroborating signals, recent-change correlation, historical match rate). Be explicit about what lowers confidence.
2. **The decision matrix**: cross confidence (low/med/high) with risk (low/med/high). Specify the action for each cell: auto-fix, propose-and-confirm, or page-human. High-risk is human-only regardless of confidence.
3. **Mandatory dry-run**: every mutating remediation runs a dry-run/plan first and validates the expected change before executing. Block if the plan looks wrong.
4. **Verification and rollback**: after acting, confirm the fix worked; if not, auto-rollback and escalate. Define the success check and the rollback step per remediation.
5. **Rate limiting and circuit breaker**: cap auto-fixes per window; trip the breaker after consecutive failures and fall back to human-only.
6. **Audit**: log signal → confidence → decision → action → result for every event, including the no-action decisions.

Output as: (a) the confidence-scoring method, (b) the confidence × risk decision matrix, (c) one remediation fully specified end-to-end (detect → score → dry-run → act → verify → rollback), (d) rate-limit/circuit-breaker config, (e) the audit schema and rollout starting in propose-only mode.

Anti-patterns to reject: auto-acting on high-confidence high-risk fixes, skipping dry-run, no rollback path, hiding the confidence number, and a breaker that never trips.
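
Example of the gate (not part of the prompt)
To make the target concrete, here is a minimal Python sketch of the kind of confidence × risk gate the prompt asks for, with a circuit-breaker override. It is an illustration under stated assumptions, not a reference implementation. The cut-offs (0.9 and 0.6), the breaker limit of 3, and every name (`Level`, `Action`, `MATRIX`, `CircuitBreaker`, `decide`) are hypothetical. The prompt fixes only two things: high risk is always human-only, and high-confidence low-risk issues may be auto-fixed. The remaining cells are one plausible filling.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = "low"
    MED = "med"
    HIGH = "high"


class Action(Enum):
    AUTO_FIX = "auto-fix"
    PROPOSE = "propose-and-confirm"
    PAGE = "page-human"


# Hypothetical cut-offs; in practice these would be tuned per failure type.
HIGH_CONFIDENCE = 0.9
MED_CONFIDENCE = 0.6


def confidence_level(score: float) -> Level:
    """Bucket a confidence score in [0, 1] into low/med/high."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {score}")
    if score >= HIGH_CONFIDENCE:
        return Level.HIGH
    if score >= MED_CONFIDENCE:
        return Level.MED
    return Level.LOW


# (confidence, risk) -> action for low and medium risk.
# High risk is handled separately in decide() and never reaches this table.
# Only the (HIGH, LOW) cell comes from the prompt; the rest are assumptions.
MATRIX = {
    (Level.HIGH, Level.LOW): Action.AUTO_FIX,
    (Level.HIGH, Level.MED): Action.PROPOSE,
    (Level.MED, Level.LOW): Action.PROPOSE,
    (Level.MED, Level.MED): Action.PROPOSE,
    (Level.LOW, Level.LOW): Action.PAGE,
    (Level.LOW, Level.MED): Action.PAGE,
}


@dataclass
class CircuitBreaker:
    """Trips after N consecutive failed auto-fixes (N is an assumed default)."""
    max_consecutive_failures: int = 3
    consecutive_failures: int = 0

    def record(self, success: bool) -> None:
        # Any success clears the streak; each failure extends it.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    def reset(self) -> None:
        # Intended to be called by a human after reviewing the failures.
        self.consecutive_failures = 0

    @property
    def tripped(self) -> bool:
        return self.consecutive_failures >= self.max_consecutive_failures


def decide(score: float, risk: Level, breaker: CircuitBreaker) -> Action:
    """Gate an action on confidence, blast radius, and breaker state."""
    if risk is Level.HIGH:
        return Action.PAGE  # human-only regardless of confidence
    action = MATRIX[(confidence_level(score), risk)]
    if breaker.tripped and action is Action.AUTO_FIX:
        return Action.PAGE  # breaker open: fall back to human-only
    return action
```

With these assumed thresholds, `decide(0.95, Level.LOW, CircuitBreaker())` returns `Action.AUTO_FIX`, while the same score at `Level.HIGH` risk returns `Action.PAGE`. A real system would also log every call to `decide`, including the no-action outcomes, and would run the dry-run and verification steps around any `AUTO_FIX`.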