Auto-Remediation Safety Scoring and Dry-Run Prompt
Build a safety-scoring framework that classifies each auto-remediation action by blast radius and reversibility, and routes risky actions through dry-run or human approval.
- Target user: SRE and reliability engineers hardening auto-remediation
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior reliability engineer who designs safety controls for auto-remediation systems.

I will provide:
- The list of candidate remediation actions and the alerts that trigger them
- The systems each action touches and whether the effect is reversible
- Current guardrails (rate limits, approvals) if any
- Past incidents where automation made things worse

Your job:
1. **Score each action**: rate blast radius, reversibility, and confidence on a defined scale; produce a composite safety score.
2. **Tier the actions**: sort into auto-execute, dry-run-then-execute, and approval-required tiers based on the score.
3. **Dry-run design**: for each non-trivial action, define what a dry-run validates (diff, simulation, canary) and the pass criteria to proceed.
4. **Circuit breakers**: set rate limits, cooldowns, and a global kill switch; define conditions that auto-disable a remediation.
5. **Preconditions and post-checks**: specify health checks before acting and verification after, with auto-rollback on failure.
6. **Audit**: define the immutable log entry per remediation (trigger, score, decision, outcome).
7. **Escalation**: when automation declines or fails, how it hands off to a human.

Output as: (a) the scored action matrix, (b) the tiering rules, (c) the circuit-breaker config, (d) an audit-log schema.

No action above the defined risk threshold may execute without a passing dry-run or an explicit human approval, and a global kill switch must halt all remediations instantly.
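
Example of the expected shape (not part of the prompt)

The sketch below is one way the scoring, tiering, and final gating rule requested above could look once implemented. It is a minimal illustration, not a reference design. Everything specific in it is an assumption: the 1-to-5 scales, the additive composite, the two thresholds, and the names `ActionScore`, `tier_for`, and `may_execute` are all hypothetical, and the three sample actions are invented for demonstration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_EXECUTE = "auto-execute"
    DRY_RUN_THEN_EXECUTE = "dry-run-then-execute"
    APPROVAL_REQUIRED = "approval-required"


@dataclass(frozen=True)
class ActionScore:
    """Risk axes for one remediation action (assumed 1..5 scales, higher = riskier)."""
    name: str
    blast_radius: int     # 1 = single instance, 5 = whole fleet
    irreversibility: int  # 1 = trivially undone, 5 = cannot be undone
    uncertainty: int      # 1 = diagnosis is well understood, 5 = low confidence

    def composite(self) -> int:
        # Assumed formula: plain sum of the three axes, giving a range of 3..15.
        return self.blast_radius + self.irreversibility + self.uncertainty


# Hypothetical thresholds; a real deployment would tune these from incident history.
DRY_RUN_THRESHOLD = 6
APPROVAL_THRESHOLD = 10


def tier_for(score: ActionScore) -> Tier:
    """Map a composite risk score onto one of the three tiers."""
    risk = score.composite()
    if risk >= APPROVAL_THRESHOLD:
        return Tier.APPROVAL_REQUIRED
    if risk >= DRY_RUN_THRESHOLD:
        return Tier.DRY_RUN_THEN_EXECUTE
    return Tier.AUTO_EXECUTE


def may_execute(
    score: ActionScore,
    *,
    kill_switch_on: bool,
    dry_run_passed: bool = False,
    human_approved: bool = False,
) -> bool:
    """Apply the gating rule: the kill switch wins, then the tier decides."""
    if kill_switch_on:
        return False  # global kill switch halts every remediation
    tier = tier_for(score)
    if tier is Tier.AUTO_EXECUTE:
        return True
    if tier is Tier.DRY_RUN_THEN_EXECUTE:
        return dry_run_passed or human_approved
    return human_approved  # approval-required: a dry-run alone is not enough


# Invented sample actions, for illustration only.
restart_pod = ActionScore("restart_pod", blast_radius=1, irreversibility=1, uncertainty=2)
scale_up = ActionScore("scale_up_pool", blast_radius=2, irreversibility=2, uncertainty=3)
failover_db = ActionScore("failover_primary_db", blast_radius=4, irreversibility=4, uncertainty=3)

for action in (restart_pod, scale_up, failover_db):
    print(action.name, action.composite(), tier_for(action).value)
```

Checking the kill switch before the tier keeps the "halt all remediations instantly" requirement independent of how any individual action is scored. In this sketch the approval-required tier is stricter than the minimum stated in the prompt, since it demands human approval even when a dry-run has passed; that is a design choice, and a model's answer may reasonably differ.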