
Auto-Remediation Safety Scoring and Dry-Run Prompt

Build a safety-scoring framework that classifies each auto-remediation action by blast radius and reversibility, and routes risky actions through dry-run or human approval.

Target user: SREs and reliability engineers hardening auto-remediation
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior reliability engineer who designs safety controls for auto-remediation systems.

I will provide:
- The list of candidate remediation actions and the alerts that trigger them
- The systems each action touches and whether the effect is reversible
- Current guardrails (rate limits, approvals) if any
- Past incidents where automation made things worse

Your job:

1. **Score each action** — rate blast radius, reversibility, and confidence on a defined scale; produce a composite safety score.
2. **Tier the actions** — sort into auto-execute, dry-run-then-execute, and approval-required tiers based on the score.
3. **Dry-run design** — for each non-trivial action, define what a dry-run validates (diff, simulation, canary) and the pass criteria to proceed.
4. **Circuit breakers** — set rate limits, cooldowns, and a global kill switch; define conditions that auto-disable a remediation.
5. **Preconditions and post-checks** — specify health checks before acting and verification after, with auto-rollback on failure.
6. **Audit** — define the immutable log entry per remediation (trigger, score, decision, outcome).
7. **Escalation** — when automation declines or fails, how it hands off to a human.
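
To make steps 1 and 2 concrete, here is a minimal sketch of a composite score and tier assignment. Everything specific in it is an assumption for illustration, not part of the prompt: the 1-to-5 scales, the weights, the two thresholds, and the names `ActionScore` and `tier`. Confidence is expressed inverted, as uncertainty, so that a higher number always means more risk.

```python
from dataclasses import dataclass

# Hypothetical weights and thresholds; tune them to your own risk appetite.
WEIGHTS = {"blast_radius": 0.4, "irreversibility": 0.4, "uncertainty": 0.2}
AUTO_MAX = 2.0     # composite risk at or below this: auto-execute
DRY_RUN_MAX = 3.5  # at or below this: dry-run first; above it: human approval


@dataclass(frozen=True)
class ActionScore:
    name: str
    blast_radius: int     # 1 (single instance) .. 5 (whole fleet)
    irreversibility: int  # 1 (trivially undone) .. 5 (cannot be undone)
    uncertainty: int      # 1 (diagnosis is near certain) .. 5 (a guess)

    def composite(self) -> float:
        """Weighted sum of the three ratings; higher means riskier."""
        total = (
            WEIGHTS["blast_radius"] * self.blast_radius
            + WEIGHTS["irreversibility"] * self.irreversibility
            + WEIGHTS["uncertainty"] * self.uncertainty
        )
        return round(total, 2)


def tier(score: ActionScore) -> str:
    """Map a composite risk score onto the three tiers from step 2."""
    risk = score.composite()
    if risk <= AUTO_MAX:
        return "auto-execute"
    if risk <= DRY_RUN_MAX:
        return "dry-run-then-execute"
    return "approval-required"


if __name__ == "__main__":
    for action in (
        ActionScore("restart_pod", 1, 1, 2),
        ActionScore("scale_up", 3, 2, 3),
        ActionScore("drop_partition", 4, 5, 3),
    ):
        print(action.name, action.composite(), tier(action))
```

A weighted sum is only one possible choice. A stricter variant could take the maximum of the three ratings, so that a single severe dimension is never averaged away.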

Output as: (a) the scored action matrix, (b) the tiering rules, (c) the circuit-breaker config, (d) an audit-log schema.
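
For a sense of what the circuit-breaker config in (c) might drive, the sketch below combines a sliding-window rate limit, a cooldown, and a shared kill switch. The class name `CircuitBreaker`, its parameters, and the injectable `clock` are hypothetical; a production version would also need persistence and coordination across processes.

```python
import time
from collections import deque
from typing import Callable, Deque


class CircuitBreaker:
    """Per-remediation rate limit and cooldown, plus a shared kill switch."""

    kill_switch_engaged = False  # class-wide: blocks every remediation when True

    def __init__(
        self,
        max_runs: int,
        window_s: float,
        cooldown_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_runs = max_runs      # allowed runs per sliding window
        self.window_s = window_s      # sliding window length in seconds
        self.cooldown_s = cooldown_s  # how long to stay disabled after tripping
        self.clock = clock            # injectable so tests can control time
        self.runs: Deque[float] = deque()
        self.disabled_until = float("-inf")

    def allow(self) -> bool:
        """Return True and record a run if the remediation may execute now."""
        now = self.clock()
        if CircuitBreaker.kill_switch_engaged or now < self.disabled_until:
            return False
        # Forget runs that have aged out of the sliding window.
        while self.runs and now - self.runs[0] >= self.window_s:
            self.runs.popleft()
        if len(self.runs) >= self.max_runs:
            # Too many recent runs: trip the breaker and start the cooldown.
            self.disabled_until = now + self.cooldown_s
            return False
        self.runs.append(now)
        return True
```

In this sketch the kill switch is checked before anything else, so engaging it blocks a remediation even when its own rate limit has headroom.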

No action above the defined risk threshold may execute without a passing dry-run or an explicit human approval, and a global kill switch must halt all remediations instantly.
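
The rule above can be read as a small decision function followed by one audit record. The sketch below is one possible reading, not a prescribed implementation: `RISK_THRESHOLD`, `decide`, `audit_entry`, and the `timestamp` and `action` fields are assumptions, while trigger, score, decision, and outcome come from the audit step in the prompt. Immutability itself (append-only or write-once storage) is outside the scope of the sketch.

```python
import json
from datetime import datetime, timezone
from typing import Optional

RISK_THRESHOLD = 2.0  # hypothetical cut-off; scores above it need a gate


def decide(
    score: float,
    dry_run_passed: Optional[bool],  # None means no dry-run was attempted
    approved: bool,
    kill_switch: bool = False,
) -> str:
    """Return 'halted', 'execute', or 'escalate' for one remediation."""
    if kill_switch:
        return "halted"  # the kill switch overrides everything else
    if score <= RISK_THRESHOLD:
        return "execute"
    if dry_run_passed is True or approved:
        return "execute"
    return "escalate"  # hand off to a human


def audit_entry(
    trigger: str, action: str, score: float, decision: str, outcome: str
) -> str:
    """Serialize one remediation record as a JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trigger": trigger,
        "action": action,
        "score": score,
        "decision": decision,
        "outcome": outcome,
    }
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    decision = decide(score=3.2, dry_run_passed=None, approved=False)
    print(audit_entry("disk_full_alert", "purge_old_logs", 3.2, decision, "pending"))
```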
