Alert-to-Action Automation Mapping Prompt
Map noisy alerts to automated first-response actions — enrichment, safe auto-remediation candidates, and human-escalation criteria — so on-call gets fewer pages and faster triage.
- Target user: On-call SREs reducing pages by automating first response
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior on-call SRE who has converted a wall of pages into a tiered system where machines do the boring first steps and humans only see what truly needs them.

I will provide:
- A sample of recent alerts (name, frequency, what they mean, current action)
- Which alerts auto-resolve, which need action, which are pure noise
- Available actions (restart, drain node, clear cache, scale, rotate, page)
- Risk tolerance and any change-control constraints

Your job:
1. **Classify each alert** into: (a) auto-suppress/tune (noise), (b) auto-enrich then page, (c) safe auto-remediate, (d) always-page. Justify each placement.
2. **Enrichment first**: for actionable alerts, define the context to gather automatically before any human or machine acts: recent deploys, related alerts, dashboard snapshot, owning team, last similar incident. Enrichment is read-only and always safe.
3. **Safe auto-remediation candidates**: identify the small set of alerts where a single, reversible, well-understood action (e.g., restart a wedged worker, clear a full temp dir) is appropriate. For each, define the precondition checks, the action, and the verify-after step.
4. **Blast-radius limits**: cap how many times an auto-action runs in a window before it gives up and pages a human (e.g., restart at most twice in 30 min, else escalate). This prevents masking a real failure.
5. **Escalation criteria**: exactly when a machine hands off to a human, and what context it hands over.
6. **Closing the loop**: every auto-action posts what it did, why, and the result into the incident channel, fully auditable.
7. **Anti-patterns**: auto-restarting to hide a crash loop, suppressing alerts that should be fixed, actions with no verify step.

Output as: (a) the alert classification table, (b) enrichment runbook per actionable alert, (c) the safe-auto-remediation set with precondition + action + verify + blast-radius cap, (d) escalation rules, (e) a 30-day metric plan (page volume, auto-resolve rate, MTTA).

Be conservative: when in doubt, enrich-and-page rather than auto-act. Auto-remediation is a privilege earned by reversibility and clear preconditions.
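
Example (not part of the prompt text): to make steps 3 and 4 concrete, the following is a minimal Python sketch of a precondition + action + verify wrapper with a blast-radius cap. The class name `CappedRemediation`, its parameters, and the returned status strings are all hypothetical and chosen for illustration; a real setup would wire these callbacks into whatever alerting and paging tools are in use.

```python
import time
from collections import deque
from typing import Callable, Deque


class CappedRemediation:
    """Hypothetical helper: run one reversible action at most
    `max_runs` times per `window_s` seconds, otherwise escalate."""

    def __init__(
        self,
        max_runs: int = 2,
        window_s: float = 30 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_runs = max_runs
        self.window_s = window_s
        self.clock = clock  # injectable so the cap can be tested without sleeping
        self._runs: Deque[float] = deque()  # timestamps of recent auto-actions

    def handle(
        self,
        precondition: Callable[[], bool],
        action: Callable[[], None],
        verify: Callable[[], bool],
    ) -> str:
        now = self.clock()
        # Forget attempts that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_s:
            self._runs.popleft()
        # Blast-radius cap: too many recent attempts means a human should look.
        if len(self._runs) >= self.max_runs:
            return "escalate: cap reached"
        # Only act when the situation matches the well-understood case.
        if not precondition():
            return "escalate: precondition failed"
        self._runs.append(now)
        action()
        # Never assume the action worked; check and hand off if it did not.
        if not verify():
            return "escalate: verify failed"
        return "remediated"
```

In this sketch every "escalate" result is the point where the machine would page a human and hand over the gathered context, and every return value is something the automation could post to the incident channel. A failed precondition does not count against the cap because no action was taken, while a failed verify does.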