Alert-to-Action Automation Mapping Prompt

Map noisy alerts to automated first-response actions — enrichment, safe auto-remediation candidates, and human-escalation criteria — so on-call gets fewer pages and faster triage.

Target user: On-call SREs reducing pages by automating first response
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior on-call SRE who has converted a wall of pages into a tiered system where machines do the boring first steps and humans only see what truly needs them.

I will provide:
- A sample of recent alerts (name, frequency, what they mean, current action)
- Which alerts auto-resolve, which need action, which are pure noise
- Available actions (restart, drain node, clear cache, scale, rotate, page)
- Risk tolerance and any change-control constraints

Your job:

1. **Classify each alert** into: (a) auto-suppress/tune (noise), (b) auto-enrich then page, (c) safe auto-remediate, (d) always-page. Justify each placement.

2. **Enrichment first** — for actionable alerts, define the context to gather automatically before any human or machine acts: recent deploys, related alerts, dashboard snapshot, owning team, last similar incident. Enrichment is read-only and always safe.

3. **Safe auto-remediation candidates** — identify the small set of alerts where a single, reversible, well-understood action (e.g., restart a wedged worker, clear a full temp dir) is appropriate. For each, define the precondition checks, the action, and the verify-after step.

4. **Blast-radius limits** — cap how many times an auto-action runs in a window before it gives up and pages a human (e.g., restart at most twice in 30 min, else escalate). This prevents masking a real failure.

5. **Escalation criteria** — exactly when a machine hands off to a human, and what context it hands over.

6. **Closing the loop** — every auto-action posts what it did, why, and the result into the incident channel, fully auditable.

7. **Anti-patterns** — auto-restarting to hide a crash loop, suppressing alerts that should be fixed, actions with no verify step.

Output as: (1) the alert classification table, (2) an enrichment runbook per actionable alert, (3) the safe auto-remediation set, with precondition, action, verify step, and blast-radius cap for each, (4) escalation rules, (5) a 30-day metric plan tracking page volume, auto-resolve rate, and mean time to acknowledge (MTTA).

Be conservative: when in doubt, enrich-and-page rather than auto-act. Auto-remediation is a privilege earned by reversibility and clear preconditions.
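
Not part of the prompt text: the following is a minimal Python sketch of the kind of guarded auto-remediation that steps 3 to 5 describe, which may help when reviewing what the model returns. Everything in it is an assumption for illustration. The `Remediation` class, its field names, and the shape of the returned record are hypothetical and not taken from any particular tool. The defaults mirror the prompt's own example of at most two runs in 30 minutes.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional


@dataclass
class Remediation:
    """One safe auto-remediation: precondition, action, verify, and a run cap."""

    name: str
    precondition: Callable[[], bool]  # read-only check that the action is appropriate
    action: Callable[[], None]        # the single, reversible action
    verify: Callable[[], bool]        # confirms the symptom is gone afterwards
    max_runs: int = 2                 # cap from the prompt's example: twice...
    window_seconds: float = 30 * 60   # ...per 30 minutes
    _runs: Deque[float] = field(default_factory=deque, init=False, repr=False)

    def handle(self, context: Dict[str, str], now: Optional[float] = None) -> Dict[str, object]:
        """Run the remediation once, or escalate. Returns an audit/handoff record."""
        now = time.time() if now is None else now

        # Forget runs that have aged out of the window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()

        # Cap reached: stop acting and hand off rather than mask a real failure.
        if len(self._runs) >= self.max_runs:
            return self._record("escalate", "run cap reached", context)

        # Precondition not met: the situation is not the well-understood case.
        if not self.precondition():
            return self._record("escalate", "precondition failed", context)

        self._runs.append(now)
        self.action()

        # Verify-after step: an action with no verification is an anti-pattern.
        if self.verify():
            return self._record("resolved", "verify passed", context)
        return self._record("escalate", "verify failed", context)

    def _record(self, outcome: str, reason: str, context: Dict[str, str]) -> Dict[str, object]:
        # In practice this record would be posted to the incident channel.
        return {
            "remediation": self.name,
            "outcome": outcome,
            "reason": reason,
            "runs_in_window": len(self._runs),
            "context": context,  # enrichment gathered before acting
        }


if __name__ == "__main__":
    worker = {"healthy": False}  # stand-in for real service state

    restart = Remediation(
        name="restart-wedged-worker",
        precondition=lambda: not worker["healthy"],
        action=lambda: worker.update(healthy=True),
        verify=lambda: worker["healthy"],
    )
    print(restart.handle({"owning_team": "example-team"}))
```

The three escalation reasons (cap reached, precondition failed, verify failed) each return the enrichment context unchanged, so the human who gets paged sees what the automation saw. A real implementation would also need persistent run history and locking, which this sketch leaves out.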