
Slack Alert Routing & Escalation Design Prompt

Design severity-aware Slack alert routing — channel taxonomy, on-call rotations, escalation timing, alert suppression, and runbook linking.

Target user: SREs and platform engineers building production alert pipelines
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who has built production alerting and on-call workflows that route through Slack at scale, integrating with Alertmanager, PagerDuty, and OpsGenie.

I will provide:
- Our current alert sources (Prometheus, Datadog, CloudWatch, Sentry, etc.)
- Existing Slack workspace structure (channels, user groups)
- Team / service ownership map
- Severity definitions (SEV1–SEV4)
- Known pain points (alert fatigue, missed pages, channel chaos)

Your job:

1. **Channel taxonomy**:
   - **#alerts-prod-<service>** — service-scoped, machine-only
   - **#incidents-active** — high-signal, humans subscribe
   - **#incidents-<id>** — temporary, auto-created per SEV1/2
   - **#alerts-warnings-<service>** — low signal, opt-in
   - **#oncall-handoff** — daily summary thread

2. **Severity → routing matrix** — table mapping each SEV to: Slack channel(s), PagerDuty service, escalation policy, paging vs notify-only, business-hours exception rules.

3. **Escalation timing**: concrete timeouts (for example, for SEV1: acknowledge within 5 min, escalate to the secondary at 10 min, and to the manager at 20 min). Show how to encode each timeout in PagerDuty/OpsGenie and how Slack reflects the state through reactions and threading.

4. **Alert suppression** — silence windows for known maintenance, dependency-based suppression (don't fire children if parent is down), flap detection, dedup keys.

5. **Runbook linking**: every alert message must include a runbook URL keyed by alert name. Show how to enforce this with alerting-rule annotations that Alertmanager passes through to Slack, plus a CI check that fails the PR if the URL is missing.

6. **Anti-patterns to avoid** — single firehose channel, missing severity in title, no link back to dashboard/query, paging humans for warnings, no auto-close on resolution.

7. **Validation plan** — synthetic alert tests, weekly chaos drill, monthly review of MTTA/MTTR per alert family, prune rule for alerts that never fire or always self-resolve.

Output as: (a) channel taxonomy with naming conventions, (b) severity routing matrix, (c) escalation policy YAML/JSON ready for PagerDuty/OpsGenie, (d) Alertmanager `route:` config example, (e) runbook URL annotation enforcement, (f) 30-day rollout plan with success metrics.

Bias toward fewer high-signal channels over many noisy ones. Every alert message must answer: what broke, where, how bad, who owns, how to fix.
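
To make item 5 concrete, here is a minimal sketch of the kind of CI check the prompt asks for, so you have something to compare the model's answer against. It assumes the rule file has already been parsed (for example with a YAML library) into the usual Prometheus layout of `groups` containing `rules`. The `runbook_url` annotation key is a common convention rather than anything Prometheus or Alertmanager requires. The function name, the example rules, and the `runbooks.example.com` host are all hypothetical.

```python
from typing import Any

# Assumption: the team standardises on this annotation key. It is a widely
# used convention, not a requirement of Prometheus or Alertmanager.
RUNBOOK_KEY = "runbook_url"


def find_missing_runbooks(rule_file: dict[str, Any]) -> list[str]:
    """Return "group/alert" names for alerting rules without a runbook URL.

    `rule_file` is the already-parsed content of a Prometheus rule file:
    {"groups": [{"name": ..., "rules": [...]}]}.
    """
    missing = []
    for group in rule_file.get("groups") or []:
        for rule in group.get("rules") or []:
            name = rule.get("alert")
            if name is None:
                continue  # recording rule ("record:"), nothing to link
            annotations = rule.get("annotations") or {}
            url = str(annotations.get(RUNBOOK_KEY, "")).strip()
            # Only an http(s) URL counts; blank or placeholder values fail.
            if not url.startswith(("http://", "https://")):
                missing.append(f"{group.get('name', '?')}/{name}")
    return missing


# Hypothetical rule file, shown as the dict a YAML parser would produce.
EXAMPLE = {
    "groups": [
        {
            "name": "checkout",
            "rules": [
                {
                    "alert": "CheckoutHighErrorRate",
                    "labels": {"severity": "sev1"},
                    "annotations": {
                        "runbook_url": "https://runbooks.example.com/CheckoutHighErrorRate"
                    },
                },
                {
                    "alert": "CheckoutQueueBacklog",
                    "labels": {"severity": "sev3"},
                    "annotations": {"summary": "Queue is backing up"},
                },
                {"record": "job:http_errors:rate5m", "expr": "..."},
            ],
        }
    ]
}

print(find_missing_runbooks(EXAMPLE))  # -> ['checkout/CheckoutQueueBacklog']
```

In a pipeline, a wrapper script would run this over every changed rule file and exit non-zero when the returned list is non-empty, which is what fails the PR.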
