Slack Alert Routing & Escalation Design Prompt
Design severity-aware Slack alert routing — channel taxonomy, on-call rotations, escalation timing, alert suppression, and runbook linking.
- Target user: SREs and platform engineers building production alert pipelines
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has built production alerting and on-call workflows that route through Slack at scale, integrating with Alertmanager, PagerDuty, and OpsGenie.

I will provide:
- Our current alert sources (Prometheus, Datadog, CloudWatch, Sentry, etc.)
- Existing Slack workspace structure (channels, user groups)
- Team / service ownership map
- Severity definitions (SEV1–SEV4)
- Known pain points (alert fatigue, missed pages, channel chaos)

Your job:
1. **Channel taxonomy**:
   - **#alerts-prod-<service>**: service-scoped, machine-only
   - **#incidents-active**: high-signal, humans subscribe
   - **#incidents-<id>**: temporary, auto-created per SEV1/2
   - **#alerts-warnings-<service>**: low signal, opt-in
   - **#oncall-handoff**: daily summary thread
2. **Severity → routing matrix**: a table mapping each SEV to Slack channel(s), PagerDuty service, escalation policy, paging vs notify-only, and business-hours exception rules.
3. **Escalation timing**: concrete timeouts (ack within 5 min for SEV1, escalate to secondary at 10 min, manager at 20 min). Show how to encode each in PagerDuty/OpsGenie and how Slack reflects the state with reactions / threading.
4. **Alert suppression**: silence windows for known maintenance, dependency-based suppression (don't fire children if parent is down), flap detection, dedup keys.
5. **Runbook linking**: every alert message must include a runbook URL keyed by alert name. Show how to enforce this in Alertmanager annotations, plus a CI check that fails the PR if the URL is missing.
6. **Anti-patterns to avoid**: single firehose channel, missing severity in title, no link back to dashboard/query, paging humans for warnings, no auto-close on resolution.
7. **Validation plan**: synthetic alert tests, weekly chaos drill, monthly review of MTTA/MTTR per alert family, prune rule for alerts that never fire or always self-resolve.
Output as:
(a) channel taxonomy with naming conventions
(b) severity routing matrix
(c) escalation policy YAML/JSON ready for PagerDuty/OpsGenie
(d) Alertmanager `route:` config example
(e) runbook URL annotation enforcement
(f) 30-day rollout plan with success metrics

Bias toward fewer high-signal channels over many noisy ones. Every alert message must answer: what broke, where, how bad it is, who owns it, and how to fix it.
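
For a sense of what a good answer to the runbook-enforcement item might look like, here is a minimal sketch of a CI check in Python. It is an illustration under stated assumptions, not output the prompt guarantees. It assumes the rule file has already been parsed into a dict (for example with a YAML parser) in the usual Prometheus shape of `groups` containing `rules`, and that runbooks live under a `runbook_url` annotation. That key is a common convention rather than something Prometheus requires. The function names, alert names, and the `runbooks.example.com` URL are hypothetical.

```python
from typing import Any

# Assumed annotation key; a widespread convention, not mandated by Prometheus.
RUNBOOK_KEY = "runbook_url"


def alerts_missing_runbook(rule_file: dict[str, Any]) -> list[str]:
    """Return the names of alerting rules with no non-empty runbook annotation."""
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules use "record" and need no runbook
            url = (rule.get("annotations") or {}).get(RUNBOOK_KEY, "")
            if not str(url).strip():
                missing.append(name)
    return missing


def check(rule_file: dict[str, Any]) -> int:
    """Print each offending alert and return a CI-style exit code (0 = pass)."""
    bad = alerts_missing_runbook(rule_file)
    for name in bad:
        print(f"missing {RUNBOOK_KEY}: {name}")
    return 1 if bad else 0


# Hypothetical parsed rule file with one compliant alert, one offender,
# and one recording rule that the check should ignore.
example = {
    "groups": [
        {
            "name": "checkout",
            "rules": [
                {
                    "alert": "CheckoutHighErrorRate",
                    "labels": {"severity": "sev1"},
                    "annotations": {
                        "runbook_url": "https://runbooks.example.com/CheckoutHighErrorRate"
                    },
                },
                {"alert": "CheckoutLatencyWarning", "labels": {"severity": "sev3"}},
                {"record": "job:http_requests:rate5m", "expr": "..."},
            ],
        }
    ]
}
```

In a pipeline, a small wrapper would parse each changed rule file and pass the result of `check(...)` to `sys.exit`, so a PR that adds an alert without a runbook fails before merge.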