Slack Outage Resilience & Graceful Degradation Prompt
Design fallback paths for when Slack itself is degraded or down — so alerts, approvals, and incident comms don't silently fail when your primary ChatOps surface is unavailable.
- Target user: SREs who depend on Slack for critical alerting and comms
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has been burned by a Slack outage during an incident and has since designed comms that survive Slack being unavailable.

I will provide:
- What critical functions ride on Slack today (alerting, approvals, incident channels, on-call)
- Our alternate channels (email, SMS, PagerDuty, Teams, status page)
- How we detect Slack health today (if at all)

Your job:
1. **Blast-radius map**: list every critical function that assumes Slack is up, and rank each by what breaks when Slack is degraded vs fully down (delivery delays, dropped events, failed interactivity).
2. **Health detection**: monitor Slack reachability independently with synthetic `auth.test` / post-and-read probes, watch for elevated `429`/`5xx` rates, and consume Slack's status API. Distinguish "our app is broken" from "Slack is broken."
3. **Failover routing**: when Slack is unhealthy, automatically reroute critical-severity notifications to a backup channel (PagerDuty/SMS/email) with a clear "Slack degraded: sent via fallback" marker; suppress low-severity notifications to avoid flooding the backup channel.
4. **Buffer & replay**: queue outbound messages durably so nothing is lost; on recovery, replay with idempotency keys and dedupe so users don't get a flood of stale posts.
5. **Approvals & interactivity**: for deploy/approval gates that normally use Slack buttons, define a documented out-of-band fallback (CLI approval, signed link) so releases aren't fully blocked.
6. **Incident comms continuity**: define a pre-agreed fallback bridge (status page + conference line) and a runbook so responders know where to go when the incident channel is unreachable.
7. **Recovery & post-outage**: backfill the incident timeline, reconcile queued vs delivered messages, and review what degraded silently.

Output: (a) blast-radius table, (b) Slack health-probe design, (c) failover routing rules by severity, (d) durable buffer + idempotent replay design, (e) out-of-band approval + incident-comms runbook.
Bias toward: independent health detection, severity-aware failover, durable buffering with idempotent replay, and a written out-of-band runbook.
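Example sketch

When reviewing what the model returns, it can help to have a small reference for steps 2 to 4 of the prompt. The Python below is a minimal sketch, not a definitive implementation. Every name in it (`ProbeResult`, `classify`, `route`, `pending_replay`, the 50% failure threshold, the channel names) is a hypothetical placeholder. It assumes the probe results come from calls to Slack's `auth.test` method made elsewhere, and it treats `429` as a platform-side signal, which may not hold if your own app is the one exceeding rate limits.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set, Tuple


class Health(Enum):
    HEALTHY = "healthy"
    APP_BROKEN = "app_broken"          # our token or config is at fault
    SLACK_DEGRADED = "slack_degraded"  # the platform itself looks unhealthy


@dataclass(frozen=True)
class ProbeResult:
    http_status: Optional[int]  # None means timeout or connection error
    ok: bool                    # "ok" field of the auth.test response body
    error: str = ""             # "error" field, e.g. "invalid_auth"


@dataclass(frozen=True)
class Message:
    event_id: str   # stable ID from the alert source (assumed to exist)
    channel: str
    severity: str   # "critical" or anything lower
    text: str


# Slack error codes that point at our credentials rather than at Slack.
APP_ERRORS = {"invalid_auth", "not_authed", "token_revoked", "account_inactive"}
FALLBACK_MARKER = "[Slack degraded: sent via fallback] "


def classify(probes: List[ProbeResult], threshold: float = 0.5) -> Health:
    """Separate 'our app is broken' from 'Slack is broken'."""
    if not probes:
        raise ValueError("need at least one probe result")
    # Slack answered with HTTP 200 but rejected our credentials: our problem.
    if any(p.http_status == 200 and not p.ok and p.error in APP_ERRORS
           for p in probes):
        return Health.APP_BROKEN
    # Timeouts, 429s, and 5xx responses count as platform-side failures.
    failures = sum(
        1 for p in probes
        if p.http_status is None or p.http_status == 429 or p.http_status >= 500
    )
    if failures / len(probes) >= threshold:
        return Health.SLACK_DEGRADED
    return Health.HEALTHY


def route(msg: Message, health: Health) -> Tuple[List[str], str]:
    """Return (destinations, text) for one message, by severity."""
    if health is Health.HEALTHY:
        return ["slack"], msg.text
    if msg.severity == "critical":
        return ["pagerduty", "sms"], FALLBACK_MARKER + msg.text
    # Lower severities are held in the durable buffer, not sent to fallback.
    return ["buffer"], msg.text


def idempotency_key(msg: Message) -> str:
    """Same source event and channel always yields the same key."""
    return hashlib.sha256(f"{msg.channel}|{msg.event_id}".encode()).hexdigest()


def pending_replay(queued: List[Message], delivered: Set[str]) -> List[Message]:
    """Drop messages already delivered and duplicates within the queue."""
    seen = set(delivered)
    out = []
    for msg in queued:
        key = idempotency_key(msg)
        if key in seen:
            continue
        seen.add(key)
        out.append(msg)
    return out
```

A real design would persist the queue and the delivered-key set in durable storage and would also age out stale messages on replay; both are left out here to keep the sketch short.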