
Slack Outage Resilience & Graceful Degradation Prompt

Design fallback paths for when Slack itself is degraded or down — so alerts, approvals, and incident comms don't silently fail when your primary ChatOps surface is unavailable.

Target user: SREs who depend on Slack for critical alerting and comms
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE who has been burned by a Slack outage during an incident and has since designed comms that survive Slack being unavailable.

I will provide:
- What critical functions ride on Slack today (alerting, approvals, incident channels, on-call)
- Our alternate channels (email, SMS, PagerDuty, Teams, status page)
- How we detect Slack health today (if at all)

Your job:

1. **Blast-radius map** — list every critical function that assumes Slack is up and rank by what breaks if Slack is degraded vs fully down (delivery delays, dropped events, failed interactivity).

2. **Health detection** — monitor Slack reachability independently: synthetic `auth.test` / post-and-read probes, watch for elevated `429`/`5xx`, and consume Slack's status API. Distinguish "our app is broken" from "Slack is broken."

3. **Failover routing** — when Slack is unhealthy, automatically reroute critical-severity notifications to a backup channel (PagerDuty/SMS/email) with a clear "Slack degraded — sent via fallback" marker; suppress low-severity to avoid backup-channel flooding.

4. **Buffer & replay** — queue outbound messages durably so nothing is lost; on recovery, replay with idempotency keys and dedupe so users don't get a flood of stale posts.

5. **Approvals & interactivity** — for deploy/approval gates that normally use Slack buttons, define a documented out-of-band fallback (CLI approval, signed link) so releases aren't fully blocked.

6. **Incident comms continuity** — a pre-agreed fallback bridge (status page + conference line) and a runbook so responders know where to go when the incident channel is unreachable.

7. **Recovery & post-outage** — backfill the incident timeline, reconcile queued vs delivered, and review what degraded silently.

Output: (a) blast-radius table, (b) Slack health-probe design, (c) failover routing rules by severity, (d) durable buffer + idempotent replay design, (e) out-of-band approval + incident-comms runbook.

Bias toward: independent health detection, severity-aware failover, durable buffering with idempotent replay, and a written out-of-band runbook.
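Steps 2 and 3 of the prompt (health detection and severity-aware failover) can be sketched roughly as follows. This is a minimal illustration, not a reference design: the `ProbeResult` shape, the `classify` and `route` names, the four health states, and the `"pagerduty"` / `"buffer"` destinations are all hypothetical. It relies on one documented Slack behavior: Web API errors such as `invalid_auth` typically come back as HTTP 200 with `"ok": false`, which is what lets a probe separate "our app is broken" from "Slack is broken."

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class SlackHealth(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DOWN = "down"
    APP_FAULT = "app_fault"  # Slack answers normally, but rejects our app or token


@dataclass
class ProbeResult:
    """Outcome of one synthetic probe (for example an auth.test call)."""
    http_status: Optional[int]  # None means timeout or connection failure
    slack_ok: bool              # the "ok" field of the JSON response body


def classify(probes: List[ProbeResult], status_page_incident: bool) -> SlackHealth:
    """Classify a window of recent probes. Thresholds here are assumptions."""
    transport_failures = sum(
        1 for p in probes if p.http_status is None or p.http_status >= 500
    )
    throttled = sum(1 for p in probes if p.http_status == 429)
    app_errors = sum(1 for p in probes if p.http_status == 200 and not p.slack_ok)

    if probes and transport_failures == len(probes):
        return SlackHealth.DOWN
    if transport_failures or throttled or status_page_incident:
        return SlackHealth.DEGRADED
    if app_errors:
        # Slack replied with HTTP 200 but "ok": false, so the fault is on our side.
        return SlackHealth.APP_FAULT
    return SlackHealth.HEALTHY


FALLBACK_MARKER = "[Slack degraded, sent via fallback] "


def route(severity: str, text: str, health: SlackHealth) -> Tuple[str, str]:
    """Return (destination, message) for one notification."""
    if health is SlackHealth.HEALTHY:
        return ("slack", text)
    if severity == "critical":
        # Critical alerts go to the backup channel with a visible marker.
        return ("pagerduty", FALLBACK_MARKER + text)
    # Lower severities are held for later replay rather than flooding the backup.
    return ("buffer", text)
```

In this sketch `APP_FAULT` also triggers failover, since responders cannot be reached through Slack either way, but it should page whoever owns the integration rather than be reported as a Slack outage.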

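The buffer-and-replay idea in step 4 might look something like the sketch below, which uses a local SQLite table as the durable queue. The table layout, the `enqueue` and `replay` helpers, the key derivation, and the `max_age_s` staleness cutoff are all assumptions made for illustration; a real deployment would more likely use whatever durable queue it already operates.

```python
import hashlib
import sqlite3
from typing import Callable

PENDING, DELIVERED, EXPIRED = 0, 1, 2  # hypothetical row states


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the outbox. Use a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS outbox (
               idem_key   TEXT PRIMARY KEY,
               channel    TEXT NOT NULL,
               text       TEXT NOT NULL,
               created_at REAL NOT NULL,
               state      INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return db


def enqueue(db: sqlite3.Connection, event_id: str, channel: str,
            text: str, now: float) -> bool:
    """Queue a message once per (channel, event_id). Returns False on a duplicate."""
    key = hashlib.sha256(f"{channel}:{event_id}".encode()).hexdigest()
    cur = db.execute(
        "INSERT OR IGNORE INTO outbox (idem_key, channel, text, created_at) "
        "VALUES (?, ?, ?, ?)",
        (key, channel, text, now),
    )
    db.commit()
    return cur.rowcount == 1


def replay(db: sqlite3.Connection, send: Callable[[str, str], None],
           now: float, max_age_s: float = 3600.0) -> int:
    """Send pending messages oldest first, skipping ones older than max_age_s."""
    rows = db.execute(
        "SELECT idem_key, channel, text, created_at FROM outbox "
        "WHERE state = ? ORDER BY created_at",
        (PENDING,),
    ).fetchall()
    sent = 0
    for key, channel, text, created_at in rows:
        fresh = (now - created_at) <= max_age_s
        if fresh:
            # If send() raises, the row stays PENDING and is retried next time.
            send(channel, text)
            sent += 1
        db.execute(
            "UPDATE outbox SET state = ? WHERE idem_key = ?",
            (DELIVERED if fresh else EXPIRED, key),
        )
        db.commit()
    return sent
```

This gives at-least-once delivery: a crash between `send` and the state update could repeat one message, so the receiving side should also dedupe on the same key where it can. Keeping `EXPIRED` rows instead of deleting them leaves a record for reconciling queued versus delivered messages after the outage.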