Slack Alert Acknowledgment Policy & Timer Enforcement Prompt
Enforce alert acknowledgment SLAs in Slack — per-severity ack windows, escalation timers, secondary on-call paging, manager notification, and post-incident review.
- Target user: SRE leads tightening alert response discipline
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has implemented alert acknowledgment SLAs that turned a culture of "missed pages" into reliable response with healthy on-call satisfaction.

I will provide:
- On-call tool (PagerDuty / Opsgenie / Grafana OnCall)
- Severity definitions
- Current ack rate + miss rate
- On-call team structure (single / primary-secondary / follow-the-sun)
- Pain points (missed pages, slow ack, escalation chaos)

Your job:

1. **Ack policy per severity**:
   - **SEV1**: ack within 5 min; escalate at 10 min; manager at 15 min
   - **SEV2**: ack within 15 min; escalate at 30 min
   - **SEV3**: ack within 1 hour during business hours; no after-hours pages
   - **SEV4**: acknowledge in daily review; no immediate ack required
2. **Slack message anatomy with ack**: every alert message has:
   - Severity in title
   - Service + summary
   - Runbook link
   - **Ack button**: single-click, identifies the acker
   - **Snooze 15m**: for "I see it, give me a minute" without missing the ack
   - **Escalate now**: fast-track to secondary
   - **Open incident**: for SEV1/2 alerts that escalate to a full incident
   - Visible ack-timer countdown (when supported by Block Kit, otherwise text)
3. **Escalation timer** (a state machine, shown here for SEV1):
   - **T+0**: alert posted, primary paged via PagerDuty/Opsgenie
   - **T+5min**: bot checks ack state; if not acked, posts in channel "@here: alert not yet acked"
   - **T+10min**: on-call tool escalates and the secondary is paged; bot posts in channel
   - **T+15min**: manager paged; bot posts in channel + DM to manager
   - **T+30min**: director paged
4. **Ack action**:
   - Records: acker, timestamp, time-from-first-page
   - Updates message: "Acknowledged by @user at T+3m12s"
   - Stops the escalation chain
   - Posts thread reply with a "What are you doing about this?" prompt
   - If the primary has already acked and the secondary acks as well: log + thank, no escalation
5. **Snooze**, for "I see it, give me a minute":
   - 15-minute window; the escalation chain pauses (it is NOT cancelled)
   - If the primary doesn't fully ack within the snooze window, escalation resumes
   - Used for "switching context"
6. **Manager notification policy**, when the manager is paged:
   - DM with full incident context
   - In-channel notification (transparent, no surprise)
   - Manager joins or delegates to a senior engineer
7. **Post-incident review** (automated):
   - Bot tracks ack times for every page
   - Weekly digest to team: median + p95 ack time per severity
   - Quarterly: trend by individual (for coaching, not punishment)
   - Identify systemic issues (always-late on Tuesdays = standup conflict; always-fast = burnout signal)
8. **Burnout signals to watch**:
   - On-call consistently acks within 30 seconds (sleeping with phone)
   - Same person handles > 80% of incidents (rotation imbalance)
   - Slow acks correlating with denied vacation requests
   - Ack rate dropping over weeks for an individual
   - Surface these to manager + on-call lead for intervention
9. **What NOT to do**:
   - Page-and-shame culture
   - Punishing slow ack without root-cause analysis (overwork? sick? bad alert quality?)
   - Auto-escalating SEV3/4 (creates page fatigue)
   - Strict timers without on-call tool integration (must enforce in PagerDuty/Opsgenie, not just Slack)
   - No mechanism for "I'm sick / cover for me"
10. **Edge cases**:
    - Two people ack simultaneously → first wins, second logged
    - Ack from mobile vs desktop (both should work)
    - Ack from non-on-call person → log + thank, but escalation continues until on-call acks
    - PagerDuty/Opsgenie ack happens outside Slack → bot updates message via webhook

Output as: (a) ack policy per severity table, (b) Block Kit alert message with ack buttons, (c) escalation state machine, (d) ack/snooze action logic, (e) manager notification flow, (f) review report schema, (g) burnout-signal detection, (h) edge case handling.

Bias toward: clear timeline expectations, automation that enforces both responsiveness AND health, and surfacing burnout signals early.
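
For orientation, the escalation timer, ack action, and snooze rules in items 3 to 5 of the prompt can be sketched as a small pure-Python model. This is only an illustration of the logic a good response might contain, not an integration with Slack, PagerDuty, or Opsgenie. Every name here (`ESCALATION_STEPS`, `Alert`, `due_actions`, `acknowledge`) is hypothetical, time is passed in as plain minutes since the first page, and the SEV2 channel reminder at 15 minutes is an assumption (the prompt only states the SEV2 ack window and the 30-minute escalation).

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# (minutes since first page, action) per severity.
# SEV1 steps follow the escalation timer in the prompt.
# SEV2: the 15-minute reminder is an assumption; the 30-minute escalation is from the policy.
# SEV3/SEV4 are absent on purpose: the prompt says not to auto-escalate them.
ESCALATION_STEPS = {
    "SEV1": [(5, "remind_channel"), (10, "page_secondary"),
             (15, "page_manager"), (30, "page_director")],
    "SEV2": [(15, "remind_channel"), (30, "page_secondary")],
}
SNOOZE_MIN = 15  # length of the "Snooze 15m" window


@dataclass
class Alert:
    severity: str
    acked_by: Optional[str] = None
    acked_at_min: Optional[float] = None
    snoozed_at_min: Optional[float] = None  # when Snooze was clicked, if ever


def effective_elapsed(alert: Alert, now_min: float) -> float:
    """Minutes that count toward escalation, excluding time paused by snooze."""
    if alert.snoozed_at_min is None:
        return now_min
    paused = min(max(now_min - alert.snoozed_at_min, 0), SNOOZE_MIN)
    return now_min - paused


def due_actions(alert: Alert, now_min: float) -> List[str]:
    """All escalation actions whose threshold has passed (cumulative, not only new ones)."""
    if alert.acked_by is not None:
        return []  # an on-call ack stops the chain
    elapsed = effective_elapsed(alert, now_min)
    steps = ESCALATION_STEPS.get(alert.severity, [])
    return [action for threshold, action in steps if threshold <= elapsed]


def acknowledge(alert: Alert, user: str, now_min: float, on_call: Set[str]) -> str:
    """Apply an ack click and return the text the bot would post."""
    if alert.acked_by is not None:
        # First ack wins; later acks are only logged.
        return f"Already acknowledged by @{alert.acked_by}; logged ack from @{user}"
    if user not in on_call:
        # Non-on-call ack: thank the user, keep escalating.
        return f"Thanks @{user}; logged, escalation continues until on-call acks"
    alert.acked_by = user
    alert.acked_at_min = now_min
    minutes, seconds = divmod(round(now_min * 60), 60)
    return f"Acknowledged by @{user} at T+{minutes}m{seconds:02d}s"
```

A scheduler would call `due_actions` on each tick and fire only the actions it has not already sent. In a real deployment the paging itself should stay in the on-call tool, as the prompt's "What NOT to do" list insists, with the bot mirroring state into Slack.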