Incident Acknowledgment SLA Compliance Audit Prompt
Audit how reliably your on-call program meets page-acknowledgment and first-response SLAs, find where the clock is slipping, and design enforceable targets per severity.
- Target user: SRE leads and incident program managers owning on-call SLAs
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an SRE program lead who has rebuilt on-call acknowledgment SLAs for teams paging thousands of times a month. You are rigorous about separating "the page was late" from "the human was late."

I will provide:
- Per-page records (page time, ack time, escalation time, resolve time, severity)
- Current SLA targets (if any) per severity
- Escalation policy (primary → secondary → manager, timeouts)
- On-call rotation and team size
- Known pain points (missed pages, slow acks, over-escalation)

Do the following:
1. **Define the clock precisely**: distinguish page-sent → page-delivered → acknowledged → first-action → mitigated. Tell me which gaps the data I provided can and cannot measure, and what instrumentation is missing.
2. **Compute compliance per severity**: for each severity, give p50/p90/p99 time-to-acknowledge and the % of pages meeting a candidate SLA. Flag the long-tail pages and group them by likely cause (asleep, no signal, tool failure, alert ignored).
3. **Root-cause the misses**: separate human-factor misses (notification settings, no backup, fatigue) from system misses (delivery delay, wrong escalation timeout, paging the wrong team).
4. **Recommend SLA targets**: propose realistic, severity-tiered ack and first-response targets, justified by the measured p90 data and industry norms, not aspiration. Specify when auto-escalation should fire.
5. **Design enforcement**: a weekly compliance report, a per-rotation scorecard, and a "miss review" ritual that is corrective, not punitive.
6. **Reduce the misses**: concrete fixes such as redundant notification channels, escalation-timeout tuning, ack-from-anywhere, and removing pages that should never have fired.

Output: (a) a compliance summary table per severity, (b) a ranked list of top miss causes with fixes, (c) proposed SLA targets with rationale, (d) the weekly report spec, (e) a 30-day rollout plan.

Be honest where the data is too thin to support a conclusion, and say what to instrument first.
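A note outside the prompt itself: the step 2 numbers can be sanity-checked locally before or after running the prompt. The sketch below is one possible way to compute per-severity time-to-acknowledge percentiles and SLA compliance, not a prescribed method. The record keys (`severity`, `page_s`, `ack_s`), the nearest-rank percentile definition, the sample pages, and the candidate targets are all assumptions made for illustration; adapt them to whatever your paging tool exports.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def ack_compliance(pages, sla_seconds):
    """Summarize time-to-acknowledge per severity.

    pages: dicts with hypothetical keys "severity", "page_s", "ack_s"
           (seconds; ack_s is None when the page was never acknowledged).
    sla_seconds: candidate ack target per severity, in seconds.
    """
    acked = defaultdict(list)  # time-to-ack for acknowledged pages only
    total = defaultdict(int)   # every page, acknowledged or not
    for page in pages:
        sev = page["severity"]
        total[sev] += 1
        if page["ack_s"] is not None:
            acked[sev].append(page["ack_s"] - page["page_s"])

    summary = {}
    for sev, count in total.items():
        tta = acked[sev]
        met = sum(1 for t in tta if t <= sla_seconds[sev])
        summary[sev] = {
            "pages": count,
            "unacked": count - len(tta),
            # Percentiles cover acknowledged pages only.
            "p50": percentile(tta, 50) if tta else None,
            "p90": percentile(tta, 90) if tta else None,
            "p99": percentile(tta, 99) if tta else None,
            # Unacknowledged pages count as misses in the compliance rate.
            "pct_within_sla": round(100 * met / count, 1),
        }
    return summary


# Made-up sample records and candidate targets, for illustration only.
pages = [
    {"severity": "sev1", "page_s": 0, "ack_s": 60},
    {"severity": "sev1", "page_s": 0, "ack_s": 120},
    {"severity": "sev1", "page_s": 0, "ack_s": 180},
    {"severity": "sev1", "page_s": 0, "ack_s": 900},
    {"severity": "sev1", "page_s": 0, "ack_s": None},
    {"severity": "sev2", "page_s": 100, "ack_s": 400},
    {"severity": "sev2", "page_s": 100, "ack_s": 1300},
]
candidate_sla = {"sev1": 300, "sev2": 900}

summary = ack_compliance(pages, candidate_sla)
for sev in sorted(summary):
    print(sev, summary[sev])
```

Reporting unacknowledged pages separately, while still counting them as misses, keeps the percentiles from looking better than the on-call experience actually was. With only a handful of pages per severity, p90 and p99 collapse onto the single slowest acknowledgment, which is the kind of thin-data case the prompt asks the model to call out.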