Escalation Policy Gap and Single-Point-of-Failure Analysis Prompt
Audit your existing escalation policies and on-call schedules to find coverage gaps, dead-ends, and single points of failure where a page could go unanswered during a real incident.
- Target user: SRE managers and on-call program owners hardening their paging and escalation setup
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an SRE program lead who has investigated incidents that got worse because a page went nowhere. You audit escalation policies the way a safety engineer audits a fire-evacuation plan: assume each link can fail and check that the next one catches it.

I will provide:
- Our escalation policies (levels, timeouts, targets) exported from our paging tool
- On-call schedules and rotation membership
- Service-to-policy and team-to-service mappings
- Any history of unacknowledged or late-acknowledged pages

Your job:
1. **Trace every path**: for each service, walk the full escalation chain from first page to final fallback and confirm it terminates at a human who is guaranteed to be reachable.
2. **Find dead-ends**: flag any policy that escalates to an empty rotation, a disabled user, a deactivated channel, or itself in a loop.
3. **Single points of failure**: identify levels where only one person can be paged, or where the same person appears as primary and the only backup.
4. **Timeout sanity**: check that ack timeouts are short enough to matter for the service's severity, and that total time-to-final-escalation fits the SLO.
5. **Coverage gaps**: cross-reference schedules against the calendar for unstaffed windows, holiday gaps, time-zone blind spots, and overlapping primary assignments.
6. **Severity alignment**: verify that higher-severity services escalate faster and wider, and that no critical service relies on a best-effort policy.
7. **Remediation**: propose specific fixes (add a backup level, shorten a timeout, fill a gap) ranked by the risk each removes.

Output as:
(a) a per-service escalation trace with status (OK / risk / broken),
(b) a ranked list of single points of failure and dead-ends,
(c) a coverage-gap calendar summary,
(d) a prioritized remediation backlog.

Treat any path that can silently swallow a page as a critical finding.
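
Steps 1 through 4 of the prompt are mechanical enough that you may want to pre-check them in code and hand the model the results alongside the raw export. The following is a minimal Python sketch of that idea, not any paging tool's API: the `Level` and `Policy` shapes, the `fallback` field, the `audit` function, the user names, and the minute budgets are all hypothetical and would need mapping onto whatever your tool actually exports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Level:
    timeout_min: int      # minutes to wait for an ack before escalating further
    targets: list[str]    # user names paged at this level (assumed shape)


@dataclass
class Policy:
    levels: list[Level]
    fallback: Optional[str] = None  # hypothetical: next policy to hand off to


def audit(start, policies, active_users, budget_min):
    """Walk one escalation chain; return (status, findings, final_page_at)."""
    findings = []        # (severity, message) pairs
    visited = []         # policies seen so far, in order, for loop detection
    paged = set()        # everyone already paged earlier in the chain
    elapsed = 0          # minutes since the first page
    final_page_at = 0    # minute at which the last level is paged
    name = start
    while name is not None:
        if name in visited:
            findings.append(("broken", "loop: " + " -> ".join(visited + [name])))
            break
        visited.append(name)
        policy = policies.get(name)
        if policy is None or not policy.levels:
            findings.append(("broken", f"{name}: missing policy or no levels"))
            break
        for idx, level in enumerate(policy.levels, start=1):
            final_page_at = elapsed
            reachable = {t for t in level.targets if t in active_users}
            if not reachable:
                findings.append(("broken", f"{name} L{idx}: no active target"))
            elif reachable <= paged:
                findings.append(("risk", f"{name} L{idx}: adds no new responder"))
            elif len(reachable) == 1:
                who = next(iter(reachable))
                findings.append(("risk", f"{name} L{idx}: single responder {who}"))
            paged |= reachable
            elapsed += level.timeout_min
        name = policy.fallback
    if final_page_at > budget_min:
        findings.append(
            ("risk", f"final level paged at {final_page_at} min, budget {budget_min} min")
        )
    severities = {sev for sev, _ in findings}
    if "broken" in severities:
        status = "broken"
    elif "risk" in severities:
        status = "risk"
    else:
        status = "OK"
    return status, findings, final_page_at


# Made-up sample data: "dave" is a disabled user, "batch" falls back to itself.
POLICIES = {
    "checkout": Policy([Level(5, ["alice"]), Level(10, ["alice"])], fallback="platform"),
    "platform": Policy([Level(15, ["bob", "carol"]), Level(15, ["dave"])]),
    "batch": Policy([Level(30, ["erin"])], fallback="batch"),
    "search": Policy([Level(5, ["bob", "carol"]), Level(5, ["alice", "erin"])]),
}
ACTIVE = {"alice", "bob", "carol", "erin"}

if __name__ == "__main__":
    for service, budget in [("checkout", 20), ("batch", 60), ("search", 15)]:
        status, findings, final_at = audit(service, POLICIES, ACTIVE, budget)
        print(service, status, f"final level at {final_at} min")
        for severity, message in findings:
            print("  ", severity, message)
```

In this sample, `checkout` is reported as broken because its chain ends at a level whose only target is inactive, `batch` is broken because it falls back to itself, and `search` passes. The sketch deliberately leaves out schedules and calendars, so coverage gaps (step 5) and severity alignment (step 6) still need the prompt or a separate check.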