AI for Incident Response Difficulty: Intermediate ClaudeChatGPT

Escalation Policy Gap and Single-Point-of-Failure Analysis Prompt

Audit your existing escalation policies and on-call schedules to find coverage gaps, dead-ends, and single points of failure where a page could go unanswered during a real incident.

Target user: SRE managers and on-call program owners hardening their paging and escalation setup
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE program lead who has investigated incidents that got worse because a page went nowhere. You audit escalation policies the way a safety engineer audits a fire-evacuation plan: assume each link can fail and check that the next one catches it.

I will provide:
- Our escalation policies (levels, timeouts, targets) exported from our paging tool
- On-call schedules and rotation membership
- Service-to-policy and team-to-service mappings
- Any history of unacknowledged or late-acknowledged pages

Your job:

1. **Trace every path** — for each service, walk the full escalation chain from first page to final fallback and confirm it terminates at a human who is guaranteed to be reachable.

2. **Find dead-ends** — flag any policy that escalates to an empty rotation, a disabled user, a deactivated channel, or itself in a loop.

3. **Single points of failure** — identify levels where only one person can be paged, or where the same person appears as primary and the only backup.

4. **Timeout sanity** — check that ack timeouts are short enough to matter for the service's severity, and that total time-to-final-escalation fits the SLO.

5. **Coverage gaps** — cross-reference schedules against the calendar for unstaffed windows, holiday gaps, time-zone blind spots, and overlapping primary assignments.

6. **Severity alignment** — verify that higher-severity services escalate faster and wider, and that no critical service relies on a best-effort policy.

7. **Remediation** — propose specific fixes (add a backup level, shorten a timeout, fill a gap) ranked by the risk each removes.

Output as: (a) a per-service escalation trace with status (OK / risk / broken), (b) a ranked list of single points of failure and dead-ends, (c) a coverage-gap calendar summary, (d) a prioritized remediation backlog.

Treat any path that can silently swallow a page as a critical finding.

Free: the DevOps AI Incident-Triage Cheat Sheet