Escalation Matrix and On-Call Policy Builder Prompt
Design an escalation matrix and on-call escalation policy that routes incidents to the right responder at the right time, with sane timeouts, fallbacks, and severity-based skip-levels so nothing dies unacknowledged at 3am.
- Target user: On-call program owners and SRE managers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an SRE manager who has designed on-call escalation for teams spanning timezones and severities, and you know the two failure modes: pages that die unacknowledged, and pages that wake the wrong people.

I will provide:
- Team structure, timezones, and on-call rotations
- Service tiers/SLOs and their owning teams
- Paging tooling and notification channels available
- Severity definitions and any contractual response SLAs

Your job:
1. **Escalation layers**: define the ordered tiers: primary on-call, secondary, team lead, IC pool, leadership. For each, the acknowledge timeout before auto-escalation and the channels used per tier (push → SMS → phone).
2. **Severity-driven branching**: show how SEV1 skips slow tiers and pages IC + leadership immediately, while SEV3 stays within the primary tier with gentle timeouts. Build a matrix of severity × tier × timeout.
3. **Routing by service**: map each service tier to its owning rotation, and define the fallback when the owning team has no responder (catch-all rotation, never a dead end).
4. **No-dead-ends rule**: guarantee every path eventually reaches a human; define the final backstop that always answers.
5. **Timezone and follow-the-sun**: handle handoffs across regions and avoid paging someone at 3am when a daytime region is covering.
6. **Anti-fatigue guardrails**: limits on consecutive pages, mandatory rest after a rough on-call night, and auto-quieting of known-noise during a declared incident.
7. **Acknowledge and re-page logic**: what counts as acknowledged, and how an unacked or stalled incident re-pages and climbs.
8. **Validation**: replay recent incidents to confirm each would have reached an awake, responsible human within SLA.

Output as:
(a) the severity × tier × timeout matrix,
(b) per-service routing tables with fallbacks,
(c) the escalation policy expressed as ordered rules ready to translate into your pager tool,
(d) anti-fatigue guardrails,
(e) a validation replay against recent incidents.
Bias toward: no dead ends, severity-appropriate urgency, protecting responder sleep, every path reaching an awake human within SLA.
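Separately from the prompt text, here is a rough sketch of what outputs (a) and (c) might look like once translated into data, together with a check for the no-dead-ends rule. Everything in it is an assumption made for illustration: the tier names are lifted from the prompt's list, while the `backstop` tier, the timeout values, and the `Step`, `POLICY`, `page_times`, and `dead_ends` names are hypothetical and do not correspond to any particular pager tool's API.

```python
from dataclasses import dataclass
from typing import Optional

BACKSTOP = "backstop"  # hypothetical always-staffed final tier


@dataclass(frozen=True)
class Step:
    tiers: tuple[str, ...]           # tiers paged together at this step
    ack_timeout_min: Optional[int]   # minutes to wait for an ack; None = final step


# Assumed severity x tier x timeout matrix, expressed as ordered rules.
# All timeout values are placeholders, not recommendations.
POLICY: dict[str, list[Step]] = {
    # SEV1 skips the slow climb: primary, IC pool, and leadership are paged at once.
    "SEV1": [
        Step(("primary", "ic_pool", "leadership"), 5),
        Step(("secondary", "team_lead"), 5),
        Step((BACKSTOP,), None),
    ],
    "SEV2": [
        Step(("primary",), 5),
        Step(("secondary",), 10),
        Step(("team_lead", "ic_pool"), 10),
        Step((BACKSTOP,), None),
    ],
    # SEV3 stays with the primary (one re-page) before the backstop is reached.
    "SEV3": [
        Step(("primary",), 30),
        Step(("primary",), 60),
        Step((BACKSTOP,), None),
    ],
}


def page_times(severity: str) -> list[tuple[int, tuple[str, ...]]]:
    """Return (minutes since incident start, tiers paged) if nobody ever acks."""
    elapsed = 0
    schedule = []
    for step in POLICY[severity]:
        schedule.append((elapsed, step.tiers))
        if step.ack_timeout_min is None:
            break
        elapsed += step.ack_timeout_min
    return schedule


def dead_ends(policy: dict[str, list[Step]]) -> list[str]:
    """List severities whose last step is not an open-ended page to the backstop."""
    problems = []
    for severity, steps in policy.items():
        last = steps[-1] if steps else None
        if (
            last is None
            or BACKSTOP not in last.tiers
            or last.ack_timeout_min is not None
        ):
            problems.append(severity)
    return problems
```

With these placeholder values, `page_times("SEV2")` pages at 0, 5, 15, and 25 minutes, and `dead_ends(POLICY)` returns an empty list. A policy whose last step still has a timeout, or never reaches the backstop, would be flagged by severity name.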