AI for Incident Response

Escalation Matrix and On-Call Policy Builder Prompt

Design an escalation matrix and on-call escalation policy that routes incidents to the right responder at the right time, with sane timeouts, fallbacks, and severity-based skip-levels so nothing dies unacknowledged at 3am.

Target user: On-call program owners and SRE managers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE manager who has designed on-call escalation for teams spanning timezones and severities, and you know the two failure modes: pages that die unacknowledged, and pages that wake the wrong people.

I will provide:
- Team structure, timezones, and on-call rotations
- Service tiers/SLOs and their owning teams
- Paging tooling and notification channels available
- Severity definitions and any contractual response SLAs

Your job:

1. **Escalation layers** — define the ordered tiers: primary on-call, secondary, team lead, IC pool, leadership. For each tier, specify the acknowledgment timeout before auto-escalation and the notification channels used, in order (push → SMS → phone).

2. **Severity-driven branching** — show how SEV1 skips slow tiers and pages IC + leadership immediately, while SEV3 stays within the primary tier with gentle timeouts. Build a matrix of severity × tier × timeout.

3. **Routing by service** — map each service tier to its owning rotation, and define the fallback when the owning team has no responder (catch-all rotation, never a dead end).

4. **No-dead-ends rule** — guarantee every path eventually reaches a human; define the final backstop that always answers.

5. **Timezone and follow-the-sun** — handle handoffs across regions and avoid paging someone at 3am when a daytime region is covering.

6. **Anti-fatigue guardrails** — limits on consecutive pages, mandatory rest after a rough on-call night, and auto-quieting of known noisy alerts during a declared incident.

7. **Acknowledge and re-page logic** — what counts as acknowledged, and how an unacked or stalled incident re-pages and climbs.

8. **Validation** — replay recent incidents to confirm each would have reached an awake, responsible human within SLA.

Output as: (a) the severity × tier × timeout matrix, (b) per-service routing tables with fallbacks, (c) the escalation policy expressed as ordered rules ready to translate into your pager tool, (d) anti-fatigue guardrails, (e) a validation replay against recent incidents.

Bias toward: no dead ends, severity-appropriate urgency, protecting responder sleep, every path reaching an awake human within SLA.
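
To give a feel for the kind of output the prompt asks for in (a) through (c), here is a minimal Python sketch of a severity × tier × timeout matrix, a routing table with a catch-all fallback, and the resulting page schedule when nobody acknowledges. Every value in it is an assumption for illustration: the timeouts, the rotation names (`payments-oncall`, `platform-oncall`, `catch-all`), the service tier labels, and the helper names `page_schedule` and `rotation_for` are hypothetical and do not come from any particular pager tool.

```python
# Illustrative only: all timeouts and rotation names below are assumptions.

# Each severity maps to ordered steps: (tiers paged together, ack timeout in minutes).
# SEV1 pages primary, IC pool, and leadership at once; SEV3 stays with primary.
MATRIX = {
    "SEV1": [(("primary", "ic_pool", "leadership"), 5),
             (("secondary", "team_lead"), 5)],
    "SEV2": [(("primary",), 10),
             (("secondary",), 10),
             (("team_lead",), 15),
             (("ic_pool",), 15)],
    "SEV3": [(("primary",), 30),
             (("primary",), 60)],  # re-page primary with a gentler timeout
}

# Final backstop appended to every path so no page ends without a human.
BACKSTOP = "catch_all"

# Hypothetical service tier -> owning rotation mapping.
ROUTING = {"tier1": "payments-oncall", "tier2": "platform-oncall"}
CATCH_ALL_ROTATION = "catch-all"


def page_schedule(severity):
    """Return [(minutes_after_trigger, tiers_paged)] assuming nobody acknowledges."""
    schedule = []
    elapsed = 0
    for tiers, timeout in MATRIX[severity]:
        schedule.append((elapsed, tiers))
        elapsed += timeout
    # No-dead-ends rule: the backstop is always the last step.
    schedule.append((elapsed, (BACKSTOP,)))
    return schedule


def rotation_for(service_tier, staffed_rotations):
    """Pick the owning rotation, or the catch-all if it is unknown or unstaffed."""
    owner = ROUTING.get(service_tier)
    if owner is not None and owner in staffed_rotations:
        return owner
    return CATCH_ALL_ROTATION


if __name__ == "__main__":
    for sev in MATRIX:
        print(sev, page_schedule(sev))
    print(rotation_for("tier1", {"platform-oncall"}))
```

Under these assumed numbers, an unacknowledged SEV1 reaches the backstop 10 minutes after trigger, while a SEV3 takes 90 minutes. Comparing those worst-case times against your contractual response SLAs is a quick first pass at the validation step before replaying real incidents.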
