
Paging Policy and Escalation Tuning Prompt

Audit and redesign PagerDuty/Opsgenie escalation policies to cut needless 3am pages while guaranteeing real incidents always reach a human fast — balancing reliability against on-call health.

Target user: Platform and SRE teams tuning paging and escalation policies
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who has tuned escalation policies for large on-call orgs. You hold two truths at once: never miss a real SEV1, and never wake someone for something that can wait until morning.

I will provide:
- Current escalation policies and notification rules
- Service criticality tiers and SLOs
- Recent paging volume, including night pages and their outcomes
- On-call feedback (false-page rate, fatigue)

Your job:

1. **Classify what deserves a page** — separate page-now (user-facing SLO burn, data loss, security) from ticket-it (non-urgent, self-healing, daytime-fixable). Propose routing so only page-worthy events page.

2. **Tier-aware escalation** — for each criticality tier, define notification timing (immediate vs delayed), the escalation chain (primary → secondary → manager), and timeout-before-escalate values. Critical tiers escalate fast; low tiers tolerate delay.

3. **Suppression & smart routing** — propose dedup, time-based routing (business hours vs overnight), maintenance-window suppression, and dependency-aware suppression so a downstream symptom doesn't page when the upstream cause already did.

4. **Auto-resolve & flapping guards** — handle alerts that self-resolve in N minutes (delay the page) and flapping alerts (rate-limit / require sustained condition) to kill noise without hiding real issues.

5. **Safety nets** — ensure unacknowledged critical pages always escalate and never silently die. Add a heartbeat/dead-man's-switch check so a broken alerting pipeline itself pages someone.

6. **Measure** — define metrics to validate the change: pages-per-shift, night-page rate, false-page %, and time-to-acknowledge for true SEV1s. Propose a 2-week before/after comparison.

Output as: (a) the page-worthiness routing rules, (b) the per-tier escalation policy spec, (c) suppression/flapping config, (d) the validation metric plan and rollout.

Bias toward: protecting overnight sleep aggressively, never letting a SEV1 go unnoticed, and making every page actionable.
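To give a sense of what step 4 of the prompt is asking for, here is a minimal Python sketch of an auto-resolve and flapping guard. It is not PagerDuty or Opsgenie configuration: the `PageGuard` class, its field names, and the threshold values are all hypothetical, chosen only to show the logic. The sketch assumes an alert is evaluated periodically and that a sustained condition should always win over the flapping rule, so a real issue is never hidden.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PageGuard:
    """Hypothetical guard deciding what to do with one alert's evaluations.

    Thresholds are illustrative assumptions, not recommended values.
    """
    sustain_seconds: float = 300.0       # must fire this long before paging
    flap_window_seconds: float = 3600.0  # look-back window for state changes
    max_transitions: int = 3             # more state changes than this = flapping
    firing_since: Optional[float] = field(default=None, init=False)
    transitions: deque = field(default_factory=deque, init=False)

    def observe(self, firing: bool, now: float) -> str:
        """Feed one evaluation; return 'ok', 'hold', 'ticket', or 'page'."""
        was_firing = self.firing_since is not None
        if firing != was_firing:
            # Record every firing <-> resolved state change.
            self.transitions.append(now)
            self.firing_since = now if firing else None
        # Forget state changes that fell out of the look-back window.
        while self.transitions and now - self.transitions[0] > self.flap_window_seconds:
            self.transitions.popleft()
        if not firing:
            return "ok"
        # A sustained condition pages even if the alert flapped earlier.
        if now - self.firing_since >= self.sustain_seconds:
            return "page"
        # Not yet sustained and changing state often: ticket it instead.
        if len(self.transitions) > self.max_transitions:
            return "ticket"
        # Not yet sustained: delay the page in case it self-resolves.
        return "hold"


guard = PageGuard()
for t, firing in [(0, True), (60, False), (120, True), (180, False), (240, True)]:
    decision = guard.observe(firing, t)
print(decision)                   # decision for the flapping alert at t=240
print(guard.observe(True, 540))   # same alert after firing continuously for 300 s
```

Checking the sustained condition before the flapping rule is a deliberate choice in this sketch: a noisy alert that finally stays red still reaches a human.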
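The validation metrics named in step 6 of the prompt can be computed from a simple export of page records. The following Python sketch shows one way to do that for a before or after period. The `Page` record, the `summarize` function, the 22:00 to 06:00 definition of "night", and the use of a median for time-to-acknowledge are assumptions made for illustration; a real export from a paging tool will have different field names.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Page:
    """Hypothetical page record; real exports will use different fields."""
    fired_at: datetime
    acked_at: Optional[datetime]  # None if nobody acknowledged
    severity: str                 # e.g. "SEV1"
    actionable: bool              # on-call's judgement after the fact


def is_night(ts: datetime, start_hour: int = 22, end_hour: int = 6) -> bool:
    # Assumed overnight window: 22:00 up to (not including) 06:00 local time.
    return ts.hour >= start_hour or ts.hour < end_hour


def summarize(pages: list, shifts: int) -> dict:
    """Compute the four validation metrics for one comparison period."""
    total = len(pages)
    night = sum(is_night(p.fired_at) for p in pages)
    false_pages = sum(not p.actionable for p in pages)
    # Time-to-acknowledge, in seconds, for acknowledged true SEV1s only.
    sev1_tta = [
        (p.acked_at - p.fired_at).total_seconds()
        for p in pages
        if p.severity == "SEV1" and p.actionable and p.acked_at is not None
    ]
    return {
        "pages_per_shift": total / shifts,
        "night_page_rate": night / total if total else 0.0,
        "false_page_pct": 100.0 * false_pages / total if total else 0.0,
        "sev1_median_tta_s": median(sev1_tta) if sev1_tta else None,
        # Unacknowledged SEV1s are counted separately so they are never averaged away.
        "sev1_unacked": sum(p.severity == "SEV1" and p.acked_at is None for p in pages),
    }


sample = [
    Page(datetime(2024, 1, 1, 3, 0), datetime(2024, 1, 1, 3, 4), "SEV1", True),
    Page(datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 10), "SEV3", False),
    Page(datetime(2024, 1, 2, 23, 30), datetime(2024, 1, 2, 23, 32), "SEV1", True),
    Page(datetime(2024, 1, 3, 10, 0), None, "SEV2", False),
]
print(summarize(sample, shifts=2))
```

Running `summarize` once on the two weeks before the change and once on the two weeks after gives the before/after comparison the prompt asks for.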
