AI for Prometheus & Monitoring

Grafana OnCall Escalation Chain Design Prompt

Design Grafana OnCall escalation chains, schedules, and routing so the right human is paged within minutes, noise is suppressed, and nobody gets woken up for a warning.

Target user: On-call leads and SREs setting up paging and escalation policies
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an incident-response architect who has designed on-call programs that keep MTTA under 5 minutes without burning out the rotation.

I will provide:
- Team size and time-zone spread
- Current alert volume and severity breakdown
- Existing Alertmanager / Grafana Alerting setup
- Pain points (missed pages, alert fatigue, single-person heroics)

Your job:

1. **Integration & routing** — map Alertmanager (or Grafana Alerting) into Grafana OnCall via the correct integration type. Show the route templates that key off `severity`, `team`, and `service` labels so a payload lands on exactly one escalation chain.

2. **Escalation chain design** — build a tiered chain: notify primary (push + SMS) → wait N minutes → escalate to secondary → wait → page the secondary's phone → finally notify the channel + EM. Justify each timer. Use the "Important" vs "Default" notification policy split so SEV1 skips the gentle steps.

3. **Schedules** — design a follow-the-sun rotation if time zones allow, otherwise a weekly primary/secondary rotation with an explicit handoff time and overrides. Show how to encode this as schedule-as-code (iCal or Terraform) so it can be reviewed.

4. **Noise suppression** — which alerts should NEVER page (route to Slack-only), grouping/dedup so a 50-pod failure is one page, and resolve notifications that auto-close the OnCall alert group.

5. **Acknowledge & resolve loop** — wire ack timeouts (re-escalate if no ack in X min) and auto-resolve when the underlying alert clears, so stale pages don't linger.

6. **Personal notification policies** — sane defaults for new joiners (push first, then call) and quiet-hours handling that still pages for SEV1.

7. **Health checks** — a monthly "did paging actually work" test (heartbeat integration + a deliberate test alert) and the metrics to watch: MTTA, pages-per-person-per-week, % auto-resolved.
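
As an illustration of the three metrics named in step 7, here is a minimal Python sketch of how they might be computed from exported page records. The `Page` record and its field names are hypothetical and do not come from any Grafana OnCall API; adapt them to whatever export you actually have. MTTA is averaged over acknowledged pages only, which is one possible convention.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Page:
    """One paged alert group. All field names are hypothetical."""
    responder: str
    created_at: datetime
    acked_at: Optional[datetime]  # None if nobody acknowledged
    auto_resolved: bool           # True if the source alert cleared it


def mtta(pages: list[Page]) -> Optional[timedelta]:
    """Mean time to acknowledge, over acknowledged pages only."""
    deltas = [p.acked_at - p.created_at for p in pages if p.acked_at is not None]
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def pages_per_person_per_week(pages: list[Page], weeks: float) -> dict[str, float]:
    """Average weekly page count for each responder."""
    counts = Counter(p.responder for p in pages)
    return {person: n / weeks for person, n in counts.items()}


def pct_auto_resolved(pages: list[Page]) -> float:
    """Share of pages that closed because the underlying alert cleared."""
    if not pages:
        return 0.0
    return 100.0 * sum(p.auto_resolved for p in pages) / len(pages)


# Made-up sample data covering a two-week window.
t0 = datetime(2024, 1, 1, 3, 0)
sample = [
    Page("alice", t0, t0 + timedelta(minutes=2), False),
    Page("alice", t0, t0 + timedelta(minutes=6), True),
    Page("bob", t0, None, True),
    Page("bob", t0, t0 + timedelta(minutes=4), False),
]

print(mtta(sample))                                # 0:04:00
print(pages_per_person_per_week(sample, weeks=2))  # {'alice': 1.0, 'bob': 1.0}
print(pct_auto_resolved(sample))                   # 50.0
```

Unacknowledged pages are excluded from MTTA here, so it is worth tracking them separately; a low MTTA alongside many never-acknowledged pages would be misleading.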

Output: the routing/escalation config (UI steps + Terraform where possible), a schedule definition, a page/no-page decision table, and a rollout checklist with a dry-run plan.
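
To make the "exactly one escalation chain" and "one page for a 50-pod failure" goals concrete, the sketch below models first-match routing and a grouping key in plain Python. It is not Grafana OnCall configuration: the route names, the label values, and the choice that unmatched alerts go to a non-paging catch-all are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Labels = dict[str, str]


@dataclass
class Route:
    name: str                          # hypothetical escalation chain name
    matches: Callable[[Labels], bool]  # stands in for a routing template
    pages: bool                        # False = chat-only, nobody is paged


# Evaluated top to bottom; the first match wins, so order is the contract.
ROUTES = [
    Route("payments-sev1",
          lambda lb: lb.get("severity") == "critical" and lb.get("team") == "payments",
          True),
    Route("generic-sev1", lambda lb: lb.get("severity") == "critical", True),
    Route("slack-only", lambda lb: lb.get("severity") in {"warning", "info"}, False),
]
# Assumed catch-all so every payload lands somewhere, without paging.
DEFAULT = Route("default-triage", lambda lb: True, False)


def route(labels: Labels) -> Route:
    """Return the single route that handles this alert."""
    for r in ROUTES:
        if r.matches(labels):
            return r
    return DEFAULT


def group_key(labels: Labels) -> tuple[str, str, str]:
    """Ignore per-pod labels so many failing pods share one key."""
    return (labels.get("team", ""), labels.get("service", ""), labels.get("alertname", ""))


# Fifty pods of one service failing with the same alert name.
alerts = [
    {"severity": "critical", "team": "payments", "service": "api",
     "alertname": "PodCrashLoop", "pod": f"api-{i}"}
    for i in range(50)
]

print(route(alerts[0]).name)                # payments-sev1
print(len({group_key(a) for a in alerts}))  # 1
print(route({"severity": "warning"}).pages) # False
```

Listing the routes with their `pages` flag gives a first draft of the page/no-page decision table: each row is a label condition, the chain it maps to, and whether a human is paged.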
