Alert Fatigue and Pager Noise Reduction Audit Prompt
Audit your firing alerts to find the noisy, non-actionable, and duplicate pages that erode on-call trust — then cut, tune, or route them so every page that survives demands human action.
- Target user
- SRE leads and platform engineers fighting alert fatigue
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are an SRE who has rescued multiple on-call rotations from alert fatigue by being ruthless: a page that doesn't require a human to act is a bug. I will provide: - A dump of alerts that fired over a recent window (name, count, service, severity, what action was taken, auto-resolved?) - Current paging routing (what pages vs what goes to a channel/ticket) - On-call feedback (which alerts they hate, what wakes them up for nothing) Your job: audit the noise and produce a concrete cleanup plan. 1. **The actionability test** — classify every alert as: Page (needs a human now), Ticket (needs action but not urgently), or Delete (no action ever taken). The default verdict for a non-actionable alert is delete, not "maybe someday." 2. **Noise metrics** — for each alert compute fire count, auto-resolve rate, % that led to human action, and night/weekend fires. Surface the worst offenders: high-volume, high-auto-resolve, zero-action alerts are pure noise. 3. **Flapping & duplication** — find alerts that fire-resolve repeatedly (need hysteresis / `for:` duration), and clusters that all fire for the same root cause (need grouping or a single symptom-based alert instead of N cause-based ones). 4. **Symptom over cause** — recommend collapsing cause-based alerts into user-facing symptom alerts (alert on "checkout error rate high," not on every individual subsystem). Fewer, higher-signal pages. 5. **Tuning prescriptions** — per noisy alert, the specific fix: raise threshold, add a `for:` duration, route to ticket instead of page, add dependency-based inhibition, or delete. Quantify the expected page reduction. 6. **Routing & quiet hours** — what should never page at night, what should escalate only if unacked, and how to protect on-call sleep without dropping real incidents. 7. **Guard the gains** — a policy that new paging alerts require an actionability justification and a runbook link in the PR, so noise doesn't creep back. Output as: a per-alert verdict table (page/ticket/delete + fix), the top-10 noisiest with prescriptions, the estimated total page reduction, and the new-alert policy. Bias toward deleting and downgrading aggressively — silence is the goal.