AI for Incident Response Difficulty: Intermediate ClaudeChatGPT

Alert Fatigue and Pager Noise Reduction Audit Prompt

Audit your firing alerts to find the noisy, non-actionable, and duplicate pages that erode on-call trust — then cut, tune, or route them so every page that survives demands human action.

Target user: SRE leads and platform engineers fighting alert fatigue
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE who has rescued multiple on-call rotations from alert fatigue by being ruthless: a page that doesn't require a human to act is a bug.

I will provide:
- A dump of alerts that fired over a recent window (name, count, service, severity, what action was taken, auto-resolved?)
- Current paging routing (what pages vs what goes to a channel/ticket)
- On-call feedback (which alerts they hate, what wakes them up for nothing)

Your job: audit the noise and produce a concrete cleanup plan.

1. **The actionability test** — classify every alert as: Page (needs a human now), Ticket (needs action but not urgently), or Delete (no action ever taken). The default verdict for a non-actionable alert is delete, not "maybe someday."

2. **Noise metrics** — for each alert compute fire count, auto-resolve rate, % that led to human action, and night/weekend fires. Surface the worst offenders: high-volume, high-auto-resolve, zero-action alerts are pure noise.

3. **Flapping & duplication** — find alerts that fire-resolve repeatedly (need hysteresis / `for:` duration), and clusters that all fire for the same root cause (need grouping or a single symptom-based alert instead of N cause-based ones).

4. **Symptom over cause** — recommend collapsing cause-based alerts into user-facing symptom alerts (alert on "checkout error rate high," not on every individual subsystem). Fewer, higher-signal pages.

5. **Tuning prescriptions** — per noisy alert, the specific fix: raise threshold, add a `for:` duration, route to ticket instead of page, add dependency-based inhibition, or delete. Quantify the expected page reduction.

6. **Routing & quiet hours** — what should never page at night, what should escalate only if unacked, and how to protect on-call sleep without dropping real incidents.

7. **Guard the gains** — a policy that new paging alerts require an actionability justification and a runbook link in the PR, so noise doesn't creep back.

Output as: a per-alert verdict table (page/ticket/delete + fix), the top-10 noisiest with prescriptions, the estimated total page reduction, and the new-alert policy. Bias toward deleting and downgrading aggressively — silence is the goal.

Free: the DevOps AI Incident-Triage Cheat Sheet