Alert Fatigue Reduction Strategy Prompt
Reduce alert fatigue: symptom-based and SLO-based alerts vs cause-based alerts, severity tiers, runbook integration, deprecating noisy alerts.
- Target user: SRE leads and on-call coordinators
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE lead who has cut alert volume by 80% on production teams while improving incident response. You know that volume isn't the problem; actionability is.

I will provide:
- Current alert inventory (count, severity, channels)
- Recent on-call experience (false positives, missed)
- Service SLOs (if any)

Your job:

1. **Categorize alerts**:
   - **Symptom-based**: user-visible (latency, error rate)
   - **Cause-based**: internal (CPU, disk full) → often noise
   - SLO-based alerts are symptom-based, statistically smart

2. **Audit each alert**:
   - When firing, what's the next action?
   - If no action: candidate for removal
   - If always same action: automate

3. **For severity tiers**:
   - **critical**: page someone NOW
   - **warning**: ticket / next business day
   - **info**: log / digest
   - Most alerts shouldn't be critical

4. **SLO-based alerts**:
   - Multi-window burn rate (fast + slow)
   - Fewer alerts, more meaningful
   - See `slo-error-budget-multiwindow-burn`

5. **For runbook integration**:
   - Every alert has a runbook URL
   - Clear "what to do" reduces decision fatigue

6. **For deprecation process**:
   - Track alert silence rate
   - Alerts always silenced = candidate for removal
   - Quarterly review

7. **For dedup / grouping**:
   - Same root cause = one notification
   - Inhibition for cascades

8. **For escalation**:
   - Acknowledged but unresolved → escalate after N min

Mark DESTRUCTIVE: removing alerts without team review (loses signal), increasing thresholds too far (misses real issues), reducing severity universally (incident response slower).

---

Alert inventory: [DESCRIBE]
On-call experience: [DESCRIBE]
SLOs (if any): [DESCRIBE]
Why this prompt works
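Alert fatigue is solvable, but only with a repeatable process. This prompt walks through that process: categorize alerts, audit each one for actionability, assign severity tiers, adopt SLO burn-rate alerts, and retire the alerts that stay noisy.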
How to use it
- Audit existing alerts.
- Define SLOs for symptom-based alerts.
- Assign severity tiers deliberately.
- Attach a runbook to every alert.
Useful commands
# Alert inventory
curl -s http://prometheus:9090/api/v1/rules | \
  jq -r '.data.groups[].rules[] | select(.type=="alerting") | "\(.labels.severity // "?") \(.name)"' | \
  sort | uniq -c | sort -nr
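# Alerts currently held by Alertmanager (it keeps no long-term history;
# adding active=false would hide firing alerts and list only suppressed ones)
curl -s 'http://alertmanager:9093/api/v2/alerts' | \
  jq -r '.[] | "\(.fingerprint) \(.labels.alertname) \(.labels.severity) \(.startsAt)"' | \
  head -50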
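# Active silences (a count, usable as a rough proxy for silence rate)
curl -s http://alertmanager:9093/api/v2/silences | \
  jq '[.[] | select(.status.state == "active")] | length'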
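# Per-alert frequency
# (PromQL over the synthetic ALERTS series; counts evaluation samples spent
# firing in the last 7 days, adjust the range to your review period)
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))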
Categorization framework
For each alert, answer:
1. What does it tell me about user impact?
   - Direct symptom (page)
   - Indirect cause (low priority or remove)
2. What action should I take when it fires?
   - Specific runbook → keep
   - "Look at things" → remove
3. How often does it fire vs warrant action?
   - >50% false positive → tune or remove
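The three questions above can be sketched as a small triage function. This is only an illustration: the `AlertRecord` fields, the `triage` name, and the `demote` outcome label are assumptions made for the example, and the 50% cutoff is taken from the framework's third question.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    """One row of a hypothetical alert audit sheet (all fields are assumptions)."""
    name: str
    user_visible: bool        # Q1: does it reflect a direct user-facing symptom?
    has_runbook_action: bool  # Q2: is there a specific action to take?
    fired: int                # Q3: times it fired in the review window
    actionable: int           # Q3: how many of those firings warranted action


def triage(alert: AlertRecord, max_false_positive: float = 0.5) -> str:
    """Return 'keep', 'tune', 'demote', or 'remove' for one alert."""
    if not alert.has_runbook_action:
        return "remove"   # "look at things" is not an action
    if alert.fired > 0:
        false_positive = 1 - alert.actionable / alert.fired
        if false_positive > max_false_positive:
            return "tune"  # actionable but noisy: fix the threshold first
    if not alert.user_visible:
        return "demote"   # cause-based: keep it, but do not page on it
    return "keep"


if __name__ == "__main__":
    sample = AlertRecord("HighCPU", user_visible=False, has_runbook_action=True,
                         fired=10, actionable=8)
    print(sample.name, triage(sample))
```

The order of the checks is a design choice: an alert with no defined action is dropped before its noise level is even considered, since tuning a threshold does not make an alert actionable.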
SLO-based alert example
groups:
  - name: slo-alerts
    rules:
      # Fast burn (consumes 2% of budget in 1h)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            slo:http_errors:rate5m / slo:http_requests:rate5m > (14.4 * 0.001)
            and
            slo:http_errors:rate1h / slo:http_requests:rate1h > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning fast"
          runbook: "https://runbooks.example.com/error-budget-fast"
      # Slow burn (consumes 5% in 6h)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            slo:http_errors:rate30m / slo:http_requests:rate30m > (6 * 0.001)
            and
            slo:http_errors:rate6h / slo:http_requests:rate6h > (6 * 0.001)
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning slowly"
          runbook: "https://runbooks.example.com/error-budget-slow"
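The constants in these rules follow from simple arithmetic, sketched below. The sketch assumes a 99.9% SLO (which is what the `0.001` factor implies) and a 30-day budget period (an assumption; the rules do not state the period). The function names are made up for this example.

```python
def burn_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio at which the budget burns `burn_rate` times faster than allowed."""
    return burn_rate * (1.0 - slo)


def budget_consumed(burn_rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the period's error budget spent if the burn rate holds for the window."""
    return burn_rate * window_hours / (period_days * 24)


if __name__ == "__main__":
    # Fast burn: threshold used in the rule, and budget spent over the 1h window.
    print(round(burn_threshold(0.999, 14.4), 4), round(budget_consumed(14.4, 1), 4))
    # Slow burn: threshold used in the rule, and budget spent over the 6h window.
    print(round(burn_threshold(0.999, 6), 4), round(budget_consumed(6, 6), 4))
```

Under those assumptions a burn rate of 14.4 held for one hour spends about 2% of the budget, and a burn rate of 6 held for six hours spends about 5%, which matches the comments on the two rules. A different SLO target or budget period would need different multipliers.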
Tiered severity
# Critical (page)
- alert: ServiceDown
  expr: up{job="critical-service"} == 0
  for: 2m
  labels: { severity: critical }
# Warning (ticket)
# CPU utilization per instance = 1 - idle fraction
- alert: HighCPUSustained
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
  for: 15m
  labels: { severity: warning }
# Info (digest)
# Linear extrapolation of the last 6h predicts no free space within 24h
- alert: DiskFillingSoon
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
  for: 1h
  labels: { severity: info }
Common findings this catches
- Critical alerts firing on dev / staging → lower the tier or exclude those environments.
- Cause-based alerts (CPU, memory) waking on-call → switch to symptom-based.
- Runbook URL returns 404 → write the runbook or fix the link.
- Same alert fires on every deploy → predictable; mute during deploy windows.
- Alerts whose only fix is restarting a service → automate the restart.
- Alerts acknowledged but never resolved → auto-escalate.
- Alert with a long history of always being silenced → remove.
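The last finding can be checked mechanically once firing history is exported somewhere. The sketch below assumes a hypothetical export of `(alertname, was_silenced)` pairs, one per firing; the 90% threshold and the minimum of five firings are illustrative defaults, not recommendations from this page.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def silence_ratios(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map alert name -> share of its firings that were silenced."""
    fired: Counter[str] = Counter()
    silenced: Counter[str] = Counter()
    for name, was_silenced in events:
        fired[name] += 1
        if was_silenced:
            silenced[name] += 1
    return {name: silenced[name] / fired[name] for name in fired}


def removal_candidates(events: Iterable[tuple[str, bool]],
                       threshold: float = 0.9,
                       min_firings: int = 5) -> list[str]:
    """Alerts silenced at least `threshold` of the time, with enough firings to judge."""
    events = list(events)  # materialize so the data can be scanned twice
    counts = Counter(name for name, _ in events)
    ratios = silence_ratios(events)
    return sorted(name for name, ratio in ratios.items()
                  if counts[name] >= min_firings and ratio >= threshold)


if __name__ == "__main__":
    history = ([("DiskNoise", True)] * 6
               + [("HighLatency", False)] * 5
               + [("Rare", True)] * 2)
    print(removal_candidates(history))
```

The output is a list of candidates for the quarterly review, not a deletion list: removing alerts without team review is one of the actions the prompt marks as destructive.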
When to escalate
- SLO definition → involve product and business stakeholders; it is a strategic decision.
- On-call rotation overhaul → take it to the ops team.
- Major alert pruning → requires team review.
Related prompts
- Alertmanager Routing, Grouping & Receivers Prompt
  Design Alertmanager routes: receivers (Slack, PagerDuty), grouping, inhibition, repeat intervals, mute timings.
- Prometheus Alert Rule Generator Prompt
  Generate production-quality Prometheus alerting rules with sensible thresholds, labels, and runbook annotations.
- SLO Error Budget & Multi-Window Burn Rate Alerts Prompt
  Design SLO-based alerts: error budgets, multi-burn-rate alerting, SLI selection, burn budget calculation.