You are a senior SRE lead who has cut alert volume by 80% on production teams while improving incident response. You know that volume isn't the problem; actionability is.

I will provide:
- Current alert inventory (count, severity, channels)
- Recent on-call experience (false positives, missed)
- Service SLOs (if any)

Your job:

1. **Categorize alerts**:
   - **Symptom-based**: user-visible (latency, error rate)
   - **Cause-based**: internal (CPU, disk full) → often noise
   - SLO-based alerts are symptom-based, statistically smart

2. **Audit each alert**:
   - When firing, what's the next action?
   - If no action: candidate for removal
   - If always same action: automate

3. **For severity tiers**:
   - **critical**: page someone NOW
   - **warning**: ticket / next business day
   - **info**: log / digest
   - Most alerts shouldn't be critical

4. **SLO-based alerts**:
   - Multi-window burn rate (fast + slow)
   - Fewer alerts, more meaningful
   - See `slo-error-budget-multiwindow-burn`

5. **For runbook integration**:
   - Every alert has a runbook URL
   - Clear "what to do" reduces decision fatigue

6. **For deprecation process**:
   - Track alert silence rate
   - Alerts always silenced = candidate for removal
   - Quarterly review

7. **For dedup / grouping**:
   - Same root cause = one notification
   - Inhibition for cascades

8. **For escalation**:
   - Acknowledged but unresolved → escalate after N min

Mark DESTRUCTIVE: removing alerts without team review (loses signal), increasing thresholds too far (misses real issues), reducing severity universally (incident response slower).

---

Alert inventory: [DESCRIBE]
On-call experience: [DESCRIBE]
SLOs (if any): [DESCRIBE]

Why this prompt works

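Alert fatigue is solvable, but solving it takes a repeatable process rather than one-off threshold tweaks. This prompt walks through that process: categorize, audit, tier, and prune.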

How to use it

Audit existing alerts.
Define SLOs and build symptom-based alerts on them.
Assign severity tiers deliberately.
Write a runbook for every alert.

Useful commands

# Alert inventory
curl http://prometheus:9090/api/v1/rules | \
    jq -r '.data.groups[].rules[] | select(.type=="alerting") | "\(.labels.severity // "?") \(.name)"' | \
    sort | uniq -c | sort -nr

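# Alerts currently firing (Alertmanager keeps no history of resolved alerts)
curl -s 'http://alertmanager:9093/api/v2/alerts' | \
    jq -r '.[] | "\(.fingerprint) \(.labels.alertname) \(.labels.severity) \(.startsAt)"' | \
    head -50

# Active silences (a rough proxy for silence rate; expired silences are excluded)
curl -s http://alertmanager:9093/api/v2/silences | \
    jq '[.[] | select(.status.state == "active")] | length'

# Per-alert frequency
# (PromQL, run in the Prometheus UI rather than the shell)
# Alerts firing right now:
sum by (alertname)(ALERTS{alertstate="firing"})
# Firing samples per alert over the past week:
sum by (alertname)(count_over_time(ALERTS{alertstate="firing"}[7d]))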
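
The inventory command above can also be approximated in Python when jq is not available. This is a minimal sketch, assuming the response has already been fetched and parsed into a dict; the `sample` payload is hypothetical and trimmed to the fields the function reads. Unlike the shell pipeline, it rolls up by severity only, not by severity and name.

```python
from collections import Counter


def count_alerts_by_severity(rules_response: dict) -> Counter:
    """Count alerting rules per severity label in a /api/v1/rules response."""
    counts = Counter()
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") != "alerting":
                continue  # skip recording rules
            # Rules without a severity label are counted under "?"
            severity = rule.get("labels", {}).get("severity", "?")
            counts[severity] += 1
    return counts


# Hypothetical payload for illustration, not real output from any server.
sample = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "slo-alerts",
                "rules": [
                    {"type": "alerting", "name": "ErrorBudgetBurnFast",
                     "labels": {"severity": "critical"}},
                    {"type": "recording", "name": "slo:http_errors:rate5m"},
                ],
            },
            {
                "name": "infra",
                "rules": [
                    {"type": "alerting", "name": "ServiceDown",
                     "labels": {"severity": "critical"}},
                    {"type": "alerting", "name": "HighCPUSustained",
                     "labels": {"severity": "warning"}},
                    {"type": "alerting", "name": "Unlabeled", "labels": {}},
                ],
            },
        ]
    },
}

for severity, count in count_alerts_by_severity(sample).most_common():
    print(count, severity)
```

A large "?" bucket is itself a finding: those alerts have no tier and cannot be routed deliberately.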

Categorization framework

For each alert, answer:
1. What does it tell me about user impact?
   - Direct symptom (page)
   - Indirect cause (low priority or remove)
2. What action should I take when it fires?
   - Specific runbook → keep
   - "Look at things" → remove
3. How often does it fire vs warrant action?
   - >50% false positive → tune or remove
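
The three questions above can be sketched as a small triage function. This is an illustration only: the `AlertRecord` fields, the verdict strings, and the order in which the questions are checked are assumptions, not part of the original framework.

```python
from dataclasses import dataclass


@dataclass
class AlertRecord:
    """Hypothetical audit record for one alert over a review period."""
    name: str
    user_visible: bool        # question 1: is it a direct symptom?
    has_runbook_action: bool  # question 2: is there a specific action?
    fired: int                # question 3: how many times it fired
    actionable: int           # ...and how many firings warranted action


def triage(alert: AlertRecord) -> str:
    """Return a suggested verdict; the final decision stays with the team."""
    if not alert.has_runbook_action:
        return "remove"  # "look at things" is not an action
    if alert.fired > 0:
        false_positive_rate = (alert.fired - alert.actionable) / alert.fired
        if false_positive_rate > 0.5:
            return "tune or remove"
    if alert.user_visible:
        return "keep (page)"
    return "keep (low priority)"


print(triage(AlertRecord("HighLatency", True, True, fired=10, actionable=9)))
print(triage(AlertRecord("HighCPU", False, True, fired=20, actionable=4)))
print(triage(AlertRecord("SomethingOdd", False, False, fired=3, actionable=0)))
```

Treat the output as input to a team review; removing alerts without one is marked DESTRUCTIVE in the prompt.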

SLO-based alert example

groups:
- name: slo-alerts
  rules:
  # Fast burn: burn rate 14.4 consumes 2% of a 30-day budget in 1h (0.001 = error budget of a 99.9% SLO)
  - alert: ErrorBudgetBurnFast
    expr: |
      (
        slo:http_errors:rate5m / slo:http_requests:rate5m > (14.4 * 0.001)
        and
        slo:http_errors:rate1h / slo:http_requests:rate1h > (14.4 * 0.001)
      )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error budget burning fast"
      runbook: "https://runbooks.example.com/error-budget-fast"

  # Slow burn: burn rate 6 consumes 5% of a 30-day budget in 6h
  - alert: ErrorBudgetBurnSlow
    expr: |
      (
        slo:http_errors:rate30m / slo:http_requests:rate30m > (6 * 0.001)
        and
        slo:http_errors:rate6h / slo:http_requests:rate6h > (6 * 0.001)
      )
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Error budget burning slowly"
      runbook: "https://runbooks.example.com/error-budget-slow"
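
The multipliers in the rules above follow from simple arithmetic, sketched here under two assumptions that the percentages in the rule comments imply: a 30-day SLO period and a 99.9% availability target. The function names are illustrative.

```python
def budget_consumed(burn_rate: float, window_hours: float,
                    slo_period_days: int = 30) -> float:
    """Fraction of the error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / (slo_period_days * 24)


def error_ratio_threshold(burn_rate: float, slo_target: float) -> float:
    """Error ratio that corresponds to a given burn rate for this SLO target."""
    return burn_rate * (1 - slo_target)


# Fast and slow burn pairs used in the rules: (burn rate, long window in hours)
for burn_rate, window_hours in [(14.4, 1), (6, 6)]:
    consumed = budget_consumed(burn_rate, window_hours)
    threshold = error_ratio_threshold(burn_rate, slo_target=0.999)
    print(f"burn rate {burn_rate}: {consumed:.1%} of budget in {window_hours}h, "
          f"alert when error ratio > {threshold:.4f}")
```

For a different SLO target or period, recompute the thresholds instead of reusing 14.4 and 6 as they stand.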

Tiered severity

# Critical (page)
- alert: ServiceDown
  expr: up{job="critical-service"} == 0
  for: 2m
  labels: { severity: critical }

# Warning (ticket)
- alert: HighCPUSustained
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
  for: 15m
  labels: { severity: warning }

# Info (digest)
- alert: DiskFillingSoon
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
  for: 1h
  labels: { severity: info }
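
The three tiers map to three destinations, and the share of critical alerts is worth tracking over time. A minimal sketch, assuming severities come from an inventory like the one built earlier; the `ROUTES` names and the fallback for unknown severities are assumptions.

```python
# Destination per tier, following the tiers above (names are illustrative).
ROUTES = {"critical": "page", "warning": "ticket", "info": "digest"}


def route_for(severity: str) -> str:
    """Pick a destination; unknown severities go to the digest, never a page."""
    return ROUTES.get(severity, "digest")


def critical_share(severities: list) -> float:
    """Fraction of alerts tagged critical (0.0 for an empty inventory)."""
    if not severities:
        return 0.0
    return severities.count("critical") / len(severities)


inventory = ["critical", "warning", "warning", "info", "info", "info", "?", "critical"]
print({severity: route_for(severity) for severity in sorted(set(inventory))})
print(f"critical share: {critical_share(inventory):.0%}")
```

Since most alerts shouldn't be critical, a high share suggests the tiers need another pass.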

Common findings this catches

Critical alerts firing on dev / staging → tier or exclude.
Cause-based alerts (CPU, memory) waking on-call → switch to symptom-based.
Runbook URL 404 → document or update.
Same alert fires every deploy → predictable; mute during deploy windows.
Alerts that need restart of XYZ → automate the restart.
Persistent acknowledgment without resolution → auto-escalate.
Long history of always-silenced alert → remove.
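
The always-silenced finding can be checked mechanically once firing and silenced counts per alert are available. This sketch assumes those counts have been collected elsewhere; the input dicts and the 0.9 threshold are hypothetical.

```python
def removal_candidates(fired: dict, silenced: dict,
                       threshold: float = 0.9) -> list:
    """List alerts whose silence rate (silenced / fired) meets the threshold."""
    candidates = []
    for name, total in fired.items():
        if total == 0:
            continue  # never fired in the period, so no silence rate to compute
        silence_rate = silenced.get(name, 0) / total
        if silence_rate >= threshold:
            candidates.append(name)
    return sorted(candidates)


# Hypothetical counts for one quarter.
fired_counts = {"DiskFillingSoon": 40, "HighCPUSustained": 25, "ServiceDown": 3}
silenced_counts = {"DiskFillingSoon": 40, "HighCPUSustained": 10}

print(removal_candidates(fired_counts, silenced_counts))
```

The result is a shortlist for the quarterly review, not a deletion list.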

When to escalate

SLO definition → product / business (strategic decision).
On-call rotation overhaul → ops team.
Major alert pruning → team review.

Alert Fatigue Reduction Strategy Prompt

Related prompts

Alertmanager Routing, Grouping & Receivers Prompt

Prometheus Alert Rule Generator Prompt

SLO Error Budget & Multi-Window Burn Rate Alerts Prompt
