Alert Fatigue Reduction Strategy Prompt

Reduce alert fatigue: symptom-based and SLO-based alerts over cause-based ones, severity tiers, runbook integration, and a process for deprecating noisy alerts.

Target user
SRE leads and on-call coordinators
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior SRE lead who has cut alert volume by 80% on production teams while improving incident response. You know that volume isn't the problem — actionability is.

I will provide:
- Current alert inventory (count, severity, channels)
- Recent on-call experience (false positives, missed incidents)
- Service SLOs (if any)

Your job:

1. **Categorize alerts**:
   - **Symptom-based** — user-visible (latency, error rate)
   - **Cause-based** — internal (CPU, disk full) → often noise
   - SLO-based alerts are symptom-based and fire on error-budget burn rate rather than a fixed threshold
2. **Audit each alert**:
   - When firing, what's the next action?
   - If no action: candidate for removal
   - If always same action: automate
3. **For severity tiers**:
   - **critical** — page someone NOW
   - **warning** — ticket / next business day
   - **info** — log / digest
   - Most alerts shouldn't be critical
4. **SLO-based alerts**:
   - Multi-window burn rate (fast + slow)
   - Fewer alerts, more meaningful
   - See `slo-error-budget-multiwindow-burn`
5. **For runbook integration**:
   - Every alert has a runbook URL
   - Clear "what to do" reduces decision fatigue
6. **For deprecation process**:
   - Track alert silence rate
   - Alerts always silenced = candidate for removal
   - Quarterly review
7. **For dedup / grouping**:
   - Same root cause = one notification
   - Inhibition for cascades
8. **For escalation**:
   - Acknowledged but unresolved → escalate after N min

Mark DESTRUCTIVE: removing alerts without team review (loses signal), increasing thresholds too far (misses real issues), reducing severity universally (incident response slower).

---

Alert inventory: [DESCRIBE]
On-call experience: [DESCRIBE]
SLOs (if any): [DESCRIBE]

Why this prompt works

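Alert fatigue is solvable, but only with a repeatable process. This prompt walks through that process: categorize and audit every alert, assign severity tiers, move to SLO-based alerting, attach runbooks, and retire noisy alerts, with destructive changes flagged for team review.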

How to use it

  1. Audit existing alerts.
  2. Define SLOs for symptom-based alerts.
  3. Assign severity tiers deliberately.
  4. Link a runbook to every alert.

Useful commands

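# Alert inventory (a count above 1 means the same alert name is defined more than once)
curl -s http://prometheus:9090/api/v1/rules | \
    jq -r '.data.groups[].rules[] | select(.type=="alerting") | "\(.labels.severity // "?") \(.name)"' | \
    sort | uniq -c | sort -nr

# Alerts that are currently not active (silenced or inhibited).
# Alertmanager keeps no history of resolved alerts; use the ALERTS query below for that.
curl -s 'http://alertmanager:9093/api/v2/alerts?active=false' | \
    jq -r '.[] | "\(.fingerprint) \(.labels.alertname) \(.labels.severity) \(.startsAt)"' | \
    head -50

# Active silences (a rough proxy for silence rate)
curl -s http://alertmanager:9093/api/v2/silences | \
    jq '[.[] | select(.status.state=="active")] | length'

# Per-alert firing time over the last 7 days
# (PromQL, not shell: run it in the Prometheus expression browser.
# It counts evaluation samples spent firing, not distinct firings.)
sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d]))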
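
A count of silences does not show which alerts are being silenced. The sketch below is one way to break that down per alert name. It assumes the JSON from `/api/v2/alerts` has already been fetched and parsed, and that each alert carries `labels.alertname` and `status.silencedBy`. The function name `silenced_share` and the sample data are hypothetical.

```python
from collections import Counter


def silenced_share(alerts: list[dict]) -> dict[str, float]:
    """Return, per alertname, the fraction of current alerts that are silenced."""
    total: Counter = Counter()
    silenced: Counter = Counter()
    for alert in alerts:
        name = alert.get("labels", {}).get("alertname", "?")
        total[name] += 1
        # A non-empty silencedBy list means at least one silence matches.
        if alert.get("status", {}).get("silencedBy"):
            silenced[name] += 1
    return {name: silenced[name] / count for name, count in total.items()}


# Hypothetical sample shaped like an Alertmanager v2 alerts response.
sample = [
    {"labels": {"alertname": "HighCPUSustained"}, "status": {"silencedBy": ["abc"]}},
    {"labels": {"alertname": "HighCPUSustained"}, "status": {"silencedBy": ["abc"]}},
    {"labels": {"alertname": "ServiceDown"}, "status": {"silencedBy": []}},
]

shares = silenced_share(sample)
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.0%} silenced")
```

This is a point-in-time snapshot. For a quarterly review it would need to be sampled regularly and the results stored somewhere.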

Categorization framework

For each alert, answer:
1. What does it tell me about user impact?
   - Direct symptom (page)
   - Indirect cause (low priority or remove)
2. What action should I take when it fires?
   - Specific runbook → keep
   - "Look at things" → remove
3. How often does it fire vs warrant action?
   - >50% false positive → tune or remove

SLO-based alert example

groups:
- name: slo-alerts
  rules:
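  # Assumes a 99.9% SLO (error budget 0.001) over 30 days, and that the slo:* recording rules exist.
  # Fast burn: a 14.4x burn rate consumes 2% of the budget in 1h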
  - alert: ErrorBudgetBurnFast
    expr: |
      (
        slo:http_errors:rate5m / slo:http_requests:rate5m > (14.4 * 0.001)
        and
        slo:http_errors:rate1h / slo:http_requests:rate1h > (14.4 * 0.001)
      )
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error budget burning fast"
      runbook: "https://runbooks.example.com/error-budget-fast"

  # Slow burn: a 6x burn rate consumes 5% of the budget in 6h
  - alert: ErrorBudgetBurnSlow
    expr: |
      (
        slo:http_errors:rate30m / slo:http_requests:rate30m > (6 * 0.001)
        and
        slo:http_errors:rate6h / slo:http_requests:rate6h > (6 * 0.001)
      )
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Error budget burning slowly"
      runbook: "https://runbooks.example.com/error-budget-slow"
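
The arithmetic behind the `14.4` and `6` factors can be checked with a short calculation. This sketch assumes a 99.9% SLO over a 30-day window, which is what the `0.001` factor and the 2% and 5% figures in the rules imply. The function names are hypothetical.

```python
import math

SLO_TARGET = 0.999             # assumption: 99.9% availability target
SLO_WINDOW_HOURS = 30 * 24     # assumption: 30-day SLO window (720 hours)
ERROR_BUDGET = 1 - SLO_TARGET  # allowed error ratio, about 0.001


def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Fraction of the total error budget spent if burn_rate holds for window_hours."""
    return burn_rate * window_hours / SLO_WINDOW_HOURS


def error_ratio_threshold(burn_rate: float) -> float:
    """Error ratio at which the alert expression trips for a given burn rate."""
    return burn_rate * ERROR_BUDGET


for label, burn_rate, window_hours in [("fast", 14.4, 1), ("slow", 6, 6)]:
    print(
        f"{label}: threshold={error_ratio_threshold(burn_rate):.4f} "
        f"budget_consumed={budget_consumed(burn_rate, window_hours):.1%} "
        f"exhausted_in={SLO_WINDOW_HOURS / burn_rate:.0f}h"
    )
```

A burn rate of 1 spends the budget exactly over the full window, so a 14.4x burn would exhaust it in about 50 hours and a 6x burn in about 120 hours. For a different SLO target or window, the thresholds in the rules need to be recomputed the same way.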

Tiered severity

# Critical (page)
- alert: ServiceDown
  expr: up{job="critical-service"} == 0
  for: 2m
  labels: { severity: critical }

# Warning (ticket)
- alert: HighCPUSustained
  # Utilization = 1 - idle fraction, averaged across cores per instance
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
  for: 15m
  labels: { severity: warning }

# Info (digest)
- alert: DiskFillingSoon
  expr: predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0
  for: 1h
  labels: { severity: info }

Common findings this catches

  • Critical alerts firing on dev / staging → lower the tier or exclude those environments.
  • Cause-based alerts (CPU, memory) waking on-call → switch to symptom-based.
  • Runbook URL returns 404 → write the runbook or fix the link.
  • Same alert fires on every deploy → predictable; mute during deploy windows.
  • Alerts whose only fix is restarting a service → automate the restart.
  • Alerts acknowledged but never resolved → auto-escalate.
  • Alert with a long history of being silenced → remove after team review.
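
Missing runbook links are among the easier findings to detect automatically. The sketch below assumes the JSON from Prometheus `/api/v1/rules` has already been fetched and parsed, and that runbooks live in a `runbook` annotation as in the SLO example. It also accepts `runbook_url` as an assumed alternative key. The function name and sample data are hypothetical, and it does not check whether the URL actually resolves.

```python
def alerts_missing_runbook(rules_response: dict) -> list[str]:
    """Return names of alerting rules that have no runbook annotation."""
    missing = []
    for group in rules_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules have no annotations, so skip them.
            if rule.get("type") != "alerting":
                continue
            annotations = rule.get("annotations") or {}
            if not (annotations.get("runbook") or annotations.get("runbook_url")):
                missing.append(rule["name"])
    return missing


# Hypothetical sample shaped like a /api/v1/rules response.
sample_rules = {
    "data": {
        "groups": [
            {
                "name": "slo-alerts",
                "rules": [
                    {
                        "type": "alerting",
                        "name": "ErrorBudgetBurnFast",
                        "annotations": {
                            "runbook": "https://runbooks.example.com/error-budget-fast"
                        },
                    },
                    {"type": "alerting", "name": "ServiceDown", "annotations": {}},
                    {"type": "recording", "name": "slo:http_errors:rate5m"},
                ],
            }
        ]
    }
}

missing_runbooks = alerts_missing_runbook(sample_rules)
print(missing_runbooks)
```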

When to escalate

  • Defining SLOs → a strategic decision; involve product and business owners.
  • Overhauling the on-call rotation → a decision for the ops team.
  • Major alert pruning → team review before anything is removed.
