Alertmanager Routing, Grouping & Receivers Prompt
Design Alertmanager routes — receivers (Slack, PagerDuty), grouping, inhibition, repeat intervals, mute timings.
- Target user: SREs configuring Alertmanager
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has designed Alertmanager routing for production incidents. You know how to balance noise (group properly), urgency (route to right channel), and recovery (timely resolve notifications).

I will provide:
- Current Alertmanager config
- Symptom (too noisy, missed alerts, wrong recipient, no resolve)
- Alert volume / team structure

Your job:

1. **Receivers**:
   - Slack: channel + webhook
   - PagerDuty: integration key
   - Email
   - Webhook for custom
   - OpsGenie, VictorOps, etc.
2. **Routing tree**:
   - Top-level matches alerts (e.g., severity)
   - Child routes for finer matching
   - `continue: true` for matching multiple
   - Default fallback at root
3. **Grouping**:
   - `group_by` labels combine related alerts into single notification
   - `group_wait`: initial delay to gather more
   - `group_interval`: within group, how often to update
   - `repeat_interval`: re-notify if unresolved
4. **Inhibition**:
   - "If alert X firing, suppress alert Y"
   - Useful for cascading failures
   - Example: node down → suppress per-pod alerts
5. **For mute timings** (newer):
   - Time-based suppression
   - E.g., mute non-critical during deploys
6. **For severity tiers**:
   - critical → PagerDuty
   - warning → Slack
   - info → email digest
7. **For team-based routing**:
   - `team` label on alerts
   - Route per team to their channel
8. **For multi-tenant**:
   - tenant label
   - Separate Alertmanager instances or careful routing

Mark DESTRUCTIVE: catch-all route to wrong channel (mass missed), repeat_interval too short (alert spam), inhibition rule that suppresses everything.

---

Current config:
```yaml
[PASTE]
```

Symptom: [DESCRIBE]

Team structure: [DESCRIBE]
Why this prompt works
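Alertmanager routing is config-heavy: receivers, the routing tree, grouping timers, inhibition rules, and mute timings all interact. This prompt walks the model through each of those pieces in order, so the answer covers the whole config rather than one setting in isolation.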
How to use it
- Start with severity routing.
- Add team-based routes.
- Apply inhibition for cascades.
- Tune grouping for clarity.
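The first two steps can be layered by nesting: match on `team` first, then split by severity inside each team. The following is a minimal sketch rather than a recommended layout; the receiver names (`fallback`, `slack-payments`, `pagerduty-payments`) are hypothetical and would each need an entry under `receivers`.

```yaml
route:
  receiver: 'fallback'                   # root: catches alerts that match no child route
  group_by: [alertname, cluster]
  routes:
    - matchers:
        - team = payments
      receiver: 'slack-payments'         # used when no child route below matches
      routes:
        - matchers:
            - severity = critical
          receiver: 'pagerduty-payments'
          group_wait: 10s                # overrides the value inherited from the parent
```

In this layout an alert reaches a parent's receiver only when none of that parent's children match, so a critical payments alert goes to PagerDuty alone. If it should also reach Slack, one option is a second child route with the same matcher and `continue: true` on the first.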
Useful commands
# Check config syntax
amtool check-config /etc/alertmanager/alertmanager.yml
# List silences
amtool silence query
# Create silence
amtool silence add alertname=HighCPU --duration=2h --comment="planned"
# View current routing
amtool config routes show
# Test routing for an alert label set
amtool config routes test severity=critical team=payments
# View Alertmanager status (via API)
curl http://alertmanager:9093/api/v2/status
curl http://alertmanager:9093/api/v2/alerts | jq
Routing example
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'fallback'
  routes:
    # Critical → PagerDuty
    - matchers:
        - severity = critical
      receiver: 'pagerduty-platform'
      group_wait: 10s   # faster for critical
      continue: true    # keep evaluating sibling routes so the team's Slack route also fires
    # Maintenance / info (listed before the team routes, because the first match wins)
    - matchers:
        - severity = info
      receiver: 'email-digest'
      repeat_interval: 24h
    # Per-team Slack
    - matchers:
        - team = payments
      receiver: 'slack-payments'
    - matchers:
        - team = platform
      receiver: 'slack-platform'

receivers:
  - name: 'fallback'
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.cluster }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
  - name: 'pagerduty-platform'
    pagerduty_configs:
      # Alertmanager does not expand environment variables in its config file.
      # Substitute the key at deploy time (or, on versions that support it,
      # point service_key_file at a mounted secret).
      - service_key: '<pagerduty-integration-key>'
        description: '{{ .GroupLabels.alertname }} ({{ .GroupLabels.cluster }})'
        details:
          runbook: '{{ .CommonAnnotations.runbook }}'
  - name: 'slack-payments'
    slack_configs:
      - api_url: https://hooks.slack.com/services/PAYMENTS
        channel: '#alerts-payments'
  # 'slack-platform' and 'email-digest' are referenced by the routes above and
  # must be defined here as well; they are omitted for brevity.

inhibit_rules:
  - source_matchers: [alertname = NodeDown]
    target_matchers: [alertname = PodNotReady]
    equal: [cluster, instance]   # both alerts must carry these labels with identical values

# Recent Alertmanager releases deprecate this top-level key in favor of
# time_intervals; the entries have the same shape.
mute_time_intervals:
  - name: business-hours
    time_intervals:
      - times:
          - start_time: '09:00'
            end_time: '18:00'
        weekdays: ['monday:friday']
        location: America/New_York
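An interval defined at the top level has no effect until a route references it by name. A minimal sketch of that wiring follows, assuming the `business-hours` interval and the `fallback` and `email-digest` receivers from the example; which route to mute is an illustrative choice, not a recommendation.

```yaml
route:
  receiver: 'fallback'
  routes:
    - matchers:
        - severity = info
      receiver: 'email-digest'
      # Hold back notifications for this route while any listed interval is active.
      mute_time_intervals:
        - business-hours
      # The inverse also exists: notify only while a listed interval is active.
      # active_time_intervals:
      #   - business-hours
```

A muted route still takes part in matching, so an alert that lands on it stops there unless `continue: true` is set. The root route cannot carry mute timings, so they belong on child routes.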
Common findings this catches
- Same alert spamming Slack → repeat_interval too short.
- Wrong team paged → catch-all matched first; reorder.
- Cascading alerts each page separately → inhibition rule needed.
- Critical alert grouped with warnings → separate routes.
- No resolve notification → set `send_resolved: true` on the receiver.
- Test alert silenced incorrectly → mute timing too broad.
- PagerDuty rate-limit → group + dedupe before send.
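Several of these findings come down to a few lines of config. The fragment below is a hedged sketch of three such fixes; the receiver name, webhook URL, channel, and interval values are placeholders to adapt, not recommended settings.

```yaml
route:
  receiver: 'slack-alerts'                    # hypothetical receiver name
  group_by: [alertname, cluster, severity]    # severity in group_by keeps critical and warning in separate groups
  repeat_interval: 4h                         # illustrative; lengthen if the same alert keeps re-posting

receivers:
  - name: 'slack-alerts'
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook
        channel: '#alerts'
        send_resolved: true                   # Slack receivers do not send resolve notifications by default

inhibit_rules:
  # While a critical alert fires, suppress the matching warning for the same alert and cluster.
  - source_matchers: [severity = critical]
    target_matchers: [severity = warning]
    equal: [alertname, cluster]
```

Keeping the `equal` list narrow matters: an inhibition rule with no shared labels can silence far more than intended, which is the "suppresses everything" case the prompt flags as destructive.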
When to escalate
- Major incident response design: needs coordination across teams, not just a config change.
- On-call rotation integration: owned by the ops team.
- Multi-channel routing (PD + Slack + ticket): a strategic decision about how incidents flow.
Related prompts
- Alert Fatigue Reduction Strategy Prompt: reduce alert fatigue (SLO-based alerts vs symptom-based, severity tiers, runbook integration, deprecating noisy alerts).
- Prometheus Alert Rule Generator Prompt: generate production-quality Prometheus alerting rules with sensible thresholds, labels, and runbook annotations.
- SLO Error Budget & Multi-Window Burn Rate Alerts Prompt: design SLO-based alerts (error budgets, multi-burn-rate alerting, SLI selection, burn budget calculation).