Alertmanager Routing, Grouping & Receivers Prompt

Design Alertmanager routes — receivers (Slack, PagerDuty), grouping, inhibition, repeat intervals, mute timings.

Target user
SREs configuring Alertmanager
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior SRE who has designed Alertmanager routing for production incidents. You know how to balance noise (group properly), urgency (route to the right channel), and recovery (timely resolve notifications).

I will provide:
- Current Alertmanager config
- Symptom (too noisy, missed alerts, wrong recipient, no resolve)
- Alert volume / team structure

Your job:

1. **Receivers**:
   - Slack — channel + webhook
   - PagerDuty — integration key
   - Email
   - Webhook for custom
   - OpsGenie, VictorOps, etc.
2. **Routing tree**:
   - Top-level matches alerts (e.g., severity)
   - Child routes for finer matching
   - `continue: true` to keep evaluating later sibling routes after a match
   - Default fallback at root
3. **Grouping**:
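   - `group_by` labels combine related alerts into a single notification
   - `group_wait`: initial delay before the first notification, to gather more alerts
   - `group_interval`: how long to wait before notifying about new alerts added to an already-notified group
   - `repeat_interval`: how long to wait before re-sending a notification for alerts that are still firing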
4. **Inhibition**:
   - "If alert X firing, suppress alert Y"
   - Useful for cascading failures
   - Example: node down → suppress per-pod alerts
5. **For mute timings** (newer):
   - Time-based suppression
   - E.g., mute non-critical during deploys
6. **For severity tiers**:
   - critical → PagerDuty
   - warning → Slack
   - info → email digest
7. **For team-based routing**:
   - `team` label on alerts
   - Route per team to their channel
8. **For multi-tenant**:
   - tenant label
   - Separate Alertmanager instances or careful routing

Mark DESTRUCTIVE: catch-all route to wrong channel (mass missed), repeat_interval too short (alert spam), inhibition rule that suppresses everything.

---

Current config:
```yaml
[PASTE]
```
Symptom: [DESCRIBE]
Team structure: [DESCRIBE]

Why this prompt works

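Alertmanager routing is config-heavy: receivers, the routing tree, grouping timers, inhibition rules, and mute timings all interact. This prompt walks that structure in order, so a symptom can be traced to the part of the config that causes it.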

How to use it

  1. Start with severity routing.
  2. Add team-based routes.
  3. Apply inhibition for cascades.
  4. Tune grouping for clarity.
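
As a rough sketch of how the four steps can layer into one config (the receiver names and label values are hypothetical placeholders, and the receiver definitions are omitted):

```yaml
# Sketch only: receiver names and label values are hypothetical placeholders.
route:
  receiver: fallback                 # root catch-all for anything no child route matches
  group_by: [alertname, cluster]
  routes:
    # Steps 1 and 2: match on team first, then split by severity inside the team.
    - matchers:
        - team = "payments"
      receiver: slack-payments       # warnings and info for this team
      routes:
        - matchers:
            - severity = "critical"
          receiver: pagerduty-payments
          group_wait: 10s            # step 4: notify faster for critical alerts

# Step 3: while a critical alert fires, suppress the matching warning.
inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: [alertname, cluster]
```

Nesting severity under team is one possible layout; it keeps each team's paging and chat routes together. Keep the `equal` list to labels that both alerts actually carry, since labels missing from both the source and target alert are treated as equal.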

Useful commands

# Check config syntax
amtool check-config /etc/alertmanager/alertmanager.yml

# List silences
amtool silence query

# Create silence
amtool silence add alertname=HighCPU --duration=2h --comment="planned"

# View current routing
amtool config routes show

# Test routing for an alert label set
amtool config routes test severity=critical team=payments

# View Alertmanager status and active alerts (via the v2 API)
curl -s http://alertmanager:9093/api/v2/status | jq
curl -s http://alertmanager:9093/api/v2/alerts | jq
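
The `amtool` commands that query a running instance need to know where Alertmanager lives. A minimal sketch of an amtool config file, assuming the same placeholder host as the curl examples and the default per-user path `~/.config/amtool/config.yml`:

```yaml
# ~/.config/amtool/config.yml (sketch; the URL is a placeholder)
alertmanager.url: "http://alertmanager:9093"
# Reject silences that are created without a comment
comment_required: true
# Default output format for query commands
output: extended
```

Without this file, the same URL can be passed on each call with `--alertmanager.url`.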

Routing example

route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'fallback'

  routes:
  # Critical → PagerDuty
  - matchers:
    - severity = critical
    receiver: 'pagerduty-platform'
    group_wait: 10s         # faster for critical
    continue: true          # also send to Slack

  # Per-team Slack
  - matchers:
    - team = payments
    receiver: 'slack-payments'

  - matchers:
    - team = platform
    receiver: 'slack-platform'

  # Maintenance / info
  - matchers:
    - severity = info
    receiver: 'email-digest'
    repeat_interval: 24h

receivers:
- name: 'fallback'
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXX
    channel: '#alerts'
    title: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.cluster }}'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

- name: 'pagerduty-platform'
  pagerduty_configs:
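  # Placeholder: Alertmanager does not expand environment variables in its config,
  # so substitute the key at deploy time or use service_key_file.
  # service_key is the Events API v1 key; use routing_key for Events API v2.
  - service_key: $PAGERDUTY_KEY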
    description: '{{ .GroupLabels.alertname }} ({{ .GroupLabels.cluster }})'
    details:
      runbook: '{{ .CommonAnnotations.runbook }}'

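- name: 'slack-payments'
  slack_configs:
  - api_url: https://hooks.slack.com/services/PAYMENTS
    channel: '#alerts-payments'

# Every receiver named in a route must be defined, or the config fails validation.
- name: 'slack-platform'
  slack_configs:
  - api_url: https://hooks.slack.com/services/PLATFORM   # placeholder URL
    channel: '#alerts-platform'

- name: 'email-digest'
  email_configs:
  - to: 'alerts@example.com'   # placeholder; assumes SMTP settings under global: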

inhibit_rules:
- source_matchers: [alertname = NodeDown]
  target_matchers: [alertname = PodNotReady]
  equal: [cluster, instance]

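# Top-level mute_time_intervals is deprecated in favor of time_intervals (Alertmanager 0.24+).
# An interval only takes effect once a route references it, e.g. mute_time_intervals: [business-hours].
time_intervals:
- name: business-hours
  time_intervals:
  - times:
    - start_time: '09:00'
      end_time: '18:00'
    weekdays: ['monday:friday']
    location: America/New_York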
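
A named interval does nothing until a route references it. The sketch below shows one way to wire that up, assuming Alertmanager 0.24 or later (for the top-level `time_intervals` key; `location` needs a slightly newer release). The `off-hours` name and the choice to mute info alerts outside working hours are illustrative assumptions:

```yaml
# Sketch: notifications are held DURING the listed times, so the interval names off-hours.
route:
  receiver: fallback
  routes:
    - matchers:
        - severity = "info"
      receiver: email-digest
      mute_time_intervals: [off-hours]   # hypothetical interval defined below

time_intervals:
  - name: off-hours
    time_intervals:
      # All day on weekends
      - weekdays: ['saturday', 'sunday']
        location: 'America/New_York'
      # Evenings and early mornings on weekdays
      - times:
          - start_time: '18:00'
            end_time: '24:00'
          - start_time: '00:00'
            end_time: '09:00'
        weekdays: ['monday:friday']
        location: 'America/New_York'
```

Mute timings can only be attached to child routes, not the root route. To express "only notify during these times" instead, a route can use `active_time_intervals`.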

Common findings this catches

  • Same alert spamming Slack → repeat_interval too short.
  • Wrong team paged → catch-all matched first; reorder.
  • Cascading alerts each page → inhibition rule needed.
  • Critical alert grouped with warnings → separate routes.
  • No resolve notification → send_resolved: true on receiver.
  • Test alert silenced incorrectly → mute timing too broad.
  • PagerDuty rate-limit → group + dedupe before send.
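
Two of these findings map to one-line fixes. A minimal sketch, assuming a Slack receiver like the one in the routing example (the webhook URL, channel, and the 4h value are placeholders to tune):

```yaml
# Sketch: webhook URL, channel, and interval values are placeholders.
route:
  receiver: slack-payments
  group_by: [alertname, cluster, service]
  group_interval: 5m
  repeat_interval: 4h          # longer interval means fewer repeats for a still-firing alert

receivers:
  - name: slack-payments
    slack_configs:
      - api_url: https://hooks.slack.com/services/PAYMENTS
        channel: '#alerts-payments'
        send_resolved: true    # Slack receivers default to false, so resolves are silent unless set
```

PagerDuty receivers default to `send_resolved: true`, so a missing resolve there usually points elsewhere.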

When to escalate

  • Major incident response design: needs coordination across teams, not just a routing change.
  • On-call rotation integration: owned by the ops team.
  • Multi-channel routing (PD + Slack + ticket): a strategic decision, not a config tweak.
