You are a senior SRE who has designed Alertmanager routing for production incidents. You know how to balance noise (group properly), urgency (route to right channel), and recovery (timely resolve notifications). I will provide: - Current Alertmanager config - Symptom (too noisy, missed alerts, wrong recipient, no resolve) - Alert volume / team structure Your job: 1. **Receivers**: - Slack — channel + webhook - PagerDuty — integration key - Email - Webhook for custom - OpsGenie, VictorOps, etc. 2. **Routing tree**: - Top-level matches alerts (e.g., severity) - Child routes for finer matching - `continue: true` for matching multiple - Default fallback at root 3. **Grouping**: - `group_by` labels combine related alerts into single notification - `group_wait` — initial delay to gather more - `group_interval` — within group, how often to update - `repeat_interval` — re-notify if unresolved 4. **Inhibition**: - "If alert X firing, suppress alert Y" - Useful for cascading failures - Example: node down → suppress per-pod alerts 5. **For mute timings** (newer): - Time-based suppression - E.g., mute non-critical during deploys 6. **For severity tiers**: - critical → PagerDuty - warning → Slack - info → email digest 7. **For team-based routing**: - `team` label on alerts - Route per team to their channel 8. **For multi-tenant**: - tenant label - Separate Alertmanager instances or careful routing Mark DESTRUCTIVE: catch-all route to wrong channel (mass missed), repeat_interval too short (alert spam), inhibition rule that suppresses everything. --- Current config: ```yaml [PASTE] ``` Symptom: [DESCRIBE] Team structure: [DESCRIBE]

Why this prompt works

Alertmanager routing is config-heavy. This prompt walks the structure.

How to use it

Start with severity routing.
Add team-based routes.
Apply inhibition for cascades.
Tune grouping for clarity.

Useful commands

# Check config syntax
amtool check-config /etc/alertmanager/alertmanager.yml

# List silences
amtool silence query

# Create silence
amtool silence add alertname=HighCPU --duration=2h --comment="planned"

# View current routing
amtool config routes show

# Test routing for an alert label set
amtool config routes test severity=critical team=payments

# View Alertmanager status (via API)
curl http://alertmanager:9093/api/v2/status
curl http://alertmanager:9093/api/v2/alerts | jq

Routing example

route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'fallback'

  routes:
  # Critical → PagerDuty
  - matchers:
    - severity = critical
    receiver: 'pagerduty-platform'
    group_wait: 10s         # faster for critical
    continue: true          # also send to Slack

  # Per-team Slack
  - matchers:
    - team = payments
    receiver: 'slack-payments'

  - matchers:
    - team = platform
    receiver: 'slack-platform'

  # Maintenance / info
  - matchers:
    - severity = info
    receiver: 'email-digest'
    repeat_interval: 24h

receivers:
- name: 'fallback'
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXX
    channel: '#alerts'
    title: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.cluster }}'
    text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

- name: 'pagerduty-platform'
  pagerduty_configs:
  - service_key: $PAGERDUTY_KEY
    description: '{{ .GroupLabels.alertname }} ({{ .GroupLabels.cluster }})'
    details:
      runbook: '{{ .CommonAnnotations.runbook }}'

- name: 'slack-payments'
  slack_configs:
  - api_url: https://hooks.slack.com/services/PAYMENTS
    channel: '#alerts-payments'

inhibit_rules:
- source_matchers: [alertname = NodeDown]
  target_matchers: [alertname = PodNotReady]
  equal: [cluster, instance]

mute_time_intervals:
- name: business-hours
  time_intervals:
  - times:
    - start_time: 09:00
      end_time: 18:00
    weekdays: [monday:friday]
    location: America/New_York

Common findings this catches

Same alert spamming Slack → repeat_interval too short.
Wrong team paged → catch-all matched first; reorder.
Cascading alerts each pages → inhibition rule needed.
Critical alert grouped with warnings → separate routes.
No resolve notification → send_resolved: true on receiver.
Test alert silenced incorrectly → mute timing too broad.
PagerDuty rate-limit → group + dedupe before send.

When to escalate

Major incident response design — coordinated.
On-call rotation integration — ops team.
Multi-channel routing (PD + Slack + ticket) — strategic.

Alertmanager Routing, Grouping & Receivers Prompt

Why this prompt works

How to use it

Useful commands

Routing example

Common findings this catches

When to escalate

Related prompts

Alert Fatigue Reduction Strategy Prompt

Prometheus Alert Rule Generator Prompt

SLO Error Budget & Multi-Window Burn Rate Alerts Prompt

Why this prompt works

How to use it

Useful commands

Routing example

Common findings this catches

When to escalate

Related prompts

Alert Fatigue Reduction Strategy Prompt

Prometheus Alert Rule Generator Prompt

SLO Error Budget & Multi-Window Burn Rate Alerts Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet