Alertmanager Routing Without Losing Your Mind

Prometheus decides what is wrong. Alertmanager decides who finds out, how, and when. People spend weeks perfecting alert rules and then route everything to one Slack channel, which is like writing a beautiful letter and mailing it to a black hole. After years of running this layer, here’s how I keep Alertmanager sane.

The routing tree, explained simply

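Alertmanager has exactly one root route, and every alert enters there. From the root, an alert descends a tree of child routes, matched on its labels. Sibling routes are checked in order, and the first one that matches takes the alert. The exception is a route that sets continue: true (the default is false), in which case the later siblings are checked as well.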

route:
  receiver: default-slack          # fallback for anything unmatched
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: page
      receiver: pagerduty
      routes:
        - match:
            team: payments
          receiver: payments-pagerduty
    - match:
        severity: ticket
      receiver: jira

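Read it top-down: a severity: page alert labeled team: payments lands on payments-pagerduty, while any other severity: page alert stops at pagerduty. A severity: ticket alert goes to Jira. Everything else falls through to the default Slack channel. The nesting matters: a child route is only evaluated once its parent has matched, and it inherits the parent’s receiver, grouping, and timing settings unless it overrides them.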
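To make the continue behavior concrete, here is a minimal sketch of a route that copies every page to a second destination before normal routing carries on. The audit-webhook receiver is hypothetical and is not part of the config above; it would need its own entry under receivers.

```yaml
route:
  receiver: default-slack
  routes:
    # Hypothetical receiver: gets a copy of every page.
    - match:
        severity: page
      receiver: audit-webhook
      continue: true        # keep evaluating the sibling routes below
    # Without continue: true above, this route would never see the alert.
    - match:
        severity: page
      receiver: pagerduty
```

With this layout a severity: page alert notifies both audit-webhook and pagerduty. Drop the continue line and only the first match is notified.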

Grouping: one notification, not a hundred

When a database goes down, fifty services start erroring at once. Without grouping, that’s fifty pages. Grouping bundles related alerts into a single notification.

group_by: ['alertname', 'cluster', 'service']
group_wait: 30s       # wait 30s to collect related alerts before first send
group_interval: 5m    # then batch updates to the group every 5m
repeat_interval: 4h   # re-page about an unresolved group every 4h

group_by is the lever. Group too broadly (group_by: ['cluster']) and unrelated alerts get merged into one confusing notification. Group too narrowly (by every label the alerts carry) and you’re back to a notification per alert. I usually group by alertname plus the service or cluster, which is enough to collapse the storm without mixing apples and oranges.

A trap worth knowing: group_by: ['...'], with the literal '...' as its only value, means “group by all labels,” which effectively disables grouping. People set it by accident and wonder why the noise came back.
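Grouping and timing settings can also be overridden per route, so one tree can page quickly and file tickets slowly. The sketch below assumes the receivers from the routing example; the specific intervals and the severity: debug route are illustrative guesses, not recommendations.

```yaml
route:
  receiver: default-slack
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: ticket
      receiver: jira
      group_by: ['alertname']   # one ticket per alert type, across services
      group_wait: 2m            # tickets can afford a longer collection window
      repeat_interval: 24h      # and far fewer reminders than a page
    - match:
        severity: debug         # hypothetical label value, for illustration
      receiver: default-slack
      group_by: ['...']         # special value: every alert is its own group
```

Any setting a child route leaves out is inherited from its parent, so the ticket route above still uses group_interval: 5m.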

Inhibition: suppress the obvious follow-ons

When a whole datacenter is unreachable, you don’t also want a page for every service inside it. Inhibition rules let one alert mute others.

inhibit_rules:
  - source_match:
      alertname: DatacenterUnreachable
    target_match:
      severity: page
    equal: ['datacenter']

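This says: if DatacenterUnreachable is firing, suppress any other page alert sharing the same datacenter label. The cause pages you; the symptoms stay quiet. Make sure both sides actually carry that label: Alertmanager treats a missing label as an empty value, so if neither alert has datacenter set, the rule still counts them as equal and mutes the page. Done well, inhibition turns a 40-page storm into one actionable page.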
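Newer Alertmanager releases (0.22 and later) deprecate source_match and target_match in favor of source_matchers and target_matchers. As a sketch, assuming such a version, the rule above looks like this in matcher syntax. The second rule is a hypothetical addition that applies the same idea to severities.

```yaml
inhibit_rules:
  # The datacenter rule from above, rewritten with matchers.
  - source_matchers:
      - alertname = "DatacenterUnreachable"
    target_matchers:
      - severity = "page"
    equal: ['datacenter']
  # Hypothetical: while a page is firing, mute the ticket-level alert
  # for the same alertname and service.
  - source_matchers:
      - severity = "page"
    target_matchers:
      - severity = "ticket"
    equal: ['alertname', 'service']
```

Inhibited alerts are still visible in the Alertmanager UI; they just do not notify anyone while the source alert is firing.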

Silences: planned quiet

Before a deploy or maintenance window, you silence the alerts you know will fire. Silences match on labels and expire automatically.

amtool silence add alertname="HighLatency" service="checkout" \
  --duration="2h" --comment="Deploy in progress, see CHG-1234"

Always set an explicit, short duration and a comment. Every silence has an end time, so the real danger is one set months out, or extended again and again: that is how a real outage gets ignored for a week because someone muted it during a migration in March and forgot. Short expiries are non-negotiable.

Receivers: match the channel to the urgency

A receiver is just a destination plus its config. The discipline is matching channel to urgency:

receivers:
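  - name: pagerduty
    pagerduty_configs:
      - routing_key: <key>
  - name: payments-pagerduty      # referenced by the nested payments route
    pagerduty_configs:
      - routing_key: <payments-key>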
  - name: jira
    webhook_configs:
      - url: http://jira-bridge/alert
  - name: default-slack
    slack_configs:
      - channel: '#alerts-firehose'
        send_resolved: true

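Set send_resolved: true so people know when something recovers; an alert that never sends an “all clear” leaves everyone wondering. Slack receivers default to send_resolved: false, while PagerDuty and webhook receivers default to true, so the Slack entry is the one that needs it spelled out. For high-volume info channels, resolved notifications are a kindness.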
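Resolved notifications only help if they are easy to tell apart from firing ones. One way to do that, sketched here with Alertmanager’s notification template fields, is to put the group status in the Slack title. The wording of the title and text is my own illustration, and the sketch assumes the Slack webhook URL is configured elsewhere (for example under global).

```yaml
receivers:
  - name: default-slack
    slack_configs:
      - channel: '#alerts-firehose'
        send_resolved: true
        # .Status is "firing" while any alert in the group fires, else "resolved".
        # .CommonLabels.alertname is only set when the whole group shares it,
        # which holds when alertname is part of group_by.
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: '{{ .Alerts.Firing | len }} firing, {{ .Alerts.Resolved | len }} resolved'
```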

Test the route before you trust it

amtool will tell you exactly where a hypothetical alert would land, without firing anything:

amtool config routes test \
  --config.file=alertmanager.yml \
  severity=page team=payments
# -> payments-pagerduty

I run this against every new route. It’s the difference between “I think payments alerts go to the right place” and “I proved it.” Routing bugs are invisible until the night they swallow a real page.

Where AI fits

The routing tree, grouping windows, and inhibition rules are fiddly YAML with non-obvious semantics. I describe the org in plain terms — “payment alerts page the payments team, infra alerts page platform, everything else goes to a Slack channel, and a datacenter-down alert should suppress its children” — and let AI scaffold the route tree and inhibition rules. Then I validate with amtool and adjust.

It gets you a structurally sound starting config in seconds instead of an afternoon of cross-referencing the docs. Our monitoring prompts include routing-tree templates, and the Alert Rule Generator emits alerts with the severity and team labels that this routing depends on.

The goal

Good Alertmanager config is invisible. The right person gets one clear notification, the storm collapses to a single page, planned maintenance stays quiet, and nothing important falls through. Get the routing tree, grouping, and inhibition right, and Alertmanager stops being the thing that drowns your team and becomes the thing that protects it.

Generated Alertmanager configs are assistive, not authoritative. Always validate routing with amtool and test in a staging instance before production.