Alertmanager Routing Without Losing Your Mind
Alertmanager's routing tree, grouping, and inhibition decide who gets paged and when. Here's how I configure it so the right person hears the right alert.
- #prometheus
- #alertmanager
- #alerting
- #sre
- #on-call
- #monitoring
Prometheus decides what is wrong. Alertmanager decides who finds out, how, and when. People spend weeks perfecting alert rules and then route everything to one Slack channel, which is like writing a beautiful letter and mailing it to a black hole. After years of running this layer, here’s how I keep Alertmanager sane.
The routing tree, explained simply
Alertmanager has exactly one root route, and every alert enters there. From the root, alerts fall down a tree of child routes, matching on labels. The first matching branch (with continue: false, the default) wins.
route:
  receiver: default-slack   # fallback for anything unmatched
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: page
      receiver: pagerduty
      routes:
        - match:
            team: payments
          receiver: payments-pagerduty
    - match:
        severity: ticket
      receiver: jira
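Read it top-down: a severity: page alert from the payments team lands on payments-pagerduty, while a page from any other team stops at the parent and goes to pagerduty. A severity: ticket alert goes to Jira. Everything else falls through to the default Slack channel. The nesting matters: child routes are only evaluated once the parent has matched, and they inherit the parent’s grouping and timing settings unless they override them.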
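On Alertmanager 0.22 or newer, the same kind of tree can be written with matchers, the list-of-strings syntax that superseded the now-deprecated match and match_re keys. The sketch below also shows continue: true, which lets a page reach a second receiver instead of stopping at the first matching sibling. The audit-slack receiver is hypothetical and would need its own entry under receivers.

```yaml
route:
  receiver: default-slack
  routes:
    # Copy every page to an audit channel, then keep evaluating later siblings.
    - matchers:
        - severity="page"
      receiver: audit-slack      # hypothetical receiver, not defined in this article
      continue: true
    # continue defaults to false, so matching stops in this branch.
    - matchers:
        - severity="page"
      receiver: pagerduty
      routes:
        - matchers:
            - team="payments"
          receiver: payments-pagerduty
    - matchers:
        - severity="ticket"
      receiver: jira
```

Order matters here: the continue: true route has to sit above the route that would otherwise win, because siblings are evaluated in the order they are listed.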
Grouping: one notification, not a hundred
When a database goes down, fifty services start erroring at once. Without grouping, that’s fifty pages. Grouping bundles related alerts into a single notification.
group_by: ['alertname', 'cluster', 'service']
group_wait: 30s # wait 30s to collect related alerts before first send
group_interval: 5m # then batch updates to the group every 5m
repeat_interval: 4h # re-page about an unresolved group every 4h
group_by is the lever. Group too broadly (group_by: ['cluster']) and unrelated alerts get merged into one confusing notification. Group too narrowly (by every label, or close to it) and you’re back to a page per series. I usually group by alertname plus the service or cluster, which is enough to collapse the storm without mixing apples and oranges.
A trap worth knowing: group_by: ['...'], with the literal three-dot string as its only entry, means “group by all labels,” which effectively disables grouping. People set it by accident and wonder why the noise came back.
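Grouping and timing settings are inherited down the tree, so a child route can tighten them for pages while everything else keeps the relaxed defaults. A minimal sketch, where the overridden numbers are illustrative assumptions rather than recommendations:

```yaml
route:
  receiver: default-slack
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: page
      receiver: pagerduty
      group_wait: 10s       # assumed value: send the first page sooner
      repeat_interval: 1h   # assumed value: re-page more often than the default
      # group_by and group_interval are not set here, so they come from the parent
```

Overriding per route keeps the root as the single place where the defaults live, which makes it easier to see at a glance which branches behave differently.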
Inhibition: suppress the obvious follow-ons
When a whole datacenter is unreachable, you don’t also want a page for every service inside it. Inhibition rules let one alert mute others.
inhibit_rules:
  - source_match:
      alertname: DatacenterUnreachable
    target_match:
      severity: page
    equal: ['datacenter']
This says: if DatacenterUnreachable is firing, suppress any other page alert sharing the same datacenter label. The cause pages you; the symptoms stay quiet. Done well, inhibition turns a 40-page storm into one actionable page.
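For reference, here is the same rule in the matcher syntax available since Alertmanager 0.22, as a sketch rather than a drop-in replacement. One behavior to keep in mind: a label listed in equal that is missing from both the source and the target counts as equal, so the rule only scopes by datacenter if the alerts involved actually carry that label.

```yaml
inhibit_rules:
  - source_matchers:
      - alertname="DatacenterUnreachable"
    target_matchers:
      - severity="page"
    # Source and target must agree on this label. If neither alert has it,
    # the two empty values still compare as equal and the rule applies.
    equal: ['datacenter']
```

If DatacenterUnreachable itself carries severity: page, it matches both sides of the rule. Alertmanager guards against an alert inhibiting itself in that case, so the cause still pages.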
Silences: planned quiet
Before a deploy or maintenance window, you silence the alerts you know will fire. Silences match on labels and expire automatically.
amtool silence add alertname="HighLatency" service="checkout" \
--duration="2h" --comment="Deploy in progress, see CHG-1234"
Always set an explicit, short duration and a comment. Alertmanager gives every silence an end time, so the real trap is the overlong one: a months-long silence is how a real outage gets ignored for a week because someone muted it during a migration in March and forgot. Short-lived silences are non-negotiable.
Receivers: match the channel to the urgency
A receiver is just a destination plus its config. The discipline is matching channel to urgency:
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <key>
  - name: jira
    webhook_configs:
      - url: http://jira-bridge/alert
  - name: default-slack
    slack_configs:
      - channel: '#alerts-firehose'
        send_resolved: true
Set send_resolved: true so people know when something recovers — an alert that never sends an “all clear” leaves everyone wondering. For high-volume info channels, resolved notifications are a kindness.
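The send_resolved default is not the same for every integration: at the time of writing, Slack defaults to false, while PagerDuty and generic webhooks default to true. Setting it explicitly on each receiver avoids surprises. In the sketch below, turning it off for the Jira bridge is an assumption about how that bridge behaves (one ticket per incoming notification), not something the bridge requires.

```yaml
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <key>
        send_resolved: true    # lets PagerDuty close the incident on recovery
  - name: jira
    webhook_configs:
      - url: http://jira-bridge/alert
        send_resolved: false   # assumption: the bridge would open a ticket per message
  - name: default-slack
    slack_configs:
      - channel: '#alerts-firehose'
        send_resolved: true
```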
Test the route before you trust it
amtool will tell you exactly where a hypothetical alert would land, without firing anything:
amtool config routes test \
--config.file=alertmanager.yml \
severity=page team=payments
# -> payments-pagerduty
I run this against every new route. It’s the difference between “I think payments alerts go to the right place” and “I proved it.” Routing bugs are invisible until the night they swallow a real page.
Where AI fits
The routing tree, grouping windows, and inhibition rules are fiddly YAML with non-obvious semantics. I describe the org in plain terms — “payment alerts page the payments team, infra alerts page platform, everything else goes to a Slack channel, and a datacenter-down alert should suppress its children” — and let AI scaffold the route tree and inhibition rules. Then I validate with amtool and adjust.
It gets you a structurally sound starting config in seconds instead of an afternoon of cross-referencing the docs. Our monitoring prompts include routing-tree templates, and the Alert Rule Generator emits alerts with the severity and team labels that this routing depends on.
The goal
Good Alertmanager config is invisible. The right person gets one clear notification, the storm collapses to a single page, planned maintenance stays quiet, and nothing important falls through. Get the routing tree, grouping, and inhibition right, and Alertmanager stops being the thing that drowns your team and becomes the thing that protects it.
Generated Alertmanager configs are assistive, not authoritative. Always validate routing with amtool and test in a staging instance before production.