Alertmanager Inhibition Rules and Silences Done Right

The first time a rack switch died on me, my phone melted. Not one page, but forty-three: every node behind that switch went NotReady, every pod on those nodes went unschedulable, and every blackbox probe timed out at once. The actual problem was a single line in kubectl describe node. The other forty-two alerts were just echoes. That night taught me a lesson I keep relearning: routing decides who gets an alert, but inhibition and silences decide whether the noise ever reaches them at all. They are a completely different layer of Alertmanager, and most teams underuse them.

This post is about that layer. I’ll show real inhibit_rules YAML, walk through amtool for silences, and explain why the AI-generated version of all this is a great first draft and a terrible last word.

Inhibition Is Not Routing

People conflate the two because both live in alertmanager.yml. They shouldn’t. Routing (the route tree) takes a firing alert and decides which receiver and which grouping it lands in. Inhibition runs before notification and decides whether a firing alert should be muted entirely because a more important alert is already active.

The classic example: if ClusterDown is firing, you do not also want forty NodeDown pages for nodes in that cluster. The cluster alert inhibits the node alerts. They’re still firing in Prometheus, still visible in the UI, but Alertmanager won’t notify on them.

If you want a refresher on the routing side specifically, I wrote that up separately in Alertmanager routing without losing your mind. This post stays on inhibition and silences.

The Anatomy of an inhibit_rule

An inhibition rule has three parts:

source_matchers — the alert that, when firing, does the muting.
target_matchers — the alerts that get muted while a source is active.
equal — the label names that must match identically between source and target for the suppression to apply.

That equal field is the whole game. Without it, a single ClusterDown in cluster A would inhibit NodeDown in cluster B too. equal scopes the suppression to the right blast radius.

inhibit_rules:
  # A whole cluster being down should silence per-node noise in that cluster.
  - source_matchers:
      - 'alertname = "ClusterDown"'
    target_matchers:
      - 'alertname = "NodeDown"'
    equal:
      - cluster

  # A node being down should silence per-pod / probe noise on that node.
  - source_matchers:
      - 'alertname = "NodeDown"'
    target_matchers:
      - 'severity =~ "warning|info"'
    equal:
      - cluster
      - node

Read the second rule carefully: when NodeDown fires, any warning or info alert sharing the same cluster and node is suppressed. Critical alerts are deliberately left alone — you usually still want those, even on a dead node, because they may indicate something the node-down alert doesn’t capture.
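To make the matching logic concrete, here is a minimal Python sketch of how I understand the documented inhibition semantics. It is a toy model, not Alertmanager's code: the function names, the tuple form of the matchers, and the sample alerts (PodCrashLooping, DiskFull, cluster "b", node "n7") are all made up for illustration. It assumes a missing label behaves like an empty value and that regex matchers are anchored.

```python
import re

def matches(matchers, labels):
    """True if every (name, op, value) matcher accepts the alert's labels."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # assumption: missing label == empty value
        if op == "=":
            ok = actual == value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None  # anchored regex
        else:
            raise ValueError(f"unsupported operator in this sketch: {op}")
        if not ok:
            return False
    return True

def is_inhibited(target, firing, rules):
    """True if any firing source alert mutes `target` under any rule."""
    for rule in rules:
        if not matches(rule["target"], target):
            continue
        target_is_source = matches(rule["source"], target)
        for source in firing:
            if not matches(rule["source"], source):
                continue
            # An alert matching both sides of a rule is not muted by alerts
            # that also match both sides (this covers self-inhibition).
            if target_is_source and matches(rule["target"], source):
                continue
            if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
                return True
    return False

# The two rules from the YAML above, in the sketch's tuple form.
RULES = [
    {"source": [("alertname", "=", "ClusterDown")],
     "target": [("alertname", "=", "NodeDown")],
     "equal": ["cluster"]},
    {"source": [("alertname", "=", "NodeDown")],
     "target": [("severity", "=~", "warning|info")],
     "equal": ["cluster", "node"]},
]

# Hypothetical firing alerts.
firing = [
    {"alertname": "ClusterDown", "cluster": "a", "severity": "critical"},
    {"alertname": "NodeDown", "cluster": "a", "node": "n1", "severity": "critical"},
    {"alertname": "NodeDown", "cluster": "b", "node": "n7", "severity": "critical"},
    {"alertname": "PodCrashLooping", "cluster": "b", "node": "n7", "severity": "warning"},
    {"alertname": "DiskFull", "cluster": "b", "node": "n7", "severity": "critical"},
]

for alert in firing:
    state = "muted" if is_inhibited(alert, firing, RULES) else "notifies"
    print(f"{alert['alertname']:<16} cluster={alert['cluster']} -> {state}")
```

In this toy run, NodeDown in cluster a is muted by ClusterDown, NodeDown in cluster b still notifies because the cluster label differs, the warning on node n7 is muted by that NodeDown, and the critical DiskFull on the same node gets through.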

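Pro Tip: The source_matchers / target_matchers list syntax, added in Alertmanager 0.22, replaces the deprecated source_match, source_match_re, target_match, and target_match_re maps. Inside a matcher string, = is an exact match and =~ is a regex match; != and !~ are the negated forms. Regexes are anchored, so "warning|info" matches only those two whole values. If you inherited an old config still using source_match_re:, migrate it now, because deprecated fields are candidates for removal in a future release.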

A Severity-Based Inhibition Pattern

A pattern I lean on constantly: let a higher severity for the same alert on the same target mute the lower one. This stops a single degrading service from paging you twice.

inhibit_rules:
  - source_matchers:
      - 'severity = "critical"'
    target_matchers:
      - 'severity = "warning"'
    equal:
      - alertname
      - cluster
      - service

Now if LatencyHigh{severity="critical"} and LatencyHigh{severity="warning"} are both firing for the same service, only the critical one notifies. The equal list pins the match to the same alert, cluster, and service, so an unrelated critical alert can’t swallow a warning you cared about.

The trap here is too few labels in equal. If you only put alertname, one critical latency alert anywhere mutes every warning latency alert everywhere. Always ask: “what is the narrowest set of labels that defines the same real-world thing?”
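A quick way to feel the difference is to count what each equal list would mute. The Python sketch below is only an illustration under assumptions: muted_warnings is a hypothetical helper, the service names and the prod-us cluster are invented, and it models just the critical-mutes-warning rule from above.

```python
def muted_warnings(alerts, equal):
    """Return the warnings some critical alert would mute under `equal`."""
    criticals = [a for a in alerts if a["severity"] == "critical"]
    warnings = [a for a in alerts if a["severity"] == "warning"]
    return [
        w for w in warnings
        if any(all(c.get(k, "") == w.get(k, "") for k in equal) for c in criticals)
    ]

# One critical LatencyHigh plus three warnings in different places (hypothetical).
alerts = [
    {"alertname": "LatencyHigh", "severity": "critical", "cluster": "prod-eu", "service": "checkout"},
    {"alertname": "LatencyHigh", "severity": "warning", "cluster": "prod-eu", "service": "checkout"},
    {"alertname": "LatencyHigh", "severity": "warning", "cluster": "prod-eu", "service": "search"},
    {"alertname": "LatencyHigh", "severity": "warning", "cluster": "prod-us", "service": "checkout"},
]

narrow = muted_warnings(alerts, ["alertname", "cluster", "service"])
broad = muted_warnings(alerts, ["alertname"])

print(f"narrow equal mutes {len(narrow)} warning(s)")
print(f"broad equal mutes {len(broad)} warning(s)")
```

With the full equal list only the checkout warning in prod-eu is muted; with alertname alone, all three warnings disappear behind one critical alert.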

Silences: The Manual Override

Inhibition is declarative and automatic. Silences are imperative and human-driven — you create them when you know an alert is expected (a maintenance window, a planned reboot, a known-flaky dependency). A silence matches on labels and has an explicit expiry.

The CLI tool is amtool. Create a silence with a duration:

amtool silence add \
  alertname="NodeDown" cluster="prod-eu" node="ip-10-2-3-4" \
  --comment "Planned reboot, INC-4821" \
  --duration 2h \
  --alertmanager.url http://localhost:9093

List active silences and query them:

# Show all active silences
amtool silence query --alertmanager.url http://localhost:9093

# Only silences touching a given alert
amtool silence query alertname="NodeDown" --alertmanager.url http://localhost:9093

Expire (end early) a silence by ID once the maintenance finishes:

amtool silence expire <silence-id> --alertmanager.url http://localhost:9093

You can do all of this over the HTTP API too, which is what your runbooks and bots should call:

curl -s http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
        "matchers": [
          {"name": "alertname", "value": "NodeDown", "isRegex": false, "isEqual": true},
          {"name": "cluster",   "value": "prod-eu",  "isRegex": false, "isEqual": true}
        ],
        "startsAt": "2026-06-17T09:00:00Z",
        "endsAt":   "2026-06-17T11:00:00Z",
        "createdBy": "deploy-bot",
        "comment": "Planned reboot, INC-4821"
      }'

Note isEqual — set it to false and you’ve built a negative matcher (silence everything except this value), which is occasionally exactly what you want and frequently a foot-gun.
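If a bot is the one creating silences, it helps to compute the window from a duration rather than hand-typing timestamps. Here is a minimal Python sketch using only the standard library. build_silence and post_silence are hypothetical helper names, the base URL is the same localhost placeholder as above, and I have not shown the response body because its exact shape should be checked against your Alertmanager version.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(labels, duration, created_by, comment, now=None):
    """Build a v2 silence payload with exact-match matchers and a bounded window."""
    if not labels:
        raise ValueError("refusing to build a silence with no matchers")
    start = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    end = start + duration
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in sorted(labels.items())
        ],
        "startsAt": start.strftime(fmt),
        "endsAt": end.strftime(fmt),
        "createdBy": created_by,
        "comment": comment,
    }

def post_silence(payload, base_url="http://localhost:9093"):
    """POST the payload to the silences endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

# Reproduce the curl example: a two-hour window starting at 09:00 UTC.
payload = build_silence(
    {"alertname": "NodeDown", "cluster": "prod-eu"},
    timedelta(hours=2),
    created_by="deploy-bot",
    comment="Planned reboot, INC-4821",
    now=datetime(2026, 6, 17, 9, 0, tzinfo=timezone.utc),
)
print(json.dumps(payload, indent=2))
# post_silence(payload)  # uncomment when a real Alertmanager is reachable
```

Keeping payload construction separate from the network call means the scoping logic can be unit-tested without a running Alertmanager.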

Always Set an Expiry, Never Silence Broadly

The two failure modes of silences are the same failure mode at different scales: silencing too much, for too long.

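Alertmanager requires every silence to carry an end time (amtool falls back to a default duration if you omit one), so a truly endless silence isn’t possible. A comically distant endsAt has the same effect, though: a permanent blind spot nobody remembers creating. Six months later an incident review discovers that the alert that would have caught the outage was silenced by someone who has since left the company.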

# Too broad — this mutes EVERY alert in the cluster. Almost never what you want.
amtool silence add cluster="prod-eu" --duration 720h --comment "ignore for now"

Don’t do that. Scope to the specific alertname plus the specific instance labels, and use the shortest duration that covers the work. If you genuinely need a recurring quiet window (nightly batch jobs, scheduled backups), that’s a job for inhibition or a time-based mute such as a route’s mute_time_intervals, not a hand-rolled 30-day silence.
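One way to make that rule stick is a small policy check in front of whatever creates silences. The sketch below is a suggestion, not an Alertmanager feature: check_silence is a hypothetical helper, and the 8-hour cap is an arbitrary number I picked for the example.

```python
from datetime import timedelta

MAX_DURATION = timedelta(hours=8)  # hypothetical policy cap, pick your own

def check_silence(labels, duration):
    """Return a list of policy problems; an empty list means the silence looks scoped."""
    problems = []
    if "alertname" not in labels:
        problems.append("no alertname matcher")
    if not set(labels) - {"alertname"}:
        problems.append("no instance-level label besides alertname")
    if duration > MAX_DURATION:
        problems.append(f"duration {duration} exceeds cap {MAX_DURATION}")
    return problems

# The too-broad silence from above versus the scoped planned-reboot silence.
broad = check_silence({"cluster": "prod-eu"}, timedelta(hours=720))
scoped = check_silence(
    {"alertname": "NodeDown", "cluster": "prod-eu", "node": "ip-10-2-3-4"},
    timedelta(hours=2),
)

print("broad:", broad)
print("scoped:", scoped)
```

The broad silence trips two checks (no alertname, duration over the cap) while the scoped one passes cleanly. A wrapper script or chat bot could refuse to call amtool until the list comes back empty.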

Pro Tip: Run amtool silence query from a daily cron job and post anything expiring in the next 24h, or anything more than a few days old, to your team channel. Stale silences are how alerting quietly rots.
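The audit itself can be a few lines once you have the silence list as JSON. This Python sketch assumes each entry carries id, startsAt, endsAt, and status.state fields, as I understand the v2 API to return them; verify that against your own instance. The function names, thresholds, and sample IDs are invented for the example.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_ts(value):
    """Parse an RFC 3339 timestamp, dropping fractional seconds and a trailing Z."""
    base = re.sub(r"\.\d+", "", value).replace("Z", "+00:00")
    return datetime.fromisoformat(base)

def audit(silences, now, expiring_within=timedelta(hours=24), max_age=timedelta(days=3)):
    """Return (id, reason) pairs for active silences that deserve a second look."""
    findings = []
    for silence in silences:
        if silence.get("status", {}).get("state") != "active":
            continue
        starts = parse_ts(silence["startsAt"])
        ends = parse_ts(silence["endsAt"])
        if ends - now <= expiring_within:
            findings.append((silence["id"], "expiring soon"))
        if now - starts >= max_age:
            findings.append((silence["id"], "long-running"))
    return findings

# Hypothetical sample of what the silence list might contain.
sample = [
    {"id": "a1", "status": {"state": "active"},
     "startsAt": "2026-06-17T09:00:00Z", "endsAt": "2026-06-17T14:00:00.000Z"},
    {"id": "b2", "status": {"state": "active"},
     "startsAt": "2026-06-01T00:00:00Z", "endsAt": "2026-07-01T00:00:00Z"},
    {"id": "c3", "status": {"state": "active"},
     "startsAt": "2026-06-17T08:00:00Z", "endsAt": "2026-06-19T08:00:00Z"},
    {"id": "d4", "status": {"state": "expired"},
     "startsAt": "2026-05-01T00:00:00Z", "endsAt": "2026-05-02T00:00:00Z"},
]

now = datetime(2026, 6, 17, 12, 0, tzinfo=timezone.utc)
report = audit(sample, now)
for silence_id, reason in report:
    print(f"{silence_id}: {reason}")
```

Here a1 is flagged as expiring soon and b2 as long-running, while the fresh two-day silence and the expired one are left alone. Posting the report to a channel is left out on purpose, since that part depends entirely on your chat tooling.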

Where AI Fits — and Where It Doesn’t

I draft a lot of this config with an AI assistant now, and I treat it exactly like a fast, eager junior engineer: brilliant at boilerplate, occasionally confidently wrong about your topology. It will happily generate a plausible inhibit_rules block in seconds — but it doesn’t know whether your real label is cluster or cluster_id, and it can’t know that your equal list is missing node. The AI produces a draft; a human verifies the matchers against actual firing alerts.

That’s the right division of labor. Let the model write the explainable first version, then review every source_matchers, target_matchers, and equal line against questions only you can answer: What’s the real blast radius? Which severities must always notify? Our free Alert Rule Generator leans into this — it emits readable, commented YAML you can diff and reason about, not a black box you paste blind. The same review discipline applies whether you draft alerts in Claude or anywhere else: output you can explain is output you can ship.

If you also want your alert rules to stop crying wolf in the first place, fewer alerts means fewer inhibition rules to write — I covered that in alert rules that don’t page you falsely. And you’ll find more monitoring write-ups under prometheus-monitoring.

Wrapping Up

Inhibition and silences are the quiet half of Alertmanager, and they’re where you win back your nights. Use inhibit_rules with a tight equal scope to let big alerts suppress their downstream echoes automatically. Use amtool and the API for human silences — always scoped, always expiring. And whether you write the YAML by hand or have an AI draft it, read it like the junior-engineer output it is: explainable, reviewable, and shipped only once a human nods.