Routing Monitoring Alerts to Slack Without Drowning in Noise

The fastest way to make an on-call engineer ignore Slack is to dump every alert into one channel. I’ve watched teams pipe a raw firehose into #alerts and then act surprised when nobody notices the real outage buried under forty flapping warnings. After 25 years of running this stuff, I’m convinced alert routing is mostly an exercise in removing noise, not adding integrations.

Here’s how I route alerts to Slack so the signal actually reaches a human.

Start with the channel topology

Before any config, decide where alerts go. My default layout:

#alerts-critical: page-worthy, customer-impacting. This is the one channel where notifications stay on.
#alerts-warning: degradation, capacity headroom, things to look at during business hours.
#alerts-<team>: routed by ownership label so each team sees only its own services.

One channel for everything guarantees alert fatigue. Routing by severity and ownership is what keeps each channel quiet enough that a new message means something.

Configure Alertmanager routing

The routing tree in Alertmanager is where the work happens. A trimmed example (the slack-warning and slack-payments receivers it references are omitted for brevity):

route:
  receiver: slack-warning
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [ 'severity="critical"' ]
      receiver: slack-critical
      repeat_interval: 1h
    - matchers: [ 'team="payments"' ]
      receiver: slack-payments

receivers:
  - name: slack-critical
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .CommonLabels.alertname }}'
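        # Literal block (|-) keeps the line break, so each alert gets its own line;
        # a folded block (>-) would join the alerts with spaces instead.
        text: |-
          {{ range .Alerts }}*{{ .Labels.severity }}* {{ .Annotations.summary }}
          {{ end }}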

Three settings do most of the noise reduction:

group_by collapses twenty pod alerts into one grouped message. This alone cuts message volume dramatically.
group_wait holds the first notification for a new group (30s here) so a burst lands as one message instead of a spray, and group_interval (5m here) spaces out updates when more alerts join that group.
send_resolved: true makes the channel show recoveries, not just firings. A channel that only ever shows problems and never shows them clearing trains people to distrust it.
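One routing detail the trimmed tree does not spell out: Alertmanager tries sibling routes in order and, by default, stops at the first match. With the tree above, a critical payments alert would land in #alerts-critical only. If the owning team should also see it, a sketch along these lines uses continue: true to keep evaluating; whether the duplicate message is worth it is a judgment call for your team.

```yaml
route:
  receiver: slack-warning
  routes:
    - matchers: [ 'severity="critical"' ]
      receiver: slack-critical
      repeat_interval: 1h
      # Keep evaluating the sibling routes below after this one matches.
      continue: true
    - matchers: [ 'team="payments"' ]
      receiver: slack-payments
```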

Make the message tell you what to do

A good alert message answers three questions without a click: what’s broken, how bad, and where do I look. Severity is already a label; put the summary, the runbook link, and the dashboard link in annotations on the alerting rule so the Slack template can render them:

annotations:
  summary: "p99 checkout latency {{ $value }}s (>1s for 5m)"
  runbook: "https://runbooks.internal/checkout-latency"
  dashboard: "https://grafana.internal/d/checkout"

If an engineer has to leave Slack to figure out whether an alert matters, the message has already failed.
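Annotations only help if the receiver template prints them. A minimal sketch of a Slack receiver that does, assuming the annotation keys are named runbook and dashboard as above; the link labels and the status prefix in the title are illustrative choices, not requirements:

```yaml
receivers:
  - name: slack-critical
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        # .Status is "firing" or "resolved" for the whole notification.
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        # Literal block so each alert and its links land on their own lines.
        # <url|label> is Slack's link syntax; the if-guards skip missing annotations.
        text: |-
          {{ range .Alerts }}*{{ .Labels.severity }}* {{ .Annotations.summary }}
          {{ if .Annotations.runbook }}<{{ .Annotations.runbook }}|Runbook>{{ end }} {{ if .Annotations.dashboard }}<{{ .Annotations.dashboard }}|Dashboard>{{ end }}
          {{ end }}
```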

Kill the flapping alerts

Flapping — an alert that fires and resolves every few minutes — is the top destroyer of channel trust. Fixes, in order of preference:

Add a for: duration to the alerting rule so a metric must stay bad before it fires. Most flapping dies here.
Tune the threshold if it’s genuinely too sensitive.
Inhibit dependent alerts so a node-down alert suppresses the fifty pod-down alerts it caused.
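The first fix in that list, sketched as a Prometheus alerting rule. The alert name, metric name, and label values here are hypothetical; only the for: field is the point:

```yaml
groups:
  - name: checkout
    rules:
      - alert: CheckoutLatencyHigh   # hypothetical alert name
        # Hypothetical histogram metric; substitute your own latency series.
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))
          ) > 1
        # The condition must hold continuously for 5 minutes before the alert fires.
        for: 5m
        # Newer Prometheus releases also accept keep_firing_for, which holds the
        # alert open for a while after the condition clears; check your version.
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "p99 checkout latency {{ $value }}s (>1s for 5m)"
```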

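Inhibition rules are underused, and they are very effective at cutting cascade noise. The rule below mutes warning-severity alerts while NodeDown is firing with the same node label value, so it only helps if the dependent alerts carry that label: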

inhibit_rules:
  - source_matchers: [ 'alertname="NodeDown"' ]
    target_matchers: [ 'severity="warning"' ]
    equal: ['node']

Add AI summaries for the noisy moments

When a real incident hits, the channel still fills up fast — that’s unavoidable. This is where AI earns its place. I pipe the grouped alert payload through an LLM with a tight prompt:

“Here are the alerts that fired in the last 10 minutes. Group them by likely root cause, name the single most probable underlying failure, and list which alerts are symptoms of it. Do not suggest any remediation.”

The bot posts a one-paragraph summary above the raw alerts: “Most of these are downstream of the database connection-pool exhaustion that started at 02:09.” That’s the difference between an on-call engineer reading forty messages and reading one.

The boundary holds firm: AI summarizes and correlates; it never silences or fires alerts, and it never remediates. It’s a reading aid layered on top of deterministic routing, not a replacement for it.
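One way to wire this without letting the summarizer near the routing decisions is to add a webhook alongside the Slack integration on the same receiver. This is a sketch under assumptions: the summarizer service and its URL are hypothetical, and the service itself (which calls the LLM and posts to Slack) is not shown.

```yaml
receivers:
  - name: slack-critical
    slack_configs:
      # Raw alerts still go to Slack through the deterministic path.
      - channel: '#alerts-critical'
        send_resolved: true
    webhook_configs:
      # Hypothetical internal service that receives the grouped alert payload,
      # asks the LLM for a summary, and posts it to the channel.
      - url: 'http://alert-summarizer.internal:8080/alertmanager'
        # The summarizer only needs firing alerts in this sketch.
        send_resolved: false
```

The webhook receives the same grouped payload Slack does, so the summary and the raw alerts describe the same set of alerts, and the summarizer has no path back into routing, silencing, or inhibition.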

A pre-launch checklist

Before you point production alerts at Slack:

Every critical alert has a runbook link in its annotation.
Channels are split by severity and ownership, not lumped together.
for: durations exist on anything prone to flapping.
send_resolved is on so recoveries show.
Inhibition rules suppress obvious cascades.
The Slack webhook URL lives in a secrets manager, not the repo.
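For the last item, recent Alertmanager versions let the Slack receiver read its webhook URL from a file rather than inline config, which pairs well with a secrets manager that mounts secrets as files. A sketch, assuming your version supports api_url_file and using a hypothetical mount path:

```yaml
receivers:
  - name: slack-critical
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        # Hypothetical path where the secrets manager mounts the webhook URL.
        # Use this instead of an inline api_url so the URL never enters the repo.
        api_url_file: /etc/alertmanager/secrets/slack-webhook-url
```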

The real goal

The metric I care about isn’t “alerts delivered to Slack.” It’s “every message in #alerts-critical got read and acted on.” Routing, grouping, inhibition, and a tight for: clause get you most of the way there; AI summaries handle the genuinely busy moments.

For the prompt patterns I use to summarize and correlate alert storms, see our Slack and alerting prompts and the full prompt library.

Routing config is deterministic and lives in your repo. Keep AI in the read-and-summarize lane and let Alertmanager own what fires.