Alertmanager Grouping Timers: group_wait, group_interval, and repeat_interval

Three Alertmanager settings cause more on-call confusion than any others: group_wait, group_interval, and repeat_interval. Their names sound similar, the docs describe them tersely, and the result is that teams reach for the wrong one constantly — shortening group_wait to “get paged faster” and drowning in separate notifications, or shrinking repeat_interval and training everyone to ignore the page channel. These are three independent timers solving three different problems, not stages in a sequence. Once that clicks, tuning them is straightforward. Until it does, every change is a guess.

What each timer controls

Picture a single incident that fires five related alerts over the next minute.

group_wait — how long Alertmanager waits before sending the first notification for a brand-new group. Its job is to let co-firing alerts batch into one page instead of five. A typical value is 30 seconds: long enough to collect the burst, short enough to stay timely.
group_interval — once a group has already notified, how long Alertmanager waits before sending an updated notification when new alerts join that group. This controls how quickly you hear about an incident spreading. A typical value is 5 minutes.
repeat_interval — for alerts that are still firing and unchanged, how long before Alertmanager re-sends the same notification as a reminder. This is the “you still have an open problem” nudge. A typical value is several hours.

route:
  group_by: [alertname, cluster, service]
  group_wait: 30s        # batch the initial burst
  group_interval: 5m     # tell me when new alerts join
  repeat_interval: 4h    # remind me it's still broken

They’re independent: group_wait fires once per group, group_interval governs updates, repeat_interval governs reminders.
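To make the three roles concrete, here is a rough timeline for the five-alert incident under the route above. The arrival times are invented for illustration, and the reminder time is approximate, since Alertmanager checks whether a reminder is due when the group is next flushed rather than at an exact instant.

```yaml
# Hypothetical incident: five alerts sharing alertname, cluster, and service.
# Arrival times are made up for illustration.
#
#   t = 0s      alert 1 arrives    -> new group created, group_wait starts
#   t = 10s     alerts 2-3 arrive  -> join the waiting group
#   t = 30s     group_wait ends    -> notification #1 lists alerts 1-3
#   t = 45s     alerts 4-5 arrive  -> join the already-notified group
#   t = 5m30s   group_interval     -> notification #2 lists alerts 1-5
#   t ~ 4h5m    repeat_interval    -> reminder, if nothing has changed since #2
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```

Each timer shows up exactly once in that timeline, and changing one of them moves only its own row.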

Map the symptom to the timer

Tuning almost always starts with a specific complaint from a retro. The mapping:

“One incident paged me five separate times.” -> group_wait too short, or group_by too granular. Usually the latter.
“The follow-up alert came in way too late.” -> group_interval too long.
“It kept re-paging me about a problem I already acknowledged.” -> repeat_interval too short.

Nudge all three and you’ll fix the symptom by accident and create two new ones. Target the right timer and the change is surgical.
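As a sketch of what a surgical change looks like, suppose the retro complaint was the late follow-up. Only group_interval moves; the 2m value is an illustrative assumption, not a recommendation.

```yaml
# Complaint: "The follow-up alert came in way too late."
# Only group_interval changes; group_wait and repeat_interval stay put.
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 2m     # was 5m; illustrative value, tune to your incidents
  repeat_interval: 4h
```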

The hidden variable: group_by

Here’s what trips people up most. group_wait can’t batch alerts that aren’t in the same group, and grouping is decided by group_by. If your group_by includes a high-cardinality label like instance or pod, then every alert lands in its own group and group_wait batches nothing — five alerts, five groups, five pages, no matter how high you set the timer.

# Defeats batching: every instance is its own group
group_by: [alertname, instance]

# Batches sensibly: groups by the unit you actually respond to
group_by: [alertname, cluster, service]

Choose group_by to match how you want to be paged — by the service or cluster you’d actually go look at — not by the most granular label available. A team that lowers group_wait, sees no change, and concludes the feature is broken almost always has an over-granular group_by.
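A worked example may help here. The two alerts and their label values below are hypothetical; the point is only how the group an alert lands in changes with group_by.

```yaml
# Two hypothetical alerts from one incident (label values are made up):
#   A: {alertname: HighLatency, cluster: prod-eu, service: checkout, instance: web-1}
#   B: {alertname: HighLatency, cluster: prod-eu, service: checkout, instance: web-2}
#
# group_by: [alertname, instance]
#   A -> group {HighLatency, web-1}
#   B -> group {HighLatency, web-2}                   two groups, two pages
#
# group_by: [alertname, cluster, service]
#   A, B -> group {HighLatency, prod-eu, checkout}    one group, one page
route:
  group_by: [alertname, cluster, service]
```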

Tuning with AI as a second opinion

Because these timers interact, talking the change through with an assistant helps — as long as you make it distinguish the three explicitly and verify with a test:

Prompt: My on-call complaint is that one incident pages me several separate times. Here’s my route block: group_by is [alertname, instance], group_wait 30s, group_interval 5m, repeat_interval 4h. Which timer or setting is wrong, and what should I change?

Response: The culprit is group_by, not the timers. Including instance puts every host’s alert in its own group, so group_wait can’t batch them, which is why you get separate pages. Change group_by to [alertname, cluster, service] so co-firing alerts on one service collapse into a single notification. Leave the timers as-is; they’re reasonable. Verify by firing two alerts on the same service and confirming you get one page after ~30s, not two.

The model drafts the diagnosis; you verify it by firing test alerts and watching the actual notification timing. That fire-and-watch step is the human half of the loop: notification cadence is something you confirm with your own eyes before it governs your rotation. The same approach runs through the Alertmanager and alerting prompts.
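One way to run that test, assuming the alerts come from Prometheus, is a throwaway rule file with two always-firing alerts on the same service. Everything in it is hypothetical: the group name, the alert name, the label values, and the severity: test label, which assumes you already have a route that sends test alerts somewhere harmless.

```yaml
# Throwaway Prometheus rule file for checking notification cadence.
# vector(1) always returns a sample, so both alerts fire as soon as the
# rules are evaluated. All names and label values are placeholders.
groups:
  - name: grouping-timer-test
    rules:
      - alert: GroupingTimerTest
        expr: vector(1)
        labels:
          cluster: test
          service: demo
          instance: fake-1
          severity: test
      - alert: GroupingTimerTest
        expr: vector(1)
        labels:
          cluster: test
          service: demo
          instance: fake-2
          severity: test
```

With group_by set to [alertname, cluster, service], both alerts should arrive in one notification roughly group_wait after the first one fires. With instance in group_by, you should see two. Remove the file once the timing is confirmed.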

A safety floor on repeat_interval

One guardrail worth internalizing: never set repeat_interval shorter than the time it realistically takes to acknowledge and act on a page. Re-pages that arrive faster than a human can respond don’t add urgency; they add noise, and noise trains responders to mute the channel. If your worst incidents take 30 minutes to stabilize, a 10-minute repeat_interval means reminders at roughly 10, 20, and 30 minutes, all of them useless because nobody has had time to do anything actionable yet.
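Applied to the 30-minute example above, a route that respects the floor might look like the sketch below. The chosen values are assumptions. The Alertmanager documentation also recommends making repeat_interval a multiple of group_interval, which 4h and 5m satisfy.

```yaml
# Assumption: the worst incidents take about 30 minutes to stabilize.
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  # repeat_interval: 10m  # too short: reminders land while responders are mid-incident
  repeat_interval: 4h     # well above the 30m response time, and a multiple of 5m
```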

The bottom line

group_wait batches the opening burst, group_interval tells you when an incident spreads, and repeat_interval reminds you it’s still open — three independent timers, not a pipeline. Tune by mapping the specific complaint to the specific timer, and check group_by first, because over-granular grouping silently defeats batching no matter what the timers say. For a structured way to turn a retro complaint into a concrete route-block diff, the grouping timers tuning prompt and the routing and grouping prompt walk it end to end, verification step included.
