Alertmanager Grouping Timers: group_wait, group_interval, and repeat_interval
The three Alertmanager grouping timers are constantly confused. Here's what each one actually controls and how to tune them so pages batch sensibly without re-paging noise.
Three Alertmanager settings cause more on-call confusion than any others: group_wait, group_interval, and repeat_interval. Their names sound similar, the docs describe them tersely, and the result is that teams reach for the wrong one constantly — shortening group_wait to “get paged faster” and drowning in separate notifications, or shrinking repeat_interval and training everyone to ignore the page channel. These are three independent timers solving three different problems, not stages in a sequence. Once that clicks, tuning them is straightforward. Until it does, every change is a guess.
What each timer controls
Picture a single incident that fires five related alerts over the next minute.
- group_wait: how long Alertmanager waits before sending the first notification for a brand-new group. Its job is to let co-firing alerts batch into one page instead of five. A typical value is 30 seconds: long enough to collect the burst, short enough to stay timely.
- group_interval: once a group has already notified, how long Alertmanager waits before sending an updated notification when new alerts join that group. This controls how quickly you hear about an incident spreading. A typical value is 5 minutes.
- repeat_interval: for alerts that are still firing and unchanged, how long before Alertmanager re-sends the same notification as a reminder. This is the “you still have an open problem” nudge. A typical value is several hours.
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s        # batch the initial burst
  group_interval: 5m     # tell me when new alerts join
  repeat_interval: 4h    # remind me it's still broken
They’re independent: group_wait fires once per group, group_interval governs updates, repeat_interval governs reminders.
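To make the three roles concrete, here is a rough timeline for the five-alert incident under the settings above, written as comments on the same route block. The arrival times are invented for illustration and the timestamps are approximate; the sixth alert is a hypothetical addition to show an update.

```yaml
# Approximate timeline for ONE group (arrival times are made up):
#
#   t=0s        first alert fires; a new group is created
#   t=0s-30s    four more related alerts join the same group
#   t=30s       group_wait expires: one notification listing all five
#   t=2m        a sixth alert joins the already-notified group
#   t=5m30s     group_interval expires: updated notification (six alerts)
#   later       the group is re-checked every group_interval; nothing
#               has changed, so nothing is sent
#   t~4h5m30s   repeat_interval has elapsed since the last notification:
#               the reminder goes out at that check
route:
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```

One practical consequence of this sketch: reminders are sent when the group is next checked, so they land on group_interval boundaries, not at an arbitrary second.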
Map the symptom to the timer
Tuning almost always starts with a specific complaint from a retro. The mapping:
- “One incident paged me five separate times.” -> group_wait too short, or group_by too granular. Usually the latter.
- “The follow-up alert came in way too late.” -> group_interval too long.
- “It kept re-paging me about a problem I already acknowledged.” -> repeat_interval too short.
Nudge all three and you’ll fix the symptom by accident and create two new ones. Target the right timer and the change is surgical.
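As a sketch of what a surgical change looks like, suppose the retro complaint was the second one: the follow-up alert arrived too late. Only one line moves. The 2m value here is an assumption chosen for illustration, not a recommendation.

```yaml
route:
  group_by: [alertname, cluster, service]  # unchanged
  group_wait: 30s                          # unchanged
  group_interval: 2m                       # was 5m; illustrative value only
  repeat_interval: 4h                      # unchanged
```

The trade-off is that a shorter group_interval also means more frequent update notifications while an incident is still spreading, so it is worth checking the change against the original complaint before shrinking it further.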
The hidden variable: group_by
Here’s what trips people up most. group_wait can’t batch alerts that aren’t in the same group, and grouping is decided by group_by. If your group_by includes a high-cardinality label like instance or pod, then every alert lands in its own group and group_wait batches nothing — five alerts, five groups, five pages, no matter how high you set the timer.
# Defeats batching: every instance is its own group
group_by: [alertname, instance]
# Batches sensibly: groups by the unit you actually respond to
group_by: [alertname, cluster, service]
Choose group_by to match how you want to be paged — by the service or cluster you’d actually go look at — not by the most granular label available. A team that lowers group_wait, sees no change, and concludes the feature is broken almost always has an over-granular group_by.
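The effect is easier to see with concrete labels. The snippet below is illustrative data, not an Alertmanager config file; the alert name and all label values are made up.

```yaml
# Five alerts from one incident (hypothetical label values).
alerts:
  - {alertname: HighErrorRate, cluster: prod-eu, service: checkout, instance: web-1}
  - {alertname: HighErrorRate, cluster: prod-eu, service: checkout, instance: web-2}
  - {alertname: HighErrorRate, cluster: prod-eu, service: checkout, instance: web-3}
  - {alertname: HighErrorRate, cluster: prod-eu, service: checkout, instance: web-4}
  - {alertname: HighErrorRate, cluster: prod-eu, service: checkout, instance: web-5}

# group_by: [alertname, instance]
#   Each alert has a different instance value, so there are five
#   distinct groups and group_wait has nothing to batch: five pages.
#
# group_by: [alertname, cluster, service]
#   All five share (HighErrorRate, prod-eu, checkout), so they form
#   one group and group_wait batches them into a single page.
```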
Tuning with AI as a second opinion
Because these timers interact, talking the change through with an assistant helps — as long as you make it distinguish the three explicitly and verify with a test:
My on-call complaint is that one incident pages me several separate times. Here’s my route block: group_by is [alertname, instance], group_wait 30s, group_interval 5m, repeat_interval 4h. Which timer or setting is wrong, and what should I change?
The culprit is group_by, not the timers. Including instance puts every host’s alert in its own group, so group_wait can’t batch them, which is why you get separate pages. Change group_by to [alertname, cluster, service] so co-firing alerts on one service collapse into a single notification. Leave the timers as-is; they’re reasonable. Verify by firing two alerts on the same service and confirming you get one page after ~30s, not two.
The model drafts the diagnosis; you verify by firing test alerts and watching the actual notification timing. That fire-and-watch step is the human-verifies half — cadence is something you confirm with your own eyes before it governs your rotation. The same approach runs through the Alertmanager and alerting prompts.
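One way to run that fire-and-watch check, assuming Prometheus is your alert source and you can load a temporary rule file in a non-production setup, is a pair of always-firing test rules. This is a minimal sketch: the group name, alert name, and label values are all hypothetical, and the rules should be removed once the test is done.

```yaml
# Temporary Prometheus rule file for testing grouping behaviour.
# vector(1) always returns a sample, so both alerts fire as soon as
# the rules are evaluated (no "for:" delay is set).
groups:
  - name: grouping-test            # hypothetical group name
    rules:
      - alert: GroupingTest        # same alertname on both rules
        expr: vector(1)
        labels:
          cluster: test
          service: demo
          instance: fake-1
      - alert: GroupingTest
        expr: vector(1)
        labels:
          cluster: test
          service: demo
          instance: fake-2         # only this label differs
```

With group_by set to [alertname, cluster, service], the expectation is a single notification containing both alerts after roughly the group_wait period; with instance in group_by, expect two separate notifications.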
A safety floor on repeat_interval
One guardrail worth hardcoding into your judgment: never set repeat_interval shorter than the time it realistically takes to acknowledge and act on a page. Re-pages that arrive faster than a human can respond don’t add urgency — they add noise, and noise trains responders to mute the channel. If your worst incidents take 30 minutes to stabilize, a 10-minute repeat_interval just means three useless reminders before anyone’s done anything actionable.
The bottom line
group_wait batches the opening burst, group_interval tells you when an incident spreads, and repeat_interval reminds you it’s still open — three independent timers, not a pipeline. Tune by mapping the specific complaint to the specific timer, and check group_by first, because over-granular grouping silently defeats batching no matter what the timers say. For a structured way to turn a retro complaint into a concrete route-block diff, the grouping timers tuning prompt and the routing and grouping prompt walk it end to end, verification step included.