Alertmanager group_wait, group_interval & repeat_interval Tuning Prompt
Tune Alertmanager grouping and repeat timers so related alerts batch into one notification, follow-ups are timely, and re-pages don't become noise.
- Target user: On-call engineers tuning Alertmanager notification cadence
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an Alertmanager expert who treats group_wait, group_interval, and repeat_interval as three distinct timers that solve three distinct problems, and who never copies defaults blindly.

I will provide:
- My current route block with the three timers (or "defaults"): [ROUTE CONFIG]
- The kind of alerts on this route (fast-firing infra, slow SLO burn, batch jobs): [ALERT PROFILE]
- The complaint driving the change (too many separate pages, follow-up came too late, re-paged on a known issue): [SYMPTOM]
- The receiver (PagerDuty, Slack, email) and whether it dedupes on its side: [RECEIVER]

Your job:
1. **Define each timer precisely**: in one line each:
   - group_wait: how long to wait before the FIRST notification for a new group, so co-firing alerts batch together.
   - group_interval: how long to wait before sending an updated notification when NEW alerts join an existing group.
   - repeat_interval: how long before re-sending a notification for alerts that are STILL firing and unchanged.
   Make clear these are independent, not a sequence.
2. **Map symptom to timer**: for my complaint, identify which timer is wrong. (Too many separate pages on one incident -> group_wait too short or grouping labels too narrow. Late follow-ups -> group_interval too long. Annoying re-pages -> repeat_interval too short.)
3. **Account for grouping labels**: point out that group_by interacts with all three; over-narrow group_by (e.g. grouping by instance) defeats group_wait entirely. Recommend group_by labels that match how I want to be paged.
4. **Propose values with rationale**: give specific values tied to my alert profile, and explain the trade-off (lower group_wait = faster but noisier first page).
5. **Show the diff**: present the before/after route block.

Output as:
(a) a 3-row table defining the timers in my words,
(b) which timer my symptom maps to,
(c) the proposed route block diff with inline comments,
(d) one sentence on how to verify (fire a test alert and watch notification timing).

Distinguish the three timers explicitly every time. Never recommend a repeat_interval shorter than the time it realistically takes to acknowledge and act on a page.
Why this prompt works
The three Alertmanager grouping timers are constantly confused for one another because their names sound similar and the documentation describes them tersely. Engineers reach for repeat_interval when they mean group_interval, or shorten group_wait to “get paged faster” and end up with a flood of separate notifications for one incident. This prompt’s first and most important move is to force the model to define all three in one line each and to assert they are independent timers, not stages of a pipeline. Once that mental model is correct, tuning becomes obvious; without it, every change is a guess.
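To make the distinction between the three timers concrete, here is a toy timeline model for a single alert group. It is a sketch under stated assumptions, not Alertmanager's actual code: the function name `notification_times` is hypothetical, all times are in seconds, every alert stays firing once it arrives, and the repeat check is assumed to happen on the same periodic flush that group_interval drives.

```python
def notification_times(arrivals, group_wait, group_interval, repeat_interval, horizon):
    """Toy model: times (in seconds) at which ONE alert group would notify.

    arrivals: times at which alerts join the group (assumed to stay firing).
    horizon:  stop simulating after this many seconds.
    """
    if group_interval <= 0:
        raise ValueError("group_interval must be positive")
    if not arrivals:
        return []

    arrivals = sorted(arrivals)
    sent = []          # times at which a notification went out
    notified = 0       # number of alerts included in the last notification
    tick = arrivals[0] + group_wait  # group_wait delays only the FIRST flush

    while tick <= horizon:
        firing = sum(1 for t in arrivals if t <= tick)
        new_alerts = firing > notified                      # group_interval case
        repeat_due = bool(sent) and tick - sent[-1] >= repeat_interval  # repeat case
        if not sent or new_alerts or repeat_due:
            sent.append(tick)
            notified = firing
        tick += group_interval  # later flushes are spaced by group_interval

    return sent
```

With alerts arriving at t=0, t=10 and t=200, `notification_times([0, 10, 200], 30, 300, 3600, 4500)` returns `[30, 330, 3930]`: the first page at t=30 batches the two co-firing alerts (group_wait), the late arrival is reported at t=330 (group_interval), and the unchanged group is re-sent at t=3930 (repeat_interval). Setting group_wait to 0 moves the first page to t=0, but that page then carries only the first alert.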
The prompt is symptom-driven, which matches how this problem actually arrives: nobody tunes these timers for fun; they tune them because a specific complaint landed in a retro. By mapping the complaint (“too many pages,” “follow-up too late,” “re-paged on a known issue”) to the specific timer at fault, the model gives you a targeted change instead of nudging all three values and hoping. It also surfaces the hidden variable most people miss: group_by. An over-granular grouping key defeats group_wait entirely, because each alert lands in its own group and there is nothing left to batch. A team can raise the timer, see no improvement, and conclude the feature is broken when the real issue is grouping labels.
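The effect of group_by on batching can be illustrated by counting how many distinct groups a burst of alerts falls into. This is a minimal sketch, assuming that a group is identified by the values of its group_by labels; the helper names and the label values (`HighLatency`, `prod-eu`, `node-0` and so on) are hypothetical.

```python
def group_key(labels, group_by):
    """Build the grouping key: the values of the group_by labels, in order."""
    return tuple((name, labels.get(name, "")) for name in group_by)


def count_groups(alerts, group_by):
    """Count how many separate groups (and so separate first pages) result."""
    return len({group_key(labels, group_by) for labels in alerts})


# Five nodes in one cluster fire the same alert during one incident.
alerts = [
    {"alertname": "HighLatency", "cluster": "prod-eu", "instance": f"node-{i}"}
    for i in range(5)
]

coarse = count_groups(alerts, ["alertname", "cluster"])            # one group
narrow = count_groups(alerts, ["alertname", "cluster", "instance"])  # one per node
```

Here `coarse` is 1 and `narrow` is 5. With `instance` in group_by, every node forms its own group, so no value of group_wait can merge the five pages into one.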
Finally, the safety guardrail around repeat_interval reflects a real human cost: re-pages that arrive faster than a responder can act don’t add urgency, they add noise, and noise trains people to mute the channel. Pairing the concrete route-block diff with a verification step (fire a test alert, watch the timing) keeps this firmly in AI-drafts, human-verifies territory — you get a specific config, but you confirm the cadence with your own eyes before it governs your on-call rotation.
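The verification step can be scripted by posting a synthetic alert and then timing the notifications that arrive at the receiver. The sketch below assumes Alertmanager's v2 HTTP API is reachable at `http://localhost:9093` (a placeholder address); the function names and label values are hypothetical, and the alert should be routed to a test receiver rather than a live pager.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(labels, minutes_firing=10, now=None):
    """Build a one-alert payload for POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": labels,
        "annotations": {"summary": "Synthetic alert for timer verification"},
        "startsAt": now.isoformat(),
        # The alert is treated as firing until endsAt passes.
        "endsAt": (now + timedelta(minutes=minutes_firing)).isoformat(),
    }]


def send_alerts(payload, base_url="http://localhost:9093"):
    """POST the payload to Alertmanager; base_url is an assumed address."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A possible usage is `send_alerts(build_test_alert({"alertname": "TimerTest", "cluster": "staging"}))`, followed by noting when the first notification, any follow-up, and the repeat actually arrive, and comparing those times with the configured group_wait, group_interval, and repeat_interval.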
Related prompts
- Alert Fatigue Reduction Strategy Prompt
  Reduce alert fatigue: SLO-based alerts vs symptom-based, severity tiers, runbook integration, deprecating noisy alerts.
- Alertmanager Routing, Grouping & Receivers Prompt
  Design Alertmanager routes: receivers (Slack, PagerDuty), grouping, inhibition, repeat intervals, mute timings.