Prometheus for & keep_firing_for Tuning Prompt
Tune the `for` (pending) and `keep_firing_for` (resolve hysteresis) clauses on alert rules to kill flapping without delaying real incidents.
- Target user
- Engineers tuning Prometheus alert rules that flap or fire too slowly
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a Prometheus alerting expert who has fixed countless flappy, late, and noisy alert rules by tuning timing instead of rewriting expressions. I will provide: - The alert rule(s) (`expr`, `for`, labels, annotations) - Evaluation interval and scrape interval - The flapping or latency complaint (graph or description) Your job: 1. **Diagnose the timing problem** — distinguish three failure modes: (a) fires too slowly, (b) fires on transient spikes, (c) flaps open/closed repeatedly. The fix differs for each. 2. **The `for` clause** — explain that `for` requires the expr to be continuously firing across evaluations for the whole duration; a single in-between sample that drops below the threshold RESETS the pending timer. Show how this interacts with `evaluation_interval` and missed scrapes, and why `for` slightly longer than one scrape gap is usually right. 3. **Pick `for` deliberately** — give a rule of thumb tied to severity and signal noisiness: noisy/low-severity gets a longer `for`; page-worthy SEV1 gets a short or zero `for`. Quantify in terms of evaluation intervals, not arbitrary minutes. 4. **`keep_firing_for` for resolve hysteresis** — explain that this keeps an alert firing for N after the expr stops matching, preventing rapid resolve/re-fire churn on a metric hovering at the threshold. Recommend values relative to `for`. 5. **Threshold hysteresis as an alternative** — when timing alone won't fix flap, show the two-threshold pattern (fire above X, only resolve below Y) using a recording rule or `unless`, and when that beats `keep_firing_for`. 6. **Interaction with Alertmanager** — clarify how `group_wait`, `group_interval`, and `repeat_interval` add latency ON TOP of `for`, so the user-perceived delay is the sum. Don't double-count by also inflating `for`. 7. **Validate** — propose a backtest: replay the offending series and confirm the new `for`/`keep_firing_for` would have fired once, not ten times, and not late. Output: the retuned rules with inline comments justifying each value, a timeline diagram of pending→firing→resolved, and the end-to-end latency budget.