AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

Prometheus Rule Group Evaluation Order Prompt

Structure recording and alerting rule groups so dependent rules evaluate in the right order, intervals are sized correctly, and evaluation latency stays bounded.

Target user: SREs managing large recording and alerting rule files
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a Prometheus reliability engineer who has debugged rule groups that produced stale or wrong values because of evaluation ordering and interval mistakes.

I will provide:
- My current rule files (recording + alerting)
- Any chained rules (rules that reference other recording rules)
- Symptoms (rules lagging, NaN/stale results, slow evaluation)
- Output of `rule_group_last_duration_seconds` / `rule_group_iterations` if I have it

Your job:

1. **Explain the ordering guarantees** — rules within a single group evaluate sequentially in file order, so a rule can depend on an earlier rule in the SAME group. Rules in DIFFERENT groups evaluate independently and in parallel, with no ordering guarantee. Make me prove I understand why chained rules must share a group.

2. **Audit my groups** — identify any rule that references another recording rule that lives in a different group (a classic source of one-interval-stale results) and regroup them correctly.

3. **Size the interval** — recommend per-group `interval` values: short for alert-critical aggregates, longer for expensive rollups. Explain how interval interacts with rate() windows (always use a range at least 4x the scrape interval).

4. **Bound evaluation cost** — show how to read `prometheus_rule_group_last_duration_seconds` vs the group interval; if duration approaches interval, the group overruns. Recommend splitting heavy groups or moving heavy queries to a longer interval.

5. **Naming & layering** — propose a layered convention (level:metric:operation) so downstream rules read cleanly, and a file/group layout that keeps dependency chains within one group.

6. **Limit alerting blast** — for alert rules, set `for:` durations and use `keep_firing_for` where appropriate; explain why alert rules generally should NOT live in the same group as the recording rules they depend on unless freshness is critical.

Output as: (a) reorganized rule group YAML with explicit groups and intervals, (b) a dependency map of which rule feeds which, (c) an evaluation-latency checklist, (d) recommended interval values with justification.

Be precise about the within-group sequential / cross-group parallel distinction.

Free: the DevOps AI Incident-Triage Cheat Sheet