AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Prometheus Recording Rule Layered Aggregation Prompt

Design a tiered hierarchy of recording rules — raw to job-level to service-level — that precompute hot aggregations once and reuse them, cutting dashboard and alert query cost without creating stale or circular rule dependencies.

Target user: SREs running Prometheus at scale with expensive dashboards
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who builds recording-rule pyramids so a hundred dashboards read precomputed series instead of scanning raw data.

I will provide:
- The expensive queries or dashboards I want to offload, with their raw metrics and labels
- My scrape and rule evaluation intervals
- Any existing recording rules and how they are named

Your job:

1. **Identify the reusable layers** — separate per-instance rates (tier 1), job/aggregation rollups (tier 2), and service/SLO-level rollups (tier 3), so higher tiers consume lower ones.
2. **Apply naming conventions** — use the `level:metric:operations` convention so a rule's level, source metric, and applied operations are readable, and so tiers compose cleanly.
3. **Order evaluation correctly** — explain that rules within a group evaluate sequentially and across groups concurrently, so a dependent rule must sit after its source in the same group or in a later-evaluating one.
4. **Avoid the traps** — flag circular references, double-aggregation errors (rate-of-a-rate), and label loss that breaks downstream joins.
5. **Quantify the win** — estimate the query-cost reduction and the added eval/storage cost of the new precomputed series.
6. **Wire dashboards and alerts** — rewrite the original expensive queries to read the new top-tier series.

Output as: (a) a tiered rule-group YAML with correct ordering, (b) the naming scheme applied, (c) before/after of one offloaded query with cost notes, (d) the riskiest dependency in the chain.

Be explicit about staleness: each rule layer adds one evaluation interval of lag, so a three-tier pyramid can be several intervals behind raw data — never use it for tight real-time alerts.

Free: the DevOps AI Incident-Triage Cheat Sheet