Prometheus Recording Rule Layered Aggregation Prompt
Design a tiered hierarchy of recording rules — raw to job-level to service-level — that precompute hot aggregations once and reuse them, cutting dashboard and alert query cost without creating stale or circular rule dependencies.
- Target user: SREs running Prometheus at scale with expensive dashboards
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who builds recording-rule pyramids so that a hundred dashboards read precomputed series instead of scanning raw data.

I will provide:
- The expensive queries or dashboards I want to offload, with their raw metrics and labels
- My scrape and rule evaluation intervals
- Any existing recording rules and how they are named

Your job:
1. **Identify the reusable layers**: separate per-instance rates (tier 1), job-level rollups (tier 2), and service/SLO-level rollups (tier 3), so higher tiers consume lower ones.
2. **Apply naming conventions**: use the `level:metric:operations` convention so a rule's aggregation level, source metric, and applied operations are readable, and so tiers compose cleanly.
3. **Order evaluation correctly**: explain that rules within a group evaluate sequentially, while separate groups evaluate independently with no ordering guarantee between them. A dependent rule should therefore sit after its source in the same group; if it must live in another group, it may read the source's previous evaluation.
4. **Avoid the traps**: flag circular references, double-aggregation errors (rate-of-a-rate, an average of averages), and label loss that breaks downstream joins.
5. **Quantify the win**: estimate the query-cost reduction and the added evaluation and storage cost of the new precomputed series.
6. **Wire dashboards and alerts**: rewrite the original expensive queries to read the new top-tier series.

Output as:
(a) a tiered rule-group YAML with correct ordering,
(b) the naming scheme applied,
(c) before/after of one offloaded query with cost notes,
(d) the riskiest dependency in the chain.

Be explicit about staleness: each tier that reads its source across a group boundary can add up to one evaluation interval of lag, so a three-tier pyramid can be several intervals behind raw data. Never use it for tight real-time alerts.
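Example of the expected shape

For orientation, output (a) might look roughly like the sketch below. It is illustrative only: the counter `http_requests_total`, its `job`, `instance`, and `service` labels, the group name, and the 30s interval are all assumptions, not taken from any real setup. All three tiers sit in one group, in dependency order, so each rule is evaluated after the rule it reads.

```yaml
groups:
  - name: http_requests_pyramid   # hypothetical group name
    interval: 30s                  # assumed evaluation interval
    rules:
      # Tier 1: per-instance rate computed once from the raw counter.
      - record: instance:http_requests:rate5m
        expr: rate(http_requests_total[5m])

      # Tier 2: job-level rollup. Dropping only `instance` keeps `job`
      # and `service`, so the next tier can still group by `service`.
      - record: job:http_requests:rate5m
        expr: sum without (instance) (instance:http_requests:rate5m)

      # Tier 3: service-level rollup built from tier 2, not from raw data.
      - record: service:http_requests:rate5m
        expr: sum by (service) (job:http_requests:rate5m)

# Before (dashboard scans every raw series on each refresh):
#   sum by (service) (rate(http_requests_total[5m]))
# After (dashboard reads one precomputed series per service):
#   service:http_requests:rate5m
```

Note that every tier sums a rate rather than taking a rate of a sum, and that tier 2 uses `without` instead of `by` so labels needed downstream are not silently dropped.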