PromQL Recording Rules Design Prompt
Design Prometheus recording rules — naming convention, evaluation interval, when to use, retention, multi-cluster patterns.
- Target user
- SREs designing pre-computed metrics
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior SRE who has designed recording rules for production Prometheus deployments — naming conventions, evaluation intervals, sharding strategies. I will provide: - Use case (slow dashboard query, alert simplification, multi-cluster aggregate) - Current rule file - Symptom (rules slow, missing metrics, naming confusion) Your job: 1. **When recording rules**: - Frequently queried expressions - Computationally expensive (joins, regex, aggregations) - Reused across many dashboards / alerts - Cross-cluster aggregation results 2. **Naming convention** (recommended): - `level:metric:operations` (e.g., `job:http_requests:rate5m`) - Colon-separated for clarity - `level` = aggregation scope (job, instance, cluster) - `operations` = applied (rate5m, sum, avg) 3. **For evaluation interval**: - Default matches global scrape interval - Group-level `interval` override - Faster eval = more accurate, more CPU 4. **For rule groups**: - Group rules that depend on each other - Sequential evaluation within group - Parallel between groups 5. **For storage impact**: - Each recording rule = additional time series - Plan retention + cardinality 6. **For multi-cluster aggregation**: - Per-cluster recording rule with cluster label - Federated query against the recorded metric 7. **For staleness**: - Rule depends on its input being available - If input metric goes stale (5 min default), rule output also stale 8. **For testing**: - `promtool check rules <file>` for syntax - Run rule expression in `/graph` before committing Mark DESTRUCTIVE: rules with high cardinality output (TSDB bloat), changing recording rule name without updating consumers (dashboards/alerts break), removing rule with active alerts. --- Use case: [DESCRIBE] Current rules (if any): ```yaml [PASTE] ```
Why this prompt works
Recording rules are powerful but conventions matter. This prompt walks the design.
How to use it
- Identify hot queries for rule candidates.
- Follow naming convention for discoverability.
- Test before committing.
- Budget cardinality.
Useful commands
# Validate rules
promtool check rules /etc/prometheus/rules/*.yaml
# Test expression
curl 'http://prometheus:9090/api/v1/query?query=<expression>'
# List rules in Prometheus
curl http://prometheus:9090/api/v1/rules | jq
# Rule eval duration
prometheus_rule_evaluation_duration_seconds
# Rule output cardinality
count(level:metric:operations)
Patterns
Standard recording rules file
groups:
- name: http_aggregates
interval: 30s
rules:
- record: job:http_requests_total:rate5m
expr: sum by (job)(rate(http_requests_total[5m]))
- record: job:http_requests_total:rate30m
expr: sum by (job)(rate(http_requests_total[30m]))
- record: job:http_request_duration_seconds:p99
expr: histogram_quantile(0.99, sum by (job, le)(rate(http_request_duration_seconds_bucket[5m])))
- record: cluster_job:http_requests_total:rate5m:sum
expr: sum by (cluster, job)(rate(http_requests_total[5m]))
- name: slo_components
interval: 30s
rules:
- record: job:http_requests:rate5m
expr: sum by (job)(rate(http_requests_total[5m]))
- record: job:http_errors:rate5m
expr: sum by (job)(rate(http_requests_total{code=~"5.."}[5m]))
- record: job:http_availability:ratio5m
expr: |
(job:http_requests:rate5m - job:http_errors:rate5m) /
job:http_requests:rate5m
Naming convention
level:metric:operations
level: job, instance, cluster, deployment
metric: original metric name (or descriptive)
operations: rate5m, sum, p99, count
Examples:
job:http_requests_total:rate5m
instance:node_cpu_usage:avg
cluster:apiserver_request:p99
deployment:replicas:ready_ratio
Common findings this catches
- Recording rule never queried → remove or document.
- High cardinality output → aggregate further or drop labels.
- Recording rules in alphabet order with dependencies → rule order matters within group.
- Eval slow → simplify or split into rules.
- Naming inconsistent → standardize.
- Cross-cluster dedup wrong → tune external_labels.
- Rule output stale when input missing → check staleness handling.
When to escalate
- Cluster-wide recording rule strategy — strategic.
- TSDB capacity from recording rules — capacity planning.
- Multi-cluster aggregation design — federation.
Related prompts
-
Prometheus Alert Rule Generator Prompt
Generate production-quality Prometheus alerting rules with sensible thresholds, labels, and runbook annotations.
-
Prometheus Storage, Retention & TSDB Prompt
Configure Prometheus TSDB — retention, block size, compaction, WAL, disk sizing, troubleshooting OOM / disk-full.
-
PromQL Query Optimization Prompt
Diagnose slow PromQL queries — cardinality explosion, range vector traps, sum vs avg pitfalls, query timeout, recording rules opportunity.