Skip to content
CloudOps
Newsletter
All prompts
AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

PromQL Recording Rules Design Prompt

Design Prometheus recording rules — naming convention, evaluation interval, when to use, retention, multi-cluster patterns.

Target user
SREs designing pre-computed metrics
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior SRE who has designed recording rules for production Prometheus deployments — naming conventions, evaluation intervals, sharding strategies.

I will provide:
- Use case (slow dashboard query, alert simplification, multi-cluster aggregate)
- Current rule file
- Symptom (rules slow, missing metrics, naming confusion)

Your job:

1. **When recording rules**:
   - Frequently queried expressions
   - Computationally expensive (joins, regex, aggregations)
   - Reused across many dashboards / alerts
   - Cross-cluster aggregation results
2. **Naming convention** (recommended):
   - `level:metric:operations` (e.g., `job:http_requests:rate5m`)
   - Colon-separated for clarity
   - `level` = aggregation scope (job, instance, cluster)
   - `operations` = applied (rate5m, sum, avg)
3. **For evaluation interval**:
   - Default matches global scrape interval
   - Group-level `interval` override
   - Faster eval = more accurate, more CPU
4. **For rule groups**:
   - Group rules that depend on each other
   - Sequential evaluation within group
   - Parallel between groups
5. **For storage impact**:
   - Each recording rule = additional time series
   - Plan retention + cardinality
6. **For multi-cluster aggregation**:
   - Per-cluster recording rule with cluster label
   - Federated query against the recorded metric
7. **For staleness**:
   - Rule depends on its input being available
   - If input metric goes stale (5 min default), rule output also stale
8. **For testing**:
   - `promtool check rules <file>` for syntax
   - Run rule expression in `/graph` before committing

Mark DESTRUCTIVE: rules with high cardinality output (TSDB bloat), changing recording rule name without updating consumers (dashboards/alerts break), removing rule with active alerts.

---

Use case: [DESCRIBE]
Current rules (if any):
```yaml
[PASTE]
```

Why this prompt works

Recording rules are powerful but conventions matter. This prompt walks the design.

How to use it

  1. Identify hot queries for rule candidates.
  2. Follow naming convention for discoverability.
  3. Test before committing.
  4. Budget cardinality.

Useful commands

# Validate rules
promtool check rules /etc/prometheus/rules/*.yaml

# Test expression
curl 'http://prometheus:9090/api/v1/query?query=<expression>'

# List rules in Prometheus
curl http://prometheus:9090/api/v1/rules | jq

# Rule eval duration
prometheus_rule_evaluation_duration_seconds

# Rule output cardinality
count(level:metric:operations)

Patterns

Standard recording rules file

groups:
- name: http_aggregates
  interval: 30s
  rules:
  - record: job:http_requests_total:rate5m
    expr: sum by (job)(rate(http_requests_total[5m]))

  - record: job:http_requests_total:rate30m
    expr: sum by (job)(rate(http_requests_total[30m]))

  - record: job:http_request_duration_seconds:p99
    expr: histogram_quantile(0.99, sum by (job, le)(rate(http_request_duration_seconds_bucket[5m])))

  - record: cluster_job:http_requests_total:rate5m:sum
    expr: sum by (cluster, job)(rate(http_requests_total[5m]))

- name: slo_components
  interval: 30s
  rules:
  - record: job:http_requests:rate5m
    expr: sum by (job)(rate(http_requests_total[5m]))

  - record: job:http_errors:rate5m
    expr: sum by (job)(rate(http_requests_total{code=~"5.."}[5m]))

  - record: job:http_availability:ratio5m
    expr: |
      (job:http_requests:rate5m - job:http_errors:rate5m) /
      job:http_requests:rate5m

Naming convention

level:metric:operations

level:        job, instance, cluster, deployment
metric:       original metric name (or descriptive)
operations:   rate5m, sum, p99, count

Examples:
job:http_requests_total:rate5m
instance:node_cpu_usage:avg
cluster:apiserver_request:p99
deployment:replicas:ready_ratio

Common findings this catches

  • Recording rule never queried → remove or document.
  • High cardinality output → aggregate further or drop labels.
  • Recording rules in alphabet order with dependencies → rule order matters within group.
  • Eval slow → simplify or split into rules.
  • Naming inconsistent → standardize.
  • Cross-cluster dedup wrong → tune external_labels.
  • Rule output stale when input missing → check staleness handling.

When to escalate

  • Cluster-wide recording rule strategy — strategic.
  • TSDB capacity from recording rules — capacity planning.
  • Multi-cluster aggregation design — federation.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week