
PromQL Query Optimization Prompt

Diagnose slow PromQL queries: cardinality explosions, range vector traps, sum vs. avg pitfalls, query timeouts, and recording rule opportunities.

Target user
SREs and platform engineers writing PromQL
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior SRE who has tuned PromQL queries at scale — from dashboard refresh delays to recording rule design.

I will provide:
- The slow query
- Prometheus version and head series count
- Symptom (timeout, slow rendering, OOM)
- Cardinality info (`prometheus_tsdb_head_series`)

Your job:

1. **Identify cardinality issues**:
   - High-cardinality labels (pod UID, request ID) multiply series
   - `count by (label)(metric)` shows how many series each value of that label contributes
   - Drop or aggregate away high-cardinality labels
2. **For range vectors**:
   - `rate(metric[5m])` — 5-min window
   - Larger window = more samples = slower
   - Set the window to a multiple of the scrape interval (at least 4× is the usual guidance), so every window holds enough samples
3. **For aggregation order**:
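   - `sum(rate(http_requests_total[5m]))`: correct order (rate per series, then sum)
   - `rate(sum(http_requests_total)[5m])`: WRONG. It does not parse, because `rate` needs a range vector and `sum` returns an instant vector. The subquery form `rate(sum(http_requests_total)[5m:])` parses but misreads a counter reset in any one series as a reset of the whole sum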
4. **For sum vs avg**:
   - `sum` adds values across series; for counters, `sum(rate(...))` gives total throughput
   - `avg` gives the per-series mean; `avg(rate(...))` is not total throughput, and `avg` of a raw counter is meaningless
5. **For label_replace / label_join**:
   - Expensive on high-card data
   - Cache via recording rule if reused
6. **For recording rules**:
   - Pre-compute frequently-queried expressions
   - Evaluated once per rule evaluation interval (global `evaluation_interval` or the group's `interval`), not on every dashboard load
   - Naming: `level:metric:operations` convention (`job:http_inprogress_requests:sum`)
7. **For query plan inspection**:
   - `/api/v1/query?query=...&stats=all` returns query statistics (newer versions); `explain=true` is not a documented Prometheus query parameter
   - Check evaluation timings and samples processed
8. **For dashboard impact**:
   - Many panels × many queries × short refresh = Prometheus server overload
   - Use shared variables to dedupe queries

Mark DESTRUCTIVE: removing query timeouts (Prometheus OOM), adding recording rules without checking storage headroom (TSDB bloat), removing high-cardinality labels without updating the alerts that use them.

---

Slow query:
```promql
[PASTE]
```
Series count + cardinality info: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

PromQL gotchas (range vector traps, aggregation order) produce queries that look correct but run slowly or return wrong results. This prompt walks through the common errors.

How to use it

  1. Always include cardinality info.
  2. For range vectors, verify the window-to-scrape-interval ratio.
  3. Treat repeated queries as recording rule candidates.
  4. Audit dashboard refresh rates.
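
The window check in step 2 can be sketched in a few lines. This is a minimal illustration, assuming the common guidance of a window at least 4× the scrape interval. The function names and the sample estimate (one sample per series per scrape, no missed scrapes) are illustrative and not part of Prometheus.

```python
def min_rate_window(scrape_interval_s: int, factor: int = 4) -> int:
    """Smallest rate() window in seconds under the 'factor x scrape interval' rule."""
    return scrape_interval_s * factor


def samples_touched(series: int, window_s: int, scrape_interval_s: int) -> int:
    """Rough count of raw samples one evaluation step reads.

    Assumes one sample per series per scrape and ignores missed scrapes.
    """
    return series * (window_s // scrape_interval_s)


if __name__ == "__main__":
    print(min_rate_window(15))                 # window for a 15s scrape interval
    print(samples_touched(10_000, 300, 15))    # 10k series with a [5m] window
    print(samples_touched(10_000, 3600, 15))   # same series with a [1h] window
```

Under these assumptions the work grows linearly with both the series count and the window length, which is why widening a window on a high-cardinality metric is costly.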

Useful commands

# Cardinality
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks
prometheus_tsdb_symbol_table_size_bytes

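# Top metrics by series count (scans every series; expensive on large servers)
topk(20, count by (__name__)({__name__=~".+"}))

# Series per value of one label (substitute label_name and metric_name)
topk(20, count by (label_name)({__name__="metric_name"}))

# Number of distinct values of that label
count(count by (label_name)({__name__="metric_name"}))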

# Query performance
prometheus_engine_query_duration_seconds
prometheus_engine_queries_concurrent_max

# Rule evaluation cost
prometheus_rule_evaluations_total
prometheus_rule_evaluation_duration_seconds

# Query statistics (newer Prometheus; `explain` is not a documented query parameter)
curl 'http://prometheus:9090/api/v1/query?query=up&stats=all'
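
The cardinality queries above scan the whole head block. A cheaper starting point is often the TSDB status endpoint, `/api/v1/status/tsdb`, which reports the top metrics and labels directly. The sketch below ranks entries from that response. The `prometheus:9090` address comes from the curl example, the helper names are illustrative, and `SAMPLE_STATUS` is a hypothetical payload in the shape of the response rather than real data.

```python
import json
import urllib.request


def fetch_tsdb_status(base_url: str) -> dict:
    """GET /api/v1/status/tsdb and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/status/tsdb", timeout=10) as resp:
        return json.load(resp)


def top_entries(status: dict, field: str, n: int = 5) -> list:
    """Return the n largest (name, value) pairs from one list in the status payload."""
    entries = status["data"].get(field, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]


# Hypothetical payload for illustration only; real values come from the server.
SAMPLE_STATUS = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "up", "value": 120},
            {"name": "http_requests_total", "value": 48000},
        ],
        "labelValueCountByLabelName": [
            {"name": "job", "value": 12},
            {"name": "request_id", "value": 45000},
        ],
    },
}

if __name__ == "__main__":
    # Swap SAMPLE_STATUS for fetch_tsdb_status("http://prometheus:9090") on a live server.
    print(top_entries(SAMPLE_STATUS, "seriesCountByMetricName"))
    print(top_entries(SAMPLE_STATUS, "labelValueCountByLabelName"))
```

A label with tens of thousands of values, like the hypothetical `request_id` here, is the first candidate to drop or aggregate away.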

Optimization patterns

Before/after: sum order

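# WRONG: rate of sum. This is a parse error: rate() needs a range vector and sum() returns an instant vector.
# The subquery form rate(sum(http_requests_total)[5m:]) parses but mishandles counter resets.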
rate(sum(http_requests_total)[5m])

# RIGHT: sum of rates
sum(rate(http_requests_total[5m]))

# Grouped by job, when the dashboard needs a per-job breakdown
sum by (job)(rate(http_requests_total[5m]))
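
Why the order matters shows up as soon as one instance restarts. The simulation below is a simplified sketch: `simple_rate` handles counter resets the way `rate()` does but skips Prometheus's extrapolation to the window edges, and the two instances and their sample values are made up for illustration.

```python
def simple_rate(samples, span_seconds):
    """Per-second increase with counter-reset handling.

    Simplified: no extrapolation to the window edges, unlike Prometheus's rate().
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop is treated as a reset to zero, so the new value is the increase.
        increase += (cur - prev) if cur >= prev else cur
    return increase / span_seconds


# Hypothetical counters scraped every 60s: 6 samples spanning 300 seconds.
SPAN_SECONDS = 300
instance_a = [1000, 1060, 1120, 1180, 1240, 1300]  # steady 1 req/s
instance_b = [5000, 5060, 5120, 30, 90, 150]       # restarts after the third sample

# Sum of rates: each reset is detected on the series where it happened.
sum_of_rates = simple_rate(instance_a, SPAN_SECONDS) + simple_rate(instance_b, SPAN_SECONDS)

# Rate of sum: the restart makes the summed value drop, which looks like
# a reset of the whole sum and inflates the computed increase.
summed = [a + b for a, b in zip(instance_a, instance_b)]
rate_of_sum = simple_rate(summed, SPAN_SECONDS)

if __name__ == "__main__":
    print(f"sum of rates: {sum_of_rates:.2f} req/s")
    print(f"rate of sum:  {rate_of_sum:.2f} req/s")
```

With these made-up samples the real traffic is close to 2 req/s. The sum of rates stays near that figure, while the rate of the sum reports several times more because of the single restart.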

Recording rule for hot query

# In Prometheus config
rule_files:
- /etc/prometheus/rules/*.yaml

# rules/recording.yaml
groups:
- name: http
  interval: 30s
  rules:
  - record: job:http_requests:rate5m
    expr: sum by (job)(rate(http_requests_total[5m]))

Then in dashboards:

job:http_requests:rate5m

The dashboard now reads one precomputed series per job instead of recomputing the rate over every raw series on each refresh.
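
A rough cost model makes the saving concrete. This is a back-of-the-envelope sketch, not a measurement: it treats each dashboard load as a single evaluation step and counts only samples read. All numbers and function names are hypothetical.

```python
def raw_samples_per_hour(loads_per_hour, input_series, samples_per_window):
    """Samples read when every dashboard load recomputes the expression."""
    return loads_per_hour * input_series * samples_per_window


def recorded_samples_per_hour(loads_per_hour, input_series, samples_per_window,
                              rule_interval_s, output_series):
    """Samples read when a rule computes the expression once per interval."""
    evaluations = 3600 // rule_interval_s
    rule_cost = evaluations * input_series * samples_per_window
    read_cost = loads_per_hour * output_series  # one recorded sample per output series
    return rule_cost + read_cost


# Hypothetical workload: 5,000 input series, 20 samples per [5m] window
# (15s scrape), 1,200 dashboard loads per hour, 10 jobs, 30s rule interval.
raw = raw_samples_per_hour(1200, 5000, 20)
recorded = recorded_samples_per_hour(1200, 5000, 20, rule_interval_s=30, output_series=10)

if __name__ == "__main__":
    print(f"raw query:      {raw:,} samples/hour")
    print(f"recording rule: {recorded:,} samples/hour")
```

Real dashboards issue range queries with many steps per load, so the gap is usually wider than this single-step model suggests. The rule's cost is paid whether or not anyone is looking, which is why it only pays off for queries that are actually hot.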

Drop high-cardinality labels

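# At scrape time, under the scrape job's metric_relabel_configs.
# labeldrop matches label NAMES against regex; source_labels is not used.
# The remaining labels must still identify each series uniquely.
metric_relabel_configs:
- action: labeldrop
  regex: pod_uid|request_id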
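
The effect of dropping a label can be estimated before touching the scrape config. The sketch below counts distinct label sets with and without a label. The series list is synthetic and the helper name is illustrative.

```python
def series_count(series, drop=()):
    """Count distinct label sets after removing the labels named in `drop`."""
    return len({
        frozenset((k, v) for k, v in labels.items() if k not in drop)
        for labels in series
    })


# Synthetic data: 3 paths, each seen with 1,000 distinct request IDs.
series = [
    {"job": "api", "path": path, "request_id": f"req-{i}"}
    for path in ("/login", "/cart", "/pay")
    for i in range(1000)
]

if __name__ == "__main__":
    print(series_count(series))                        # every request is its own series
    print(series_count(series, drop={"request_id"}))   # one series per path
```

Note what the second number implies: many original series collapse onto the same label set. Prometheus does not merge them at scrape time, so this kind of drop is only safe when the application stops emitting the label or the remaining labels are still unique per series.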

Common findings this catches

  • Query timeout → reduce window, add aggregation, recording rule.
  • Range vector window too short → gaps or empty results for slow-scraping metrics (rate needs at least two samples in the window); widen.
  • sum() of counter without rate → meaningless (cumulative).
  • label_replace recomputed every refresh → recording rule.
  • Dashboard refresh too aggressive → 30s instead of 5s for non-critical dashboards.
  • High-card label hidden in derived metric → audit at source.
  • Multiple panels with same query → variable + reuse.
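
The dashboard findings come down to simple arithmetic on query rate. A minimal sketch with hypothetical numbers, assuming every open dashboard re-runs every panel query on each refresh:

```python
def dashboard_qps(panels, queries_per_panel, viewers, refresh_s):
    """Queries per second that open dashboards send to Prometheus."""
    return panels * queries_per_panel * viewers / refresh_s


if __name__ == "__main__":
    # 20 panels, 2 queries each, 10 people with the dashboard open.
    print(dashboard_qps(20, 2, 10, refresh_s=5))    # aggressive refresh
    print(dashboard_qps(20, 2, 10, refresh_s=30))   # relaxed refresh
```

Moving from a 5s to a 30s refresh cuts the query rate by a factor of six without touching a single query.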

When to escalate

  • TSDB sizing — capacity planning.
  • Cardinality reduction at app source — engage app team.
  • Federation / Thanos for global query — strategic.
