PromQL Query Optimization Prompt
Diagnose slow PromQL queries: cardinality explosion, range vector traps, sum vs avg pitfalls, query timeouts, and recording rule opportunities.
- Target user: SREs and platform engineers writing PromQL
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has tuned PromQL queries at scale — from dashboard refresh delays to recording rule design.
I will provide:
- The slow query
- Prometheus version and head series count
- Symptom (timeout, slow rendering, OOM)
- Cardinality info (`prometheus_tsdb_head_series`)
Your job:
1. **Identify cardinality issues**:
- High-cardinality labels (pod UID, request ID) multiply series
- `count by (<label>) (<metric>)` reveals the series count per value of that label
- Drop / aggregate high-card labels
2. **For range vectors**:
- `rate(metric[5m])` — 5-min window
- Larger window = more samples = slower
- Set the window to at least 4× the scrape interval so `rate` still has enough samples after a missed scrape
3. **For aggregation order**:
- `sum(rate(http_requests_total[5m]))` — correct order
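- `rate(sum(http_requests_total)[5m])` is WRONG: it does not parse without subquery syntax (`[5m:]`), and taking `rate` of a sum hides counter resets in the individual series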
4. **For sum vs avg**:
- `sum` adds values across series; for counters, `sum` of per-series `rate` gives total throughput
- `avg` gives the per-series mean; misleading on counters because it shifts as the number of series changes
5. **For label_replace / label_join**:
- Expensive on high-card data
- Cache via recording rule if reused
6. **For recording rules**:
- Pre-compute frequently-queried expressions
- Evaluated once per rule evaluation interval, not on every dashboard load
- Naming: `level:metric:operations` convention (`job:http_inprogress_requests:sum`)
7. **For query plan inspection**:
- `/api/v1/query?query=...&stats=all` returns query statistics alongside the result
- Check the `stats` field for phase timings and samples processed
8. **For dashboard impact**:
- Many panels × many queries × short refresh = Prometheus query overload
- Use shared variables to dedupe queries
Mark DESTRUCTIVE: query timeout removal (Prometheus OOM), recording rules without retention adjustment (TSDB bloat), removing high-card labels without rebuilding alerts.
---
Slow query:
```promql
[PASTE]
```
Series count + cardinality info: [DESCRIBE]
Symptom: [DESCRIBE]
Why this prompt works
PromQL gotchas (range vector traps, sum order) cause slow queries that look correct. This prompt walks the common errors.
How to use it
- Always include cardinality info.
- For range vectors, verify window/scrape ratio.
- Treat repeated queries as recording rule candidates.
- Audit dashboard refresh rates.
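The window/scrape check above can be sketched as two separate queries. This is a minimal sketch, assuming Prometheus scrapes itself (so its own `prometheus_target_interval_length_seconds` summary is available), a 15s scrape interval, and `http_requests_total` as a stand-in metric name:

```promql
# Median observed time between scrapes, grouped by configured interval
prometheus_target_interval_length_seconds{quantile="0.5"}

# With a 15s scrape interval, 4 x 15s = 1m is a reasonable minimum window
rate(http_requests_total[1m])
```

Run each query on its own. If the first shows 60s rather than 15s, the same rule suggests a window of at least 4m.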
Useful commands
# Cardinality
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks
prometheus_tsdb_symbol_table_size_bytes
# Top series count by metric (Prometheus API)
topk(20, count by (__name__)({__name__=~".+"}))
# Top high-card labels
topk(20, count by (label_name)({__name__="metric_name"}))
# Query performance
prometheus_engine_query_duration_seconds
prometheus_engine_queries_concurrent_max
# Rule evaluation cost
prometheus_rule_evaluations_total
prometheus_rule_evaluation_duration_seconds
# Query statistics (timings, samples) returned with the result
curl 'http://prometheus:9090/api/v1/query?query=up&stats=all'
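Two further queries may help narrow things down once the raw counts above look high. They are a sketch only: the `pod` label and the `http_requests_total` metric are placeholders for whatever label and metric are under suspicion.

```promql
# Number of distinct values of the `pod` label on one metric
count(count by (pod) (http_requests_total))

# 90th percentile of time spent in the evaluation phase of queries
prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.9"}
```

The nested `count` first collapses the metric to one series per label value, then counts those series. A result in the thousands for a single label usually points at the cardinality source.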
Optimization patterns
Before/after: sum order
# WRONG: rate of sum (parse error without a subquery; a summed counter hides per-series resets)
rate(sum(http_requests_total)[5m])
# RIGHT: sum of rates
sum(rate(http_requests_total[5m]))
# Even better: pre-aggregate by job
sum by (job)(rate(http_requests_total[5m]))
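The same "rate first, aggregate second" shape applies when choosing between `sum` and `avg`. A small sketch, reusing `http_requests_total` and assuming each job runs several instances:

```promql
# Total request rate per job (adds up every instance)
sum by (job) (rate(http_requests_total[5m]))

# Mean request rate per instance within each job
avg by (job) (rate(http_requests_total[5m]))
```

Both are valid queries that answer different questions. `sum` tracks overall traffic, while `avg` drops when instances are added even though traffic is unchanged.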
Recording rule for hot query
# In Prometheus config
rule_files:
  - /etc/prometheus/rules/*.yaml
# rules/recording.yaml
groups:
  - name: http
    interval: 30s
    rules:
      - record: job:http_requests_rate5m:sum
        expr: sum by (job)(rate(http_requests_total[5m]))
Then in dashboards:
job:http_requests_rate5m:sum
Much faster than computing on every refresh.
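One way to confirm the gain is to compare how many series each form has to read. This sketch assumes the rule above has been loaded and has evaluated at least once:

```promql
# Series the original expression has to read on every refresh
count(http_requests_total)

# Series read once the dashboard uses the recorded metric (one per job)
count(job:http_requests_rate5m:sum)
```

The larger the gap between the two counts, the more work the rule saves per panel refresh.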
Drop high-cardinality labels
# At scrape time (metric_relabel_configs)
# labeldrop matches its regex against label names; it does not use source_labels
metric_relabel_configs:
  - action: labeldrop
    regex: pod_uid|request_id
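Before dropping labels it is worth checking whether the remaining labels still identify each series uniquely. A minimal sketch, using `http_requests_total` as a placeholder for the affected metric:

```promql
# Groups of series that differ only by the labels about to be dropped
count without (pod_uid, request_id) (http_requests_total) > 1
```

Any result means several series would collapse onto the same label set after the drop. In that case aggregating in a recording rule, or removing the label in the application, is likely safer than dropping it at scrape time.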
Common findings this catches
- Query timeout → reduce window, add aggregation, recording rule.
- Range vector window too short → empty results or gaps for slow-scraping metrics (`rate` needs at least two samples in the window); widen.
- `sum()` of counter without `rate` → meaningless (cumulative).
- `label_replace` recomputed every refresh → recording rule.
- Dashboard refresh too aggressive → 30s instead of 5s for non-critical.
- High-card label hidden in derived metric → audit at source.
- Multiple panels with same query → variable + reuse.
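For the `label_replace` finding, aggregating before rewriting labels keeps the function off the raw series. A sketch under assumptions: the `host` target label and the `host:port` shape of `instance` are illustrative, not taken from any particular setup.

```promql
# Aggregate first so label_replace runs on a few series, not every raw one
label_replace(
  sum by (job, instance) (rate(http_requests_total[5m])),
  "host", "$1", "instance", "([^:]+):.*"
)
```

If several panels reuse this expression, it is a natural candidate for the `expr` of a recording rule.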
When to escalate
- TSDB sizing — capacity planning.
- Cardinality reduction at app source — engage app team.
- Federation / Thanos for global query — strategic.
Related prompts
- Prometheus Performance Tuning Prompt: Tune Prometheus performance (head series, memory, query timeout, max samples, ingestion rate, expensive queries).
- PromQL `rate()` vs `increase()` vs `irate()` Prompt: Use Prometheus counter functions correctly (rate vs increase vs irate, counter resets, window size choice).
- PromQL Recording Rules Design Prompt: Design Prometheus recording rules (naming convention, evaluation interval, when to use, retention, multi-cluster patterns).