You are a senior SRE who has tuned PromQL queries at scale, from dashboard refresh delays to recording rule design.

I will provide:
- The slow query
- Prometheus version and head series count
- Symptom (timeout, slow rendering, OOM)
- Cardinality info (`prometheus_tsdb_head_series`)

Your job:

1. **Identify cardinality issues**:
   - High-cardinality labels (pod UID, request ID) multiply series
   - `count by ({label})` reveals series count per label value
   - Drop / aggregate high-card labels
2. **For range vectors**:
   - `rate(metric[5m])`: 5-min window
   - Larger window = more samples = slower
   - Match window to scrape interval × N (usually 4×)
3. **For aggregation order**:
   - `sum(rate(http_requests_total[5m]))`: correct order
   - `rate(sum(http_requests_total)[5m])`: WRONG (does not parse, because `sum` returns an instant vector and a range selector cannot follow it; the subquery form `rate(sum(...)[5m:])` parses but masks counter resets)
4. **For sum vs avg**:
   - `sum` adds values; for counters wrapped in `rate`, `sum` is correct
   - `avg` gives the per-series average; misleading when the total is what matters
5. **For label_replace / label_join**:
   - Expensive on high-card data
   - Cache via recording rule if reused
6. **For recording rules**:
   - Pre-compute frequently-queried expressions
   - Evaluated once per rule group evaluation interval, not on every dashboard load
   - Naming: `level:metric:operations` convention (`job:http_inprogress_requests:sum`)
7. **For query cost inspection**:
   - `/api/v1/query?query=...&stats=all` (recent versions)
   - Timings, samples processed
8. **For dashboard impact**:
   - Many panels × many queries × short refresh = Prometheus server overload
   - Use shared variables to dedupe queries

Mark DESTRUCTIVE: query timeout removal (Prometheus OOM), recording rules without retention adjustment (TSDB bloat), removing high-card labels without rebuilding alerts.

---

Slow query:
```promql
[PASTE]
```

Series count + cardinality info: [DESCRIBE]

Symptom: [DESCRIBE]

Why this prompt works

PromQL gotchas (range vector traps, aggregation order) produce queries that look correct but run slowly or return the wrong result. This prompt walks through the common errors.

How to use it

Always include cardinality info.
For range vectors, check the ratio of window to scrape interval.
For queries repeated across panels, consider a recording rule.
Audit dashboard refresh rates.
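
The window/scrape ratio check can be reduced to a small calculation. The sketch below assumes the 4× rule of thumb from the prompt; the function names `suggested_rate_window` and `promql_duration` are hypothetical helpers, not part of any Prometheus tooling.

```python
def suggested_rate_window(scrape_interval_s: float, factor: int = 4) -> float:
    """Return a range-vector window in seconds: scrape interval times factor."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    if factor < 2:
        # rate() needs at least two samples inside the window to return a value.
        raise ValueError("factor must be at least 2")
    return scrape_interval_s * factor


def promql_duration(seconds: float) -> str:
    """Format whole seconds as a PromQL duration such as '2m' or '90s'."""
    whole = int(seconds)
    if whole % 60 == 0:
        return f"{whole // 60}m"
    return f"{whole}s"


# A 30s scrape interval suggests rate(metric[2m]) under the 4x assumption.
window = promql_duration(suggested_rate_window(30))
print(f"rate(http_requests_total[{window}])")
```

A window much larger than this reads more samples per series than the rate needs; a window smaller than two scrape intervals may return nothing at all.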

Useful commands

# Cardinality
prometheus_tsdb_head_series
prometheus_tsdb_head_chunks
prometheus_tsdb_symbol_table_size_bytes

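# Top series count by metric (expensive: touches every series)
topk(20, count by (__name__)({__name__=~".+"}))

# Series per value of one label (substitute label_name and metric_name)
topk(20, count by (label_name)({__name__="metric_name"}))

# Number of distinct values of that label
count(count by (label_name)({__name__="metric_name"}))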

# Query performance
prometheus_engine_query_duration_seconds
prometheus_engine_queries_concurrent_max

# Rule evaluation cost
prometheus_rule_evaluations_total
prometheus_rule_evaluation_duration_seconds

# Query statistics: timings and sample counts (recent Prometheus)
curl 'http://prometheus:9090/api/v1/query?query=up&stats=all'
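
Running `count by (__name__)` over every series is itself a heavy query. As a cheaper alternative, Prometheus exposes head cardinality at `/api/v1/status/tsdb`. The sketch below ranks entries from such a response; the helper `top_entries` is hypothetical and the `SAMPLE` payload is invented for illustration, so check the field names against the response from your own server.

```python
from typing import Dict, List, Tuple


def top_entries(tsdb_status: Dict, field: str, n: int = 5) -> List[Tuple[str, int]]:
    """Return the n largest (name, value) pairs from one list in the TSDB status payload."""
    entries = tsdb_status["data"].get(field, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]


# Invented payload in the shape returned by GET /api/v1/status/tsdb.
SAMPLE = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "http_requests_total", "value": 61200},
            {"name": "up", "value": 340},
        ],
        "labelValueCountByLabelName": [
            {"name": "pod_uid", "value": 9120},
            {"name": "request_id", "value": 48210},
            {"name": "job", "value": 12},
        ],
    },
}

# Labels with the most distinct values are the first candidates to drop or aggregate.
for name, count in top_entries(SAMPLE, "labelValueCountByLabelName", n=2):
    print(f"{name}: {count} distinct values")
```

In practice the payload would come from an HTTP GET against the server rather than a literal; the ranking logic stays the same.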

Optimization patterns

Before/after: sum order

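# WRONG: rate of sum. Does not parse: sum() returns an instant vector and a
# range selector cannot follow it. The subquery form [5m:] parses but masks
# counter resets.
rate(sum(http_requests_total)[5m])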

# RIGHT: sum of rates
sum(rate(http_requests_total[5m]))

# Even better: pre-aggregate by job
sum by (job)(rate(http_requests_total[5m]))
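
The order matters for correctness as well as syntax: once counters are summed, a reset in one series looks like a reset of the whole total. The following is a simplified simulation, not Prometheus's actual algorithm (it ignores extrapolation and timestamps); `increase` here is a hypothetical stand-in that only models reset handling, and the sample values are made up.

```python
from typing import List


def increase(samples: List[float]) -> float:
    """Sum the growth between consecutive samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Counter restarted: everything counted since the restart is new growth.
            total += cur
        else:
            total += cur - prev
    return total


series_a = [100, 110, 120, 130]   # steady counter
series_b = [500, 510, 5, 15]      # process restarted between samples 2 and 3

# Correct order: handle resets per series, then add.
sum_of_increases = increase(series_a) + increase(series_b)

# Wrong order: add first, then look for resets in the total.
summed = [a + b for a, b in zip(series_a, series_b)]
increase_of_sum = increase(summed)

print(sum_of_increases)  # 55.0
print(increase_of_sum)   # 165.0
```

The summed series drops from 620 to 125, so the reset logic treats the full 125 as new growth and the result is three times the real increase.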

Recording rule for hot query

# In Prometheus config
rule_files:
- /etc/prometheus/rules/*.yaml

# rules/recording.yaml
groups:
- name: http
  interval: 30s
  rules:
  - record: job:http_requests_rate5m:sum
    expr: sum by (job)(rate(http_requests_total[5m]))

Then in dashboards:

job:http_requests_rate5m:sum

Much faster than computing on every refresh.
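
A rough load estimate shows where the saving comes from. This sketch assumes one query per panel, no result caching in front of Prometheus, and every viewer keeping the dashboard open; the function names and the example numbers are illustrative only.

```python
def dashboard_queries_per_minute(panels: int, viewers: int, refresh_s: float) -> float:
    """Queries per minute generated by open dashboards (one query per panel assumed)."""
    if refresh_s <= 0:
        raise ValueError("refresh interval must be positive")
    return panels * viewers * 60.0 / refresh_s


def rule_evaluations_per_minute(interval_s: float) -> float:
    """Evaluations per minute of a single recording rule."""
    if interval_s <= 0:
        raise ValueError("rule interval must be positive")
    return 60.0 / interval_s


# 20 panels, 10 viewers: compare a 5s refresh with a 30s refresh.
print(dashboard_queries_per_minute(20, 10, 5))    # 2400.0
print(dashboard_queries_per_minute(20, 10, 30))   # 400.0

# The expensive expression itself runs only twice a minute as a 30s rule.
print(rule_evaluations_per_minute(30))            # 2.0
```

The dashboards still issue their queries, but each one reads a single precomputed series per job instead of re-running the rate over every raw series.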

Drop high-cardinality labels

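# At scrape time, under the scrape job.
# labeldrop matches label names against regex; source_labels is not used.
# The remaining labels must still identify each series uniquely.
metric_relabel_configs:
- action: labeldrop
  regex: pod_uid|request_id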
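
Dropping a label is only safe if the remaining labels still tell the series apart; otherwise several samples collapse onto one identity within the same scrape. The check below is a minimal sketch over hand-written label sets; `collisions_after_drop` is a hypothetical helper and the series are invented examples.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


def drop_labels(series: Dict[str, str], to_drop: Set[str]) -> Dict[str, str]:
    """Return a copy of the label set without the dropped label names."""
    return {k: v for k, v in series.items() if k not in to_drop}


def collisions_after_drop(all_series: Iterable[Dict[str, str]], to_drop: Set[str]) -> List[LabelSet]:
    """Return label sets that more than one series would share after the drop."""
    counts = Counter(
        tuple(sorted(drop_labels(s, to_drop).items())) for s in all_series
    )
    return [labels for labels, n in counts.items() if n > 1]


risky = [
    {"__name__": "http_requests_total", "pod": "api-0", "request_id": "a1"},
    {"__name__": "http_requests_total", "pod": "api-0", "request_id": "b2"},
    {"__name__": "http_requests_total", "pod": "api-1", "request_id": "c3"},
]
safe = [
    {"__name__": "up", "pod": "api-0", "pod_uid": "u1"},
    {"__name__": "up", "pod": "api-1", "pod_uid": "u2"},
]

print(collisions_after_drop(risky, {"request_id"}))  # one colliding label set (pod="api-0")
print(collisions_after_drop(safe, {"pod_uid"}))      # []
```

When the check reports collisions, the label likely has to be removed or aggregated in the application itself rather than at scrape time.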

Common findings this catches

Query timeout → reduce window, add aggregation, recording rule.
Range vector window too short → no data for slow-scraping metrics (rate() needs at least two samples in the window); widen.
sum() of counter without rate → meaningless (cumulative).
label_replace recomputed every refresh → recording rule.
Dashboard refresh too aggressive → 30s instead of 5s for non-critical.
High-card label hidden in derived metric → audit at source.
Multiple panels with same query → variable + reuse.

When to escalate

TSDB sizing — capacity planning.
Cardinality reduction at app source — engage app team.
Federation / Thanos for global query — strategic.

Related prompts

Prometheus Performance Tuning Prompt

PromQL `rate()` vs `increase()` vs `irate()` Prompt

PromQL Recording Rules Design Prompt
