Prometheus Performance Tuning Prompt
Tune Prometheus performance — head series, memory, query timeout, max samples, ingestion rate, expensive queries.
- Target user: Platform engineers tuning Prometheus under load
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has tuned Prometheus to handle millions of series per instance.

I will provide:
- Symptom (OOM, query slow, scrape lag, disk full)
- Hardware
- Current key metrics

Your job:

1. **Sizing**:
   - Head series × overhead = memory
   - Bytes/sample × samples/sec × retention = disk
   - Typical: 1.5 bytes/sample compressed
2. **For OOM**:
   - High cardinality → main driver
   - Reduce labels, drop noisy metrics
   - Limit query memory (`--query.max-samples`)
   - HORIZONTALLY scale (shard) if vertical maxed
3. **For query slow**:
   - Recording rules
   - Reduce the query range window
   - Drop labels
4. **For ingestion lag**:
   - Scrape queue backed up
   - Increase the scrape interval per job (scrape less often)
   - Distribute targets across Proms
5. **For disk fast-filling**:
   - Cardinality
   - Retention
   - Compaction lag
6. **For tunable flags**:
   - `--storage.tsdb.retention.time`
   - `--storage.tsdb.retention.size`
   - `--query.timeout` (default 2m)
   - `--query.max-samples` (default 50M)
   - `--query.max-concurrency` (default 20)
   - `--storage.tsdb.head-chunks-write-buffer-size`
7. **For scaling**:
   - Vertical: bigger machine
   - Horizontal: multiple Prometheuses with sharded scrape
   - Federation for global view
8. **For monitoring Prometheus itself**:
   - Use a separate Prometheus or self-scrape
   - Watch for trending growth

Mark DESTRUCTIVE: extreme retention reduction (history loss), removing query timeout (runaway queries can OOM Prometheus), aggressive scrape interval reduction (overwhelm targets).

---

Symptom: [DESCRIBE]
Hardware: [DESCRIBE]
Key metrics:
```
[PASTE prometheus_tsdb_head_series, etc.]
```
Why this prompt works
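Tuning at scale means knowing which knob addresses which symptom. This prompt walks through them in order: sizing, OOM, slow queries, ingestion lag, disk growth, flags, and scaling.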
How to use it
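- Audit cardinality first; it drives both memory and disk.
- Match retention to what you actually query.
- For OOM, drop high-cardinality labels and noisy metrics before adding memory.
- For scale beyond one machine, shard horizontally.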
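The "drop labels" step can be sketched as scrape-time relabeling. This is a minimal example, not a recommended policy: the job name, target, metric name, label name, and limit value below are all hypothetical placeholders to replace with whatever your cardinality audit turns up.

```yaml
scrape_configs:
  - job_name: app                      # hypothetical job name
    sample_limit: 50000                # hypothetical guard: the scrape fails if the target exposes more samples
    static_configs:
      - targets: ["app:8080"]          # hypothetical target
    metric_relabel_configs:
      # Drop an entire noisy metric family before it reaches the TSDB.
      - source_labels: [__name__]
        regex: "http_request_duration_seconds_bucket"   # hypothetical metric
        action: drop
      # Remove one high-cardinality label from every series in this job.
      - regex: "request_id"            # hypothetical label name
        action: labeldrop
```

Be careful with `labeldrop`: if the remaining labels no longer identify each series uniquely, the scrape will contain duplicate series. Check that the label you remove is not the only thing distinguishing two series.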
Useful commands
# Memory
prometheus_tsdb_head_series # main driver
prometheus_tsdb_head_chunks
process_resident_memory_bytes
# Top metrics by series count
topk(20, count by (__name__)({__name__=~".+"}))
# Label cardinality: distinct values of one label (pod is an example label name)
count(count by (pod) ({__name__=~".+"}))
# Per-label value counts are also available from the TSDB status API: GET /api/v1/status/tsdb
# Query performance
prometheus_engine_query_duration_seconds_count
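# Average query duration over the last 5m (rate() avoids averaging over the whole process lifetime)
rate(prometheus_engine_query_duration_seconds_sum[5m]) / rate(prometheus_engine_query_duration_seconds_count[5m])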
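# Scrape health
prometheus_target_interval_length_seconds # actual interval between scrapes (drift indicates lag)
scrape_duration_seconds # how long each target takes to scrape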
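To turn the sizing formula (bytes/sample × samples/sec × retention) and the "watch for trending growth" advice into something Prometheus evaluates continuously, a rules file along these lines is one option. It is a sketch under stated assumptions: the rule and alert names are made up, 1.5 bytes/sample and 15d retention are the figures used elsewhere on this page, and the 2,000,000-series threshold is an arbitrary placeholder.

```yaml
groups:
  - name: prometheus_self_monitoring   # hypothetical group name
    rules:
      # Estimated disk need = samples/sec x bytes/sample x retention seconds.
      # Assumes 1.5 bytes/sample and 15d retention; adjust both to your setup.
      - record: prometheus:tsdb_disk_needed_bytes:estimate
        expr: sum without (type) (rate(prometheus_tsdb_head_samples_appended_total[1h])) * 1.5 * 15 * 86400
      # Warn when the 6h trend projects head series past a chosen ceiling within 24h.
      - alert: PrometheusHeadSeriesGrowth   # hypothetical alert name
        expr: predict_linear(prometheus_tsdb_head_series[6h], 86400) > 2000000   # placeholder threshold
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Head series projected to exceed the configured ceiling within 24h"
```

Compare the recorded estimate against actual disk usage after a few days; a large gap usually means the bytes/sample assumption does not hold for your data.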
Tuning checklist
# Long-term retention via remote write to Thanos/Mimir
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
global:
  scrape_interval: 30s # reasonable starting point (Prometheus default is 1m)
  evaluation_interval: 30s
# Scrape configs distributed across Prometheus instances if sharded
# Flags
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=200GB
--query.timeout=2m
--query.max-samples=50000000
--query.max-concurrency=20
--storage.tsdb.head-chunks-write-buffer-size=8MB
Sharding pattern
# Two Prometheus instances scraping disjoint sets
# Prometheus 1:
scrape_configs:
  - job_name: nodes-shard-1
    static_configs:
      - targets: [node1, node3, node5, ...]
# Prometheus 2:
scrape_configs:
  - job_name: nodes-shard-2
    static_configs:
      - targets: [node2, node4, node6, ...]
Or use hashmod:
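relabel_configs:
  # Hash each target address into one of 2 buckets
  - source_labels: [__address__]
    modulus: 2
    target_label: __tmp_hash
    action: hashmod
  # Keep only this instance's bucket; the other instance uses regex: 1
  - source_labels: [__tmp_hash]
    regex: 0 # this Prom: shard 0
    action: keep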
Common findings this catches
- Memory pegged → cardinality reduction.
- Slow queries on dashboards → recording rules.
- Long WAL replay on restart → shrink the head (fewer series, less churn); replay time tracks WAL size.
- Single Prom can’t keep up → shard.
- Frequent OOM → cardinality leak.
- Cardinality growth correlated with deploy → label added to metric.
- Disk fills → shorten retention or cap it with `--storage.tsdb.retention.size`.
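For the "slow queries on dashboards → recording rules" finding, the fix usually ends up as a rules file like the sketch below. The group name, the recorded series name, and `http_requests_total` are illustrative assumptions, not names taken from any particular setup.

```yaml
groups:
  - name: dashboard_precompute         # hypothetical group name
    interval: 1m                       # how often the expression is evaluated
    rules:
      # Precompute the per-job request rate so dashboards read one cheap series
      # instead of aggregating every raw series on each refresh.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))   # hypothetical source metric
```

Reference the file under `rule_files:` in the Prometheus config, then point dashboard panels at `job:http_requests:rate5m` rather than the raw expression.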
When to escalate
- Major architecture change (sharding, Thanos adoption) — strategic.
- Cardinality reduction across many apps — coordination.
- Hardware sizing — capacity planning.
Related prompts
- Prometheus Remote Write & Long-term Storage Prompt: Configure remote write to long-term storage (Thanos Receive, Cortex/Mimir, VictoriaMetrics) and troubleshoot queue, backlog, and back-pressure.
- Prometheus Storage, Retention & TSDB Prompt: Configure Prometheus TSDB (retention, block size, compaction, WAL, disk sizing) and troubleshoot OOM and disk-full.
- PromQL Query Optimization Prompt: Diagnose slow PromQL queries (cardinality explosion, range vector traps, sum vs avg pitfalls, query timeout, recording rule opportunities).