You are a senior platform engineer who has tuned Prometheus to handle millions of series per instance.

I will provide:
- Symptom (OOM, query slow, scrape lag, disk full)
- Hardware
- Current key metrics

Your job:

1. **Sizing**:
   - Head series × per-series overhead = memory
   - Bytes/sample × samples/sec × retention = disk
   - Typical: 1.5 bytes/sample compressed

2. **For OOM**:
   - High cardinality → main driver
   - Reduce labels, drop noisy metrics
   - Limit query memory (`--query.max-samples`)
   - HORIZONTALLY scale (shard) if vertical maxed

3. **For query slow**:
   - Recording rules
   - Reduce window
   - Drop labels

4. **For ingestion lag**:
   - Scrape queue backed up
   - Increase scrape interval per job (scrape less often)
   - Distribute targets across Proms

5. **For disk fast-filling**:
   - Cardinality
   - Retention
   - Compaction lag

6. **For tunable flags**:
   - `--storage.tsdb.retention.time`
   - `--storage.tsdb.retention.size`
   - `--query.timeout` (default 2m)
   - `--query.max-samples` (default 50M)
   - `--query.max-concurrency` (default 20)
   - `--storage.tsdb.head-chunks-write-buffer-size`

7. **For scaling**:
   - Vertical: bigger machine
   - Horizontal: multiple Prometheuses with sharded scrape
   - Federation for global view

8. **For monitoring Prometheus itself**:
   - Use a separate Prometheus or self-scrape
   - Watch for trending growth

Mark DESTRUCTIVE: extreme retention reduction (history loss), removing query timeout (runaway queries can OOM Prometheus), aggressive scrape interval reduction (overwhelm targets).

---

Symptom: [DESCRIBE]
Hardware: [DESCRIBE]
Key metrics:
```
[PASTE prometheus_tsdb_head_series, etc.]
```

Why this prompt works

Tuning Prometheus at scale requires knowing which knobs exist and what each one trades off. This prompt walks through them symptom by symptom.

How to use it

Audit cardinality first.
Match retention to actual need.
For OOM, drop high-cardinality labels.
For scale beyond one machine, shard horizontally.
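
The label-dropping step could be sketched as the scrape config below. The job name, target, the `debug_.*` metric pattern, the `request_id` label, and the `sample_limit` value are all hypothetical placeholders; substitute whatever the cardinality audit actually surfaces.

```yaml
scrape_configs:
- job_name: app                      # hypothetical job
  sample_limit: 50000                # assumed ceiling; the whole scrape fails if a target exceeds it
  static_configs:
  - targets: ['app:8080']            # hypothetical target
  metric_relabel_configs:
  # Drop entire metrics that nobody queries (assumed name pattern).
  - source_labels: [__name__]
    regex: 'debug_.*'
    action: drop
  # Remove one high-cardinality label from every remaining series.
  - regex: 'request_id'
    action: labeldrop
```

Note that `labeldrop` can leave two series with identical label sets if the dropped label was the only thing distinguishing them, so check the result on a non-production instance first.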

Useful commands

# Memory
prometheus_tsdb_head_series                          # main driver
prometheus_tsdb_head_chunks
process_resident_memory_bytes

# Top metrics by series count
topk(20, count by (__name__)({__name__=~".+"}))

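# Top labels by cardinality
# Whole-head ranking: GET /api/v1/status/tsdb, field labelValueCountByLabelName
# Per-metric check: number of distinct values of one label (here: instance on up)
count(count by (instance) (up))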

# Query performance
rate(prometheus_engine_query_duration_seconds_count[5m])    # queries per second
rate(prometheus_engine_query_duration_seconds_sum[5m]) / rate(prometheus_engine_query_duration_seconds_count[5m])    # mean seconds per query

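# Scrape timing
prometheus_target_interval_length_seconds            # actual interval between scrapes (lag shows up as drift)
scrape_duration_seconds                              # how long each scrape took, per target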
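
To watch these numbers for trending growth rather than checking them by hand, a rule file along the following lines may help. It is a sketch only: the group name, alert name, the 5 million series ceiling, and the `severity` label are assumptions to be tuned to the hardware at hand.

```yaml
groups:
- name: prometheus-self-monitoring        # hypothetical group name
  rules:
  - alert: PrometheusHeadSeriesGrowing    # hypothetical alert name
    # Linear extrapolation of the last 6h of head series, 24h into the future.
    expr: predict_linear(prometheus_tsdb_head_series[6h], 24 * 3600) > 5e6
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Head series projected to exceed 5M within 24h"
```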

Tuning checklist

# Long-term retention via remote write to Thanos/Mimir
remote_write:
- url: http://thanos-receive:19291/api/v1/receive

global:
  scrape_interval: 30s            # reasonable starting point; Prometheus's built-in default is 1m
  evaluation_interval: 30s

# Scrape configs distributed across Prometheus instances if sharded

# Flags
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=200GB
--query.timeout=2m
--query.max-samples=50000000
--query.max-concurrency=20
--storage.tsdb.head-chunks-write-buffer-size=8MB
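
As a rough cross-check of the retention flags, the disk rule from the prompt (bytes/sample × samples/sec × retention) can be worked through: at an assumed 100,000 samples/sec and 1.5 bytes/sample, 15 days (1,296,000 seconds) comes to about 194 GB, so a 200GB size cap would leave little headroom at that ingest rate. The same estimate can be recorded continuously; the group and rule names below are hypothetical, and the 1.5 and 15-day constants are assumptions to adjust.

```yaml
groups:
- name: prometheus-sizing                 # hypothetical group name
  rules:
  - record: instance:prometheus_tsdb_estimated_disk_bytes   # hypothetical rule name
    # samples/sec (1h average) x 1.5 bytes/sample x 15 days in seconds
    expr: sum by (instance) (rate(prometheus_tsdb_head_samples_appended_total[1h])) * 1.5 * 15 * 86400
```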

Sharding pattern

# Two Prometheus instances scraping disjoint sets
# Prometheus 1:
scrape_configs:
- job_name: nodes-shard-1
  static_configs:
  - targets: [node1, node3, node5, ...]

# Prometheus 2:
scrape_configs:
- job_name: nodes-shard-2
  static_configs:
  - targets: [node2, node4, node6, ...]

Or use hashmod, so each instance keeps only the targets whose address hashes to its own shard number:

relabel_configs:
- source_labels: [__address__]
  modulus: 2
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: 0                       # this Prom: shard 0
  action: keep
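
Once targets are split by hash, no single instance holds every series, so the global view mentioned in the prompt needs a separate step. One option is a federation job on a global Prometheus, sketched below. The shard hostnames and the `job:.*` match pattern (which assumes recording rules named with a `job:` prefix) are hypothetical.

```yaml
scrape_configs:
- job_name: federate
  honor_labels: true               # keep the job/instance labels set by the shards
  metrics_path: /federate
  params:
    'match[]':
    - '{__name__=~"job:.*"}'       # pull only pre-aggregated series (assumed naming)
  static_configs:
  - targets: ['prom-shard-0:9090', 'prom-shard-1:9090']   # hypothetical shard addresses
```

Federating only aggregated series keeps the global instance from inheriting the cardinality problem the shards were created to solve.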

Common findings this catches

Memory pegged → cardinality reduction.
Slow queries on dashboards → recording rules.
Long WAL replay on restart → reduce head series; consider the memory-snapshot-on-shutdown feature flag.
Single Prom can’t keep up → shard.
Frequent OOM → cardinality leak.
Cardinality growth correlated with deploy → label added to metric.
Disk fills → retention drop.
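
For the slow-dashboard finding, a recording rule precomputes the expensive expression once per evaluation so panels read a single cheap series. A minimal sketch, assuming a counter named `http_requests_total` with a `job` label (both hypothetical here):

```yaml
groups:
- name: dashboard-precompute             # hypothetical group name
  rules:
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))
```

Dashboards would then query `job:http_requests:rate5m` directly instead of re-running `rate()` over raw series on every refresh.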

When to escalate

Major architecture change (sharding, Thanos adoption) — strategic.
Cardinality reduction across many apps — coordination.
Hardware sizing — capacity planning.

Related prompts

Prometheus Remote Write & Long-term Storage Prompt

Prometheus Storage, Retention & TSDB Prompt

PromQL Query Optimization Prompt