Skip to content
CloudOps
Newsletter
All prompts
AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Prometheus Performance Tuning Prompt

Tune Prometheus performance — head series, memory, query timeout, max samples, ingestion rate, expensive queries.

Target user
Platform engineers tuning Prometheus under load
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has tuned Prometheus to handle millions of series per instance.

I will provide:
- Symptom (OOM, query slow, scrape lag, disk full)
- Hardware
- Current key metrics

Your job:

1. **Sizing**:
   - Head series × overhead = memory
   - Bytes/sample × samples/sec × retention = disk
   - Typical: 1.5 bytes/sample compressed
2. **For OOM**:
   - High cardinality → main driver
   - Reduce labels, drop noisy metrics
   - Limit query memory (`--query.max-samples`)
   - HORIZONTALLY scale (shard) if vertical maxed
3. **For query slow**:
   - Recording rules
   - Reduce window
   - Drop labels
4. **For ingestion lag**:
   - Scrape queue backed up
   - Reduce scrape interval per job
   - Distribute targets across Proms
5. **For disk fast-filling**:
   - Cardinality
   - Retention
   - Compaction lag
6. **For tunable flags**:
   - `--storage.tsdb.retention.time`
   - `--storage.tsdb.retention.size`
   - `--query.timeout` (default 2m)
   - `--query.max-samples` (default 50M)
   - `--query.max-concurrency` (default 20)
   - `--storage.tsdb.head-chunks-write-buffer-size`
7. **For scaling**:
   - Vertical: bigger machine
   - Horizontal: multiple Prometheuses with sharded scrape
   - Federation for global view
8. **For monitoring Prometheus itself**:
   - Use a separate Prometheus or self-scrape
   - Watch for trending growth

Mark DESTRUCTIVE: extreme retention reduction (history loss), removing query timeout (apiserver OOM), aggressive scrape interval reduction (overwhelm targets).

---

Symptom: [DESCRIBE]
Hardware: [DESCRIBE]
Key metrics:
```
[PASTE prometheus_tsdb_head_series, etc.]
```

Why this prompt works

Tuning at scale requires knowing knobs. This prompt walks them.

How to use it

  1. Audit cardinality first.
  2. Match retention to need.
  3. For OOM, drop labels.
  4. For scale, horizontal.

Useful commands

# Memory
prometheus_tsdb_head_series                          # main driver
prometheus_tsdb_head_chunks
process_resident_memory_bytes

# Top metrics by series count
topk(20, count by (__name__)({__name__=~".+"}))

# Top labels by cardinality
sort_desc(label_replace(prometheus_tsdb_head_series, "metric", "$1", "__name__", "(.+)"))

# Query performance
prometheus_engine_query_duration_seconds_count
prometheus_engine_query_duration_seconds_sum / prometheus_engine_query_duration_seconds_count

# Scrape duration
prometheus_target_interval_length_seconds

Tuning checklist

# Long-term retention via remote write to Thanos/Mimir
remote_write:
- url: http://thanos-receive:19291/api/v1/receive

global:
  scrape_interval: 30s            # default reasonable
  evaluation_interval: 30s

# Scrape configs distributed across Prometheus instances if sharded
# Flags
--storage.tsdb.retention.time=15d
--storage.tsdb.retention.size=200GB
--query.timeout=2m
--query.max-samples=50000000
--query.max-concurrency=20
--storage.tsdb.head-chunks-write-buffer-size=8MB

Sharding pattern

# Two Prometheus instances scraping disjoint sets
# Prometheus 1:
scrape_configs:
- job_name: nodes-shard-1
  static_configs:
  - targets: [node1, node3, node5, ...]

# Prometheus 2:
- job_name: nodes-shard-2
  static_configs:
  - targets: [node2, node4, node6, ...]

Or use hashmod:

relabel_configs:
- source_labels: [__address__]
  modulus: 2
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: 0                       # this Prom: shard 0
  action: keep

Common findings this catches

  • Memory pegged → cardinality reduction.
  • Slow queries on dashboards → recording rules.
  • Long WAL replay on restart → compact more often.
  • Single Prom can’t keep up → shard.
  • Frequent OOM → cardinality leak.
  • Cardinality growth correlated with deploy → label added to metric.
  • Disk fills → retention drop.

When to escalate

  • Major architecture change (sharding, Thanos adoption) — strategic.
  • Cardinality reduction across many apps — coordination.
  • Hardware sizing — capacity planning.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week