Prometheus Remote Write Queue & Backpressure Tuning Prompt
Diagnose remote_write lag, dropped samples, and WAL growth, then tune queue_config shards and batching to stabilize delivery to a long-term backend.
- Target user: SREs operating Prometheus or vmagent remote-write pipelines
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior reliability engineer who tunes Prometheus remote_write pipelines under heavy load.

I will provide:
- prometheus_remote_storage_* metrics: samples_pending, samples_dropped_total, shards, shards_desired, shards_max, sent_batch_duration_seconds
- prometheus_tsdb_wal_* growth and disk pressure signals
- The remote backend (Mimir, Thanos Receive, VictoriaMetrics, or vendor) and any 429/5xx rates
- Current queue_config and our ingestion volume

Your job:
1. **Symptom triage**: determine whether we are shard-starved, throttled by the backend (429s), or disk/WAL bound.
2. **Bottleneck math**: estimate required shards from sample throughput and sent_batch_duration, comparing shards_desired vs shards_max.
3. **Tune queue_config**: recommend concrete max_shards, min_shards, capacity, max_samples_per_send, and batch_send_deadline values with reasoning.
4. **Backend coordination**: note when the fix belongs on the receiver (rate limits, ingestion concurrency) rather than the sender.
5. **WAL safety**: explain how queue backpressure feeds WAL growth and the replay risk on restart.
6. **Validation**: define the dashboard panels and alerts (samples_pending sustained, dropped_total > 0) to confirm recovery.

Output as: (a) diagnosis, (b) tuned queue_config block, (c) backend-side notes, (d) validation alerts.

Never recommend raising max_samples_per_send blindly if the backend is already returning 429s, since that worsens throttling.
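
To sanity-check the shard estimate the prompt asks for in step 2, the arithmetic can be sketched roughly as below. This is a simplified steady-state model, not Prometheus's internal resharding calculation. It assumes each shard sends one batch at a time, so a shard's throughput is about max_samples_per_send divided by the average batch duration. The function names and the `headroom` parameter are illustrative assumptions, and the input values are made up.

```python
import math


def estimate_required_shards(samples_per_second: float,
                             avg_batch_duration_seconds: float,
                             max_samples_per_send: int,
                             headroom: float = 1.0) -> int:
    """Rough steady-state shard estimate (a simplification, see assumptions).

    avg_batch_duration_seconds can be approximated from the
    sent_batch_duration_seconds histogram as rate(_sum) / rate(_count).
    headroom is a hypothetical safety multiplier (1.0 = no headroom).
    """
    if samples_per_second < 0:
        raise ValueError("samples_per_second must be non-negative")
    if avg_batch_duration_seconds <= 0 or max_samples_per_send <= 0:
        raise ValueError("batch duration and batch size must be positive")
    if headroom < 1.0:
        raise ValueError("headroom must be >= 1.0")

    # Assumption: one in-flight batch per shard, so each shard delivers
    # at most this many samples per second.
    per_shard_throughput = max_samples_per_send / avg_batch_duration_seconds

    # Never report fewer than one shard, even for an idle pipeline.
    return max(1, math.ceil(samples_per_second * headroom / per_shard_throughput))


def is_shard_starved(shards_desired: float, shards_max: int) -> bool:
    """True when the queue wants more shards than max_shards allows."""
    return shards_desired > shards_max


if __name__ == "__main__":
    # Made-up inputs: 100k samples/s, 0.5 s per batch, 2000 samples per batch.
    needed = estimate_required_shards(100_000, 0.5, 2_000, headroom=1.5)
    print(needed, is_shard_starved(shards_desired=needed, shards_max=30))
    # prints: 38 True
```

If the estimate lands well above shards_max while the backend is healthy, that points at shard starvation. If batch duration is inflated by 429s or slow responses, the same arithmetic will overstate the need for shards, which is why triage in step 1 comes first.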