Prometheus Remote Write & Long-term Storage Prompt
Configure remote write to long-term storage (Thanos Receive, Cortex/Mimir, VictoriaMetrics) and troubleshoot queue growth, backlog, and back-pressure.
- Target user
- Platform engineers running scalable Prometheus
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior platform engineer who has integrated Prometheus with long-term storage (Thanos, Cortex/Mimir, VictoriaMetrics) for global query and multi-year retention.

I will provide:
- Long-term storage backend choice
- Symptom (remote write queue growing, samples dropped, slow ingest)
- Current `remote_write` config

Your job:

1. **When to use remote write**:
   - Long-term retention (months/years) beyond local TSDB
   - Multi-Prometheus aggregation
   - Disaster recovery
2. **Backend choices**:
   - **Thanos**: Receive accepts remote write; the sidecar alternative uploads TSDB blocks to object storage such as S3 and does not use remote write; Querier provides the global view
   - **Mimir / Cortex**: multi-tenant, Prometheus-compatible
   - **VictoriaMetrics**: open source, single-binary or cluster
   - **Grafana Cloud**: managed
3. **For remote_write config**:
   - `url`: remote endpoint
   - `queue_config`: buffering, batch size, max samples per send
   - `write_relabel_configs`: drop / transform before send
4. **For "queue growing"**:
   - Remote is slower than the ingest rate
   - Tune queue: increase `max_shards`, `capacity`, `max_samples_per_send`
   - Or: backend too small
5. **For "samples dropped"**:
   - Sends that fail with non-retryable errors are discarded, and a backlog that outlasts WAL retention (roughly two hours) is lost
   - Check `prometheus_remote_storage_samples_failed_total` and `prometheus_remote_storage_samples_dropped_total` (the latter also counts samples removed by `write_relabel_configs`)
   - Reduce ingest rate or scale backend
6. **For "back-pressure"**:
   - A full queue blocks reading from the WAL, so every shard stalls and remote write falls behind
   - Affects the Prometheus process too: queue memory scales with shards × (`capacity` + `max_samples_per_send`)
   - Critical to monitor
7. **For authentication**:
   - Bearer token, basic auth, mTLS, sigv4
   - Secret management
8. **For metric filtering** before send:
   - `write_relabel_configs` to drop noise
   - Saves bandwidth + backend cost

Mark DESTRUCTIVE: removing remote write while the backend depends on it (gap in long-term history), changing the endpoint without verifying it (data loss), aggressive queue settings that drop samples.

---

Backend: [Thanos / Mimir / VictoriaMetrics / Grafana Cloud]
Symptom: [DESCRIBE]
`remote_write` config:
```yaml
[PASTE]
```
Why this prompt works
Long-term storage at scale requires understanding the remote write pipeline. This prompt walks through it stage by stage: backend choice, queue tuning, authentication, and filtering.
How to use it
- Pick backend based on needs.
- Tune queue for backend speed.
- Filter at source to save bandwidth.
- Monitor queue health.
Useful commands
# Remote write metrics
prometheus_remote_storage_samples_in_total
prometheus_remote_storage_samples_failed_total
prometheus_remote_storage_samples_dropped_total
prometheus_remote_storage_shard_capacity
prometheus_remote_storage_shards
prometheus_remote_storage_shards_desired
prometheus_remote_storage_samples_pending  # prometheus_remote_storage_pending_samples in older releases
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
prometheus_remote_storage_highest_timestamp_in_seconds
# Lag in seconds (newest sample ingested vs newest sample sent, per remote)
prometheus_remote_storage_highest_timestamp_in_seconds
  - ignoring(remote_name, url) group_right
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
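The lag expression can be turned into an alert. The following is a sketch of a Prometheus rule file, modelled loosely on the remote-write alert in the community Prometheus mixin. It matches the two series with `ignoring(remote_name, url)` because the ingest-side metric carries no per-remote labels. The group name, the 120-second threshold, the `for` duration, and the severity label are assumptions to adjust, not recommended values.

```yaml
groups:
  - name: remote-write-lag   # hypothetical group name
    rules:
      - alert: RemoteWriteBehind
        # Seconds between the newest sample ingested locally and the newest
        # sample sent to each remote endpoint.
        expr: |
          (
            max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
            - ignoring(remote_name, url) group_right
            max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          ) > 120
        for: 15m
        labels:
          severity: warning   # assumed label convention
        annotations:
          summary: "Remote write to {{ $labels.remote_name }} is {{ $value }}s behind."
```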
Config patterns
Thanos Receive
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_shards: 1
      max_shards: 50
    write_relabel_configs:
      # Drop noisy Go runtime metrics
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop
Mimir
remote_write:
  - url: "https://mimir.example.com/api/v1/push"
    basic_auth:
      username: tenant1
      password_file: /etc/secret/mimir-password
    queue_config:
      capacity: 10000
      max_samples_per_send: 2000
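Basic auth is only one of the options the prompt lists. As a sketch, the entries below show how bearer-token, mTLS, and SigV4 authentication are expressed in `remote_write`. Every URL, file path, tenant ID, workspace ID, and region here is a placeholder. The `X-Scope-OrgID` header is how Mimir identifies a tenant when multi-tenancy is enabled and no gateway sets it for you.

```yaml
remote_write:
  # Bearer token read from a file, plus an explicit Mimir tenant header.
  - url: "https://mimir.example.com/api/v1/push"
    authorization:
      type: Bearer
      credentials_file: /etc/secret/mimir-token   # placeholder path
    headers:
      X-Scope-OrgID: tenant1                      # placeholder tenant ID

  # Mutual TLS: client certificate and key presented to the endpoint.
  - url: "https://receive.example.com/api/v1/receive"   # placeholder URL
    tls_config:
      ca_file: /etc/prometheus/tls/ca.crt
      cert_file: /etc/prometheus/tls/client.crt
      key_file: /etc/prometheus/tls/client.key

  # AWS SigV4 request signing, e.g. for Amazon Managed Service for Prometheus.
  - url: "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
    sigv4:
      region: us-east-1
```

Reading secrets from files (`credentials_file`, `password_file`) keeps them out of the Prometheus config itself, so the config can be versioned without exposing credentials.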
VictoriaMetrics
remote_write:
  - url: "https://victoriametrics.example.com/api/v1/write"
    queue_config:
      capacity: 10000
      max_samples_per_send: 5000  # VM tolerates large batches
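The queue settings in these patterns trade throughput against memory. Per the Prometheus remote-write tuning guidance, buffered samples scale roughly with shards × (`capacity` + `max_samples_per_send`), and `capacity` is usually kept at a few multiples of `max_samples_per_send`. The annotated sketch below works through that arithmetic for illustrative values; the endpoint and all numbers are assumptions, not recommendations.

```yaml
remote_write:
  - url: "https://remote.example.com/api/v1/write"   # placeholder endpoint
    queue_config:
      max_samples_per_send: 2000   # samples per request
      capacity: 10000              # per-shard buffer, 5x the batch size
      min_shards: 1
      max_shards: 50               # upper bound on parallel senders
      batch_send_deadline: 5s      # flush a partial batch after this long
      min_backoff: 30ms            # first retry delay on a recoverable error
      max_backoff: 5s              # retry delay ceiling
      # Worst-case buffered samples if all 50 shards fill up:
      #   50 * (10000 + 2000) = 600000 samples held in memory
      # Halving max_shards to 25 halves that bound to 300000.
```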
Filter (drop high-cardinality at source)
remote_write:
  - url: "..."
    write_relabel_configs:
      # Keep only essentials
      - source_labels: [__name__]
        regex: 'up|node_.*|http_requests_total|http_request_duration.*'
        action: keep
      # Drop pod-uid label
      - regex: 'pod_uid'
        action: labeldrop
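Dropping can also be targeted rather than allow-list based. The sketch below assumes a hypothetical histogram `http_request_duration_seconds` and a hypothetical `namespace` label. It drops only the bucket series of that histogram and everything from namespaces starting with `dev-`, leaving all other series untouched.

```yaml
remote_write:
  - url: "https://remote.example.com/api/v1/write"   # placeholder endpoint
    write_relabel_configs:
      # Drop the per-bucket series of one histogram; _sum and _count still go out.
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        action: drop
      # Drop every series whose namespace label starts with "dev-".
      - source_labels: [namespace]
        regex: 'dev-.*'
        action: drop
```

Rules run in order and apply only to what is sent to the remote endpoint; the local TSDB keeps the full series.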
Common findings this catches
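- Queue growing constantly → backend too slow; scale or filter.
- Samples dropped → confirm whether `write_relabel_configs` is dropping them on purpose; otherwise look for failed sends or a backlog that outlasted the WAL.
- Lag growing → ingest > send; raise `max_shards` or scale the backend.
- Auth failures → token expired or credentials rotated.
- Local Prom OOM during a backlog → queue memory grows with shards × capacity; cap `max_shards` or lower `capacity`.
- Backend ingest issues at scale → backend capacity.
- Network partition → samples wait in the WAL and are retried; a partition longer than WAL retention (roughly two hours) loses data.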
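Several of these findings can be caught by alerts before they show up as gaps in the backend. A hedged sketch of a rule file follows. The group name, thresholds, `for` durations, and severity labels are assumptions; `prometheus_remote_storage_shards_max` is the gauge Prometheus exposes for the configured shard ceiling.

```yaml
groups:
  - name: remote-write-health   # hypothetical group name
    rules:
      # Sends are failing with errors Prometheus will not retry.
      - alert: RemoteWriteSamplesFailing
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 15m
        labels:
          severity: critical
      # The shard calculation wants more parallelism than max_shards allows,
      # which usually means the backend or network cannot keep up.
      - alert: RemoteWriteShardsSaturated
        expr: |
          max_over_time(prometheus_remote_storage_shards_desired[5m])
            > max_over_time(prometheus_remote_storage_shards_max[5m])
        for: 15m
        labels:
          severity: warning
```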
When to escalate
- Backend capacity planning — strategic.
- Multi-region replication — DR.
- Migration between backends — staged.
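For a staged migration, one common approach is to write to the old and new backends in parallel until the new one holds enough history to cut over. A minimal sketch with placeholder names and URLs follows. Each `remote_write` entry gets its own queue reading from the WAL, so a slow new backend does not hold up the old one, at the cost of extra memory and egress.

```yaml
remote_write:
  # Existing backend: left unchanged during the migration.
  - name: old-backend                                    # hypothetical name
    url: "https://old-backend.example.com/api/v1/push"   # placeholder URL
  # New backend: added alongside; keep both until its history is verified.
  - name: new-backend                                    # hypothetical name
    url: "https://new-backend.example.com/api/v1/write"  # placeholder URL
    queue_config:
      max_shards: 10   # assumed conservative cap while the new backend is sized
```

The `name` value appears as the `remote_name` label on the queue metrics, which makes the two queues easy to compare side by side.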
Related prompts
- Prometheus Performance Tuning Prompt
  Tune Prometheus performance: head series, memory, query timeout, max samples, ingestion rate, expensive queries.
- Prometheus Storage, Retention & TSDB Prompt
  Configure Prometheus TSDB: retention, block size, compaction, WAL, disk sizing, troubleshooting OOM / disk-full.
- Thanos Architecture & Component Debug Prompt
  Operate Thanos: Sidecar, Receive, Store Gateway, Compactor, Querier, Ruler; troubleshoot dedup, downsampling, S3 issues.