
Prometheus Remote Write & Long-term Storage Prompt

Configure remote write to long-term storage (Thanos Receive, Cortex/Mimir, VictoriaMetrics) and troubleshoot queue backlog and back-pressure.

Target user
Platform engineers running scalable Prometheus
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has integrated Prometheus with long-term storage — Thanos, Cortex/Mimir, VictoriaMetrics — for global query and multi-year retention.

I will provide:
- Long-term storage backend choice
- Symptom (remote write queue growing, samples dropped, slow ingest)
- Current `remote_write` config

Your job:

1. **When to use remote write**:
   - Long-term retention (months/years) beyond local TSDB
   - Multi-Prometheus aggregation
   - Disaster recovery
2. **Backend choices**:
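   - **Thanos**: Receive accepts remote write; the Sidecar is the alternative that uploads TSDB blocks to object storage without remote write; Querier provides the global view
   - **Mimir / Cortex**: multi-tenant, Prometheus-compatible, horizontally scalable
   - **VictoriaMetrics**: open source, single-binary or cluster
   - **Grafana Cloud**: managed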
3. **For remote_write config**:
   - `url`: remote endpoint
   - `queue_config`: buffering, batch size, shard count, retry backoff
   - `write_relabel_configs`: drop or transform series before sending
4. **For "queue growing"**:
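   - The remote endpoint accepts samples more slowly than Prometheus ingests them
   - Tune the queue: raise `max_shards` for more parallelism, then `capacity` and `max_samples_per_send` for larger batches
   - Or the backend is undersized and needs scaling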
5. **For "samples dropped"**:
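   - Samples are lost when the backend rejects them with a non-retryable error, or when the queue lags so far behind that the WAL segments it needs are already truncated
   - Check `prometheus_remote_storage_samples_failed_total` and `prometheus_remote_storage_samples_dropped_total` (the latter also counts samples removed by `write_relabel_configs`)
   - Reduce ingest rate or scale backend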
6. **For "back-pressure"**:
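   - A full queue blocks the WAL reader for that remote until its shards drain
   - Local scraping continues, but the memory held by shard queues can pressure the whole Prometheus process
   - Critical to monitor: lag, pending samples, desired vs. maximum shards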
7. **For authentication**:
   - Bearer token, basic auth, mTLS, sigv4
   - Secret management
8. **For metric filtering** before send:
   - `write_relabel_configs` to drop noise
   - Saves bandwidth + backend cost

Mark as DESTRUCTIVE: removing remote write while the backend still depends on it (gap in long-term history), changing the endpoint without verifying the new one (data loss), and aggressive queue settings that cause samples to be dropped.

---

Backend: [Thanos / Mimir / VictoriaMetrics / Grafana Cloud]
Symptom: [DESCRIBE]
`remote_write` config:
```yaml
[PASTE]
```

Why this prompt works

Long-term storage at scale requires understanding the remote write pipeline. This prompt walks through it stage by stage: backend choice, queue tuning, filtering, authentication, and failure modes.

How to use it

  1. Pick backend based on needs.
  2. Tune queue for backend speed.
  3. Filter at source to save bandwidth.
  4. Monitor queue health.
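
Step 2 can be sketched as a `queue_config` that spells out the main tuning knobs. This is a minimal illustration, not a recommendation: the endpoint is a placeholder, and every value is a starting assumption to adjust against the queue metrics. The Prometheus tuning guidance suggests keeping `capacity` at roughly 3 to 10 times `max_samples_per_send`.

```yaml
remote_write:
- url: "https://remote.example.com/api/v1/push"   # placeholder endpoint (assumption)
  remote_timeout: 30s            # per-request timeout to the backend
  queue_config:
    capacity: 10000              # samples buffered per shard (5x the batch size here)
    max_samples_per_send: 2000   # batch size per request
    batch_send_deadline: 5s      # flush a partial batch after this long
    min_shards: 1                # parallelism floor
    max_shards: 50               # parallelism ceiling; memory grows with shards x capacity
    min_backoff: 30ms            # first retry delay after a recoverable error
    max_backoff: 5s              # retry delay cap
    retry_on_http_429: true      # experimental flag: retry when the backend rate-limits
```

Raising `max_shards` increases both throughput and memory use, so it is usually worth changing one knob at a time and watching lag and pending samples between changes.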

Useful commands

# Remote write metrics
prometheus_remote_storage_samples_in_total
prometheus_remote_storage_samples_failed_total
prometheus_remote_storage_samples_dropped_total
prometheus_remote_storage_shard_capacity
prometheus_remote_storage_shards
prometheus_remote_storage_shards_desired
prometheus_remote_storage_samples_pending
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
prometheus_remote_storage_highest_timestamp_in_seconds

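# Lag in seconds (newest ingested sample vs newest sent sample, per remote)
# The left-hand metric has no remote_name/url labels, so ignore them when matching.
prometheus_remote_storage_highest_timestamp_in_seconds
  - ignoring(remote_name, url) group_right
prometheus_remote_storage_queue_highest_sent_timestamp_seconds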
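
The metrics above can be turned into alerts. The rules file below is a sketch modelled loosely on the upstream Prometheus mixin. The group name, alert names, thresholds, and `for` durations are assumptions, and metric names differ between Prometheus versions, so check them against your own `/metrics` output first.

```yaml
groups:
- name: remote-write-health          # hypothetical group name
  rules:
  # Newest sent sample trails newest ingested sample by more than 2 minutes.
  - alert: RemoteWriteBehind
    expr: |
      (
        max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
      - ignoring(remote_name, url) group_right
        max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
      ) > 120
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: "Remote write to {{ $labels.remote_name }} is {{ $value | humanizeDuration }} behind."
  # More than 1% of samples fail with non-recoverable errors.
  - alert: RemoteWriteFailures
    expr: |
      (
        rate(prometheus_remote_storage_samples_failed_total[5m])
      /
        (
          rate(prometheus_remote_storage_samples_failed_total[5m])
        +
          rate(prometheus_remote_storage_samples_total[5m])
        )
      ) * 100 > 1
    for: 15m
    labels:
      severity: critical
  # The queue wants more shards than max_shards allows.
  - alert: RemoteWriteShardsSaturated
    expr: |
      max_over_time(prometheus_remote_storage_shards_desired[5m])
      >
      max_over_time(prometheus_remote_storage_shards_max[5m])
    for: 15m
    labels:
      severity: warning
```

The shard alert tends to fire before lag becomes visible, which makes it a useful early signal that the backend or `max_shards` needs attention.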

Config patterns

Thanos Receive

remote_write:
- url: "http://thanos-receive:19291/api/v1/receive"
  queue_config:
    capacity: 10000
    max_samples_per_send: 2000
    batch_send_deadline: 5s
    min_shards: 1
    max_shards: 50
  write_relabel_configs:
  # Drop noisy Go runtime metrics
  - source_labels: [__name__]
    regex: 'go_.*'
    action: drop

Mimir

remote_write:
- url: "https://mimir.example.com/api/v1/push"
  basic_auth:
    username: tenant1
    password_file: /etc/secret/mimir-password
  queue_config:
    capacity: 10000
    max_samples_per_send: 2000
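
The Mimir pattern above uses basic auth. The other mechanisms the prompt lists (bearer token, mTLS, SigV4) might look like the sketch below. All URLs, file paths, and the region are placeholders, not values from any real deployment. Prometheus accepts only one of `basic_auth`, `authorization`, and `sigv4` per remote, while `tls_config` can be combined with any of them.

```yaml
remote_write:
# Bearer token read from a file, so the secret stays out of the config
- url: "https://remote-a.example.com/api/v1/push"        # placeholder
  authorization:
    type: Bearer
    credentials_file: /etc/secret/remote-write-token     # hypothetical path
# Mutual TLS with a client certificate
- url: "https://remote-b.example.com/api/v1/push"        # placeholder
  tls_config:
    ca_file: /etc/tls/ca.crt                             # hypothetical paths
    cert_file: /etc/tls/client.crt
    key_file: /etc/tls/client.key
# AWS SigV4 request signing; credentials come from the default AWS chain
- url: "https://remote-c.example.com/api/v1/remote_write" # placeholder
  sigv4:
    region: us-east-1                                    # assumed region
```

File-based credentials let a secret manager or sidecar rotate the token on disk without the secret ever appearing in the Prometheus configuration itself.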

VictoriaMetrics

remote_write:
- url: "https://victoriametrics.example.com/api/v1/write"
  queue_config:
    capacity: 20000                   # keep capacity at roughly 3-10x max_samples_per_send
    max_samples_per_send: 5000        # VM tolerates large batches

Filter (drop high-cardinality at source)

remote_write:
- url: "..."
  write_relabel_configs:
  # Keep only essentials
  - source_labels: [__name__]
    regex: 'up|node_.*|http_requests_total|http_request_duration.*'
    action: keep
  # Drop pod-uid label
  - regex: 'pod_uid'
    action: labeldrop
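
Filtering can also target a combination of labels rather than the metric name alone. The sketch below is illustrative only: the job name `batch-worker`, the `env` values, and the endpoint are hypothetical. Relabel regexes are fully anchored, so each pattern must match the entire joined value.

```yaml
remote_write:
- url: "https://remote.example.com/api/v1/push"   # placeholder endpoint (assumption)
  write_relabel_configs:
  # Drop histogram buckets from one job; values are joined as "job;__name__"
  - source_labels: [job, __name__]
    separator: ";"
    regex: 'batch-worker;.*_bucket'
    action: drop
  # Drop every series from non-production environments
  - source_labels: [env]
    regex: 'dev|sandbox'
    action: drop
```

Rules run in order, and these only affect what is sent to the remote; the local TSDB still keeps the full data.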

Common findings this catches

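  • Queue growing constantly → backend too slow; scale it or filter at source.
  • Samples dropped or failed → relabel rules, non-retryable backend errors, or lag beyond WAL retention; check which counter is rising.
  • Lag growing → ingest > send; raise `max_shards` if the backend can absorb the load.
  • Auth failures (401/403) → expired or wrong credentials.
  • Local Prometheus OOM while remote write is behind → memory held by shard queues; lower `max_shards` or `capacity`.
  • Backend ingest issues at scale → backend capacity.
  • Network partition → samples are retried from the WAL until it is truncated, then lost.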

When to escalate

  • Backend capacity planning — strategic.
  • Multi-region replication — DR.
  • Migration between backends — staged.
