Tuning Prometheus Remote Write for Reliable Metric Shipping

Remote write is the seam where Prometheus stops being a single box and becomes part of a long-term, horizontally scaled monitoring system. It’s how your local Prometheus ships samples to Thanos Receive, Mimir, or a managed backend. It’s also where samples quietly vanish under load when the queue can’t keep up, and the failure is invisible unless you’re watching the right metrics. I’ve debugged enough “where did last hour’s data go” mysteries to know the defaults aren’t enough at scale.

How remote write actually moves data

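Prometheus writes samples to its local WAL. A watcher tails the WAL and distributes samples across a set of shards, each with its own in-memory queue, and each shard ships batches to the remote endpoint over HTTP. If the endpoint is slow or the sample rate exceeds what the shards can flush, those queues fill, and once they’re full Prometheus stops reading further from the WAL until there is room. From then on the backlog lives only in the WAL, and the WAL is truncated on a schedule (roughly every two hours in a default server setup). Anything not shipped by the time its segment is truncated is gone, and samples the endpoint rejects with a non-recoverable error are discarded rather than retried. Understanding that pipeline is the whole game: WAL → shards → queues → batched HTTP → remote endpoint.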

The config that matters

Here’s a queue_config tuned for a high-throughput Prometheus, with the knobs that actually move the needle:

remote_write:
  - url: http://mimir:8080/api/v1/push
    queue_config:
      capacity: 10000          # per-shard queue depth
      max_shards: 50           # ceiling on parallelism
      min_shards: 4
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
    metadata_config:
      send: true
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'go_gc_.*|process_.*'
        action: drop

A few of these deserve real explanation, because the docs make them sound interchangeable and they aren’t.

Shards: let it scale, but cap it

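Prometheus auto-scales shards between min_shards and max_shards based on backlog. Set max_shards too low and it can’t catch up: queues fill and the backlog grows until data is lost. Set it too high and it overwhelms the remote endpoint, which starts returning 429s; the rejected samples either fail outright or, with retry_on_http_429 enabled, are retried on top of the existing load, and either way the situation gets worse. Start with max_shards a little above the peak value of prometheus_remote_storage_shards_desired you observe, and watch the actual shard count rather than guessing.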
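
One way to catch the too-low case is an alert that fires when the shard count Prometheus wants exceeds the configured ceiling. This is a minimal sketch of a rule-file entry, assuming your version exposes prometheus_remote_storage_shards_desired and prometheus_remote_storage_shards_max; the group name, alert name, 15m hold, and severity are placeholders of mine:

```yaml
groups:
  - name: remote-write-shards              # hypothetical group name
    rules:
      - alert: RemoteWriteWantsMoreShards  # hypothetical alert name
        # Desired shards above the ceiling means the queue would scale
        # further if max_shards allowed it.
        expr: |
          max_over_time(prometheus_remote_storage_shards_desired[5m])
            > max_over_time(prometheus_remote_storage_shards_max[5m])
        for: 15m
        labels: { severity: warning }
```

If this fires while the endpoint is healthy, raising max_shards is the likely fix; if it fires alongside 429s, more shards will not help.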

capacity and max_samples_per_send

capacity is how many samples each shard buffers before Prometheus stops reading more from the WAL. max_samples_per_send is the batch size per HTTP request. Bigger batches amortize request overhead but raise latency and memory. For most backends 1000–2000 samples per send is the sweet spot; push higher only if the endpoint clearly handles it.
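
To make the memory trade-off concrete, here is the arithmetic for the values used in the config above, written as an annotated fragment. The 3-10x ratio between capacity and max_samples_per_send is the rule of thumb I recall from the Prometheus remote-write tuning guide; treat it as a starting point and check it against the docs for your version:

```yaml
remote_write:
  - url: http://mimir:8080/api/v1/push
    queue_config:
      max_samples_per_send: 2000   # samples per HTTP request
      # 10000 / 2000 = 5 batches buffered per shard, inside the
      # commonly suggested 3-10x range.
      capacity: 10000
      # Worst case buffered in memory across all shards:
      # max_shards * capacity = 50 * 10000 = 500,000 samples.
      max_shards: 50
```

Doubling capacity or max_shards doubles that worst-case figure, which is why the two should be raised deliberately rather than together.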

write_relabel_configs: drop before you ship

The cheapest fix for remote-write pressure is sending less. write_relabel_configs lets you drop entire metrics or labels before they hit the queue. go_gc_*, process_*, and other runtime noise rarely justify their network and storage cost downstream. This is the same cardinality discipline from taming Prometheus metric cardinality, applied at the shipping boundary.
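
As a sketch of both kinds of drop, the fragment below removes whole metric families by name and strips one label from everything that is shipped. The label name pod_template_hash is a hypothetical example, not something from this setup:

```yaml
remote_write:
  - url: http://mimir:8080/api/v1/push
    write_relabel_configs:
      # Drop whole metric families by name (the regex is fully anchored).
      - source_labels: [__name__]
        regex: 'go_gc_.*|process_.*'
        action: drop
      # Drop a single label from every series. "pod_template_hash" is a
      # hypothetical example; removing a label that distinguishes series
      # can collapse them into duplicates, so check before shipping.
      - regex: 'pod_template_hash'
        action: labeldrop
```

These rules only affect what is sent to the remote endpoint; the local TSDB still stores the full series.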

The four metrics that tell you the truth

Tuning blind is hopeless. Remote write exposes its own state, and these four queries are the dashboard I live on:

# 1. Are we dropping samples? Beyond intentional write_relabel_configs drops, this should be ZERO.
rate(prometheus_remote_storage_samples_dropped_total[5m])

# 2. How far behind is the queue? (samples pending)
prometheus_remote_storage_samples_pending

# 3. Are shards maxed out and still behind?
prometheus_remote_storage_shards
  == prometheus_remote_storage_shards_max

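# 4. How old is the data we're shipping? (lag, in seconds)
# The first gauge has no remote_name/url labels, so match explicitly.
prometheus_remote_storage_highest_timestamp_in_seconds
  - ignoring(remote_name, url) group_right
prometheus_remote_storage_queue_highest_sent_timestamp_seconds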

That last one — the lag between the newest sample we have and the newest we’ve successfully sent — is the single best health signal. If it climbs and stays up, you’re falling behind and a drop is coming. Alert on it before samples_dropped ever fires.

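- alert: RemoteWriteLagging
  # Lag in seconds per remote endpoint; the first gauge lacks the
  # remote_name/url labels, hence the explicit vector matching.
  expr: |
    (prometheus_remote_storage_highest_timestamp_in_seconds
     - ignoring(remote_name, url) group_right
       prometheus_remote_storage_queue_highest_sent_timestamp_seconds) > 120
  for: 10m
  labels: { severity: warning }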

The WAL is your buffer, respect it

When the remote endpoint goes down, Prometheus keeps retrying and the unsent samples stay in the WAL until the endpoint returns; that’s your durability cushion. But it’s bounded by disk and by how long the WAL is kept: in a default server setup the WAL is truncated roughly every two hours, so an outage much longer than that means data loss on recovery. If you ship to a remote backend you care about, size the WAL disk for a realistic outage window, find out how much WAL your version and mode actually retain, and monitor prometheus_tsdb_wal_corruptions_total.
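
Monitoring that counter can be as simple as alerting on any increase. A minimal sketch, with a group name, alert name, and one-hour window that are my own placeholders:

```yaml
groups:
  - name: remote-write-wal             # hypothetical group name
    rules:
      - alert: PrometheusWALCorruption # hypothetical alert name
        # Any increase over the last hour means at least one WAL
        # corruption was detected.
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels: { severity: warning }
```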

429s: when the problem is the other end

If the remote endpoint returns 429 Too Many Requests, more shards make it worse, not better. The right move is to back off and shed load: tighten write_relabel_configs to send fewer series, lower max_shards, and talk to whoever runs the backend about ingestion limits. Watch:

rate(prometheus_remote_storage_samples_failed_total[5m])

Failed (vs dropped) usually means the endpoint is rejecting you — a different fix entirely from queue exhaustion.
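
A sustained non-zero failure rate is worth its own alert, separate from lag. The rule below is a sketch; the names and the 15m hold are placeholders. As far as I know, whether 429s land in this counter or get retried depends on the experimental retry_on_http_429 queue setting, so check how your version behaves:

```yaml
groups:
  - name: remote-write-failures          # hypothetical group name
    rules:
      - alert: RemoteWriteSamplesFailing # hypothetical alert name
        # Samples the endpoint rejected with a non-recoverable error.
        expr: rate(prometheus_remote_storage_samples_failed_total[5m]) > 0
        for: 15m
        labels: { severity: warning }
```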

A sane tuning loop

I don’t tune remote write by reading docs and guessing. The loop:

1. Set a generous max_shards and reasonable capacity.
2. Aggressively drop runtime/debug metrics with write_relabel_configs.
3. Watch lag and samples_pending under real peak load.
4. If lag climbs with shards maxed → raise max_shards (if the endpoint can take it) or drop more series.
5. If you see 429s → lower max_shards and shed load instead.
6. Alert on lag, not just drops, so you act before data is lost.

Wire those alerts through your normal monitoring alert routing and a lagging remote-write queue becomes a tractable warning instead of a silent gap in your long-term metrics.

Remote-write internals and metric names change between Prometheus versions. Validate every setting and query against your own deployment before tuning production.