Prometheus Error Guide: 'remote_write 429' Server Returned HTTP Status 400

Overview

Prometheus remote_write ships samples to a remote endpoint (Mimir, Thanos Receive, Cortex, VictoriaMetrics, or a vendor service) over HTTP. When the receiver rejects a batch, Prometheus logs server returned HTTP status <code> <reason> and either retries or drops the batch: 5xx responses are retried with backoff, 429 is retried only when retry_on_http_429 is enabled in queue_config, and other 4xx responses are dropped. The two codes you will see most are 429 Too Many Requests (the receiver is rate-limiting or applying backpressure) and 400 Bad Request (the batch is malformed, violates a receiver limit such as the maximum label name length, or contains out-of-order samples).

You will see this in the Prometheus log:

ts=2026-06-23T14:11:09.882Z caller=dedupe.go:112 component=remote level=warn remote_name=mimir url=https://mimir:8080/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 429 Too Many Requests"

The non-retriable 400 variant is dropped, not retried:

err="server returned HTTP status 400 Bad Request: received a series whose number of labels exceeds the limit"

It is a sender/receiver contract problem: the scrape and local TSDB are healthy, but data is not reaching (or is being refused by) the remote store. Backlogs build in the local WAL until the queue drains or samples expire.
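The retry-or-drop decision described above can be sketched in a few lines. This is a simplified model, assuming 5xx responses are retried, 429 is retried only when retry_on_http_429 is set, and every other 4xx is dropped; send_outcome is a hypothetical helper for illustration, not Prometheus code.

```python
def send_outcome(status: int, retry_on_http_429: bool = False) -> str:
    """Classify a remote-write response as 'ok', 'retry', or 'drop' (simplified model)."""
    if 200 <= status < 300:
        return "ok"
    if 500 <= status < 600:
        return "retry"  # server-side errors are retried with backoff
    if status == 429 and retry_on_http_429:
        return "retry"  # assumes queue_config.retry_on_http_429 is true
    return "drop"       # remaining 4xx responses: the batch is discarded


if __name__ == "__main__":
    for code in (200, 400, 429, 503):
        print(code, send_outcome(code), send_outcome(code, retry_on_http_429=True))
```

If the log says "retrying" next to a 429, as in the sample line above, the sender is treating 429 as retriable.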

Symptoms

Repeated Failed to send batch, retrying (429) or dropped-batch (400) log lines for a remote_name.
Remote-write queue metrics show a growing backlog and rising failures.
Gaps in the remote/long-term store while the local Prometheus graphs look fine.
WAL/disk growth on the sender as undelivered samples accumulate.

rate(prometheus_remote_storage_samples_failed_total[5m]) > 0

prometheus_remote_storage_samples_pending

{remote_name="mimir", url="https://mimir:8080/api/v1/push"}  148221

Common Root Causes

1. Receiver rate-limiting (429) — ingestion limits exceeded

The receiver enforces a per-tenant samples/sec or series limit and returns 429 when you exceed it. Confirm the code distribution:

journalctl -u prometheus --no-pager | grep -oE 'HTTP status [0-9]+' | sort | uniq -c

   2841 HTTP status 429
     12 HTTP status 200

A flood of 429s means the remote is throttling. The fix is on the receiver (raise the tenant limit) or on the sender (ship fewer series, or shard the workload so each tenant stays under its limit).
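The same tally can be done in a script when the logs are already collected somewhere other than journald. A minimal sketch, assuming each log line contains the text "HTTP status <code>" as in the examples above; tally_status_codes is a hypothetical helper and the sample lines are shortened for illustration.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"HTTP status (\d{3})")


def tally_status_codes(lines):
    """Count the HTTP status codes mentioned in remote-write log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        'msg="Failed to send batch, retrying" err="server returned HTTP status 429 Too Many Requests"',
        'err="server returned HTTP status 400 Bad Request: out of order sample"',
        'err="server returned HTTP status 429 Too Many Requests"',
    ]
    for code, count in tally_status_codes(sample).most_common():
        print(count, code)
```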

2. Label limit violations (400) — too many or too-long labels

Receivers cap label count, label name/value length, and total series size. A high-cardinality metric trips a non-retriable 400:

err="server returned HTTP status 400 Bad Request: received a series whose label value length exceeds the limit, label: 'path', value: '/api/v2/...'"

Find the highest-cardinality metric names locally; these are the usual suspects:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, count by (__name__) ({__name__=~".+"}))' \
  | jq -r '.data.result[] | "\(.value[1])\t\(.metric.__name__)"'

98221	http_requests_total
41002	apiserver_request_duration_seconds_bucket
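Once a suspect metric is identified, its label sets can be checked against the receiver's limits before they are shipped. The sketch below assumes hypothetical limit values (30 labels, 1024-character names, 2048-character values); read the real numbers from your receiver's configuration. label_violations is an illustrative helper, not part of any Prometheus or receiver API.

```python
# Hypothetical limits for illustration; use your receiver's configured values.
MAX_LABELS = 30
MAX_NAME_LEN = 1024
MAX_VALUE_LEN = 2048


def label_violations(labels, max_labels=MAX_LABELS,
                     max_name_len=MAX_NAME_LEN, max_value_len=MAX_VALUE_LEN):
    """Return a list of human-readable limit violations for one series' labels."""
    problems = []
    if len(labels) > max_labels:
        problems.append(f"{len(labels)} labels exceeds limit {max_labels}")
    for name, value in labels.items():
        if len(name) > max_name_len:
            problems.append(f"label name '{name[:32]}' is {len(name)} chars")
        if len(value) > max_value_len:
            problems.append(f"label '{name}' value is {len(value)} chars")
    return problems


if __name__ == "__main__":
    series = {"__name__": "http_requests_total", "path": "/api/" + "x" * 3000}
    print(label_violations(series))
```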

3. Out-of-order samples at the receiver (400)

If two senders (an HA pair) write the same series, the receiver rejects out-of-order samples with 400 unless its out-of-order window is enabled.

err="server returned HTTP status 400 Bad Request: out of order sample"

grep -nE 'replica|external_labels' /etc/prometheus/prometheus.yml

external_labels:
  cluster: prod
# (no replica label -> HA pair produces identical series)

Missing a replica external label on an HA pair causes colliding, out-of-order writes.
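A toy model makes the collision visible. It assumes a receiver that rejects any sample older than the newest one already stored for the same series, which is a simplification and not any particular receiver's ingestion logic; ingest and series_id are hypothetical names.

```python
def series_id(name, external_labels):
    """Identity of a series as the receiver sees it: name plus external labels."""
    return (name, tuple(sorted(external_labels.items())))


def ingest(samples):
    """Accept (series, timestamp) pairs; reject timestamps older than the newest stored."""
    newest = {}
    accepted, rejected = [], []
    for series, ts in samples:
        if series in newest and ts < newest[series]:
            rejected.append((series, ts))
        else:
            newest[series] = ts
            accepted.append((series, ts))
    return accepted, rejected


if __name__ == "__main__":
    # Replica A's batch arrives first, then replica B's slightly offset batch.
    same = series_id("up", {"cluster": "prod"})
    print(ingest([(same, 1000), (same, 2000), (same, 1005), (same, 2005)])[1])

    a = series_id("up", {"cluster": "prod", "replica": "a"})
    b = series_id("up", {"cluster": "prod", "replica": "b"})
    print(ingest([(a, 1000), (a, 2000), (b, 1005), (b, 2005)])[1])
```

With identical external labels the second replica's older sample is rejected; with a distinct replica label the two streams are separate series and nothing collides.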

4. Queue saturation / insufficient shards

The remote-write queue cannot keep up; shards are maxed and samples pile up as pending even without server errors. Inspect shard behavior:

prometheus_remote_storage_shards
prometheus_remote_storage_shards_max

prometheus_remote_storage_shards{remote_name="mimir"}      200
prometheus_remote_storage_shards_max{remote_name="mimir"}  200

Shards pinned at shards_max while samples_pending keeps growing means the queue needs more throughput: raise max_shards or send larger batches with max_samples_per_send.
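A back-of-envelope estimate helps when choosing a new max_shards. The sketch below assumes each shard sends one batch at a time, so a shard's throughput is roughly its batch size divided by the send latency. This is not Prometheus's actual resharding algorithm, shards_needed is a hypothetical helper, and the input numbers are illustrative.

```python
def shards_needed(samples_per_sec: int, send_latency_ms: int,
                  max_samples_per_send: int) -> int:
    """Estimate shards required, assuming one in-flight batch per shard."""
    # Per-shard throughput is max_samples_per_send * 1000 / send_latency_ms,
    # so shards = ceil(samples_per_sec * send_latency_ms / (batch * 1000)).
    numerator = samples_per_sec * send_latency_ms
    denominator = max_samples_per_send * 1000
    return max(1, -(-numerator // denominator))  # integer ceiling division


if __name__ == "__main__":
    print(shards_needed(90_000, 100, 2_000))
    print(shards_needed(90_000, 500, 500))
```

If the estimate is far below the current shards_max yet shards are still pinned, the bottleneck is more likely receiver latency or throttling than the shard ceiling.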

5. Auth/timeout misconfiguration (401/403/timeouts surfacing as send failures)

A wrong bearer token, expired credential, or too-short remote_timeout shows up as send failures alongside the HTTP-status errors.

curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Authorization: Bearer <TOKEN>' \
  --data-binary @/dev/null https://mimir:8080/api/v1/push

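A direct probe returning 401/403 isolates an auth problem from a rate-limit one. Because the probe body is empty, any other 4xx response (typically 400) suggests that authentication passed and the receiver rejected only the payload.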

6. Payload too large (413) or compression mismatch

A max_samples_per_send set too high can exceed the receiver’s request size limit, returning 413; this is fixed by lowering the batch size.

err="server returned HTTP status 413 Request Entity Too Large"

grep -nA3 'queue_config' /etc/prometheus/prometheus.yml

queue_config:
  max_samples_per_send: 20000

A 20k-sample batch may exceed the receiver’s body limit; lower max_samples_per_send until requests fit within it.

Diagnostic Workflow

Step 1: Tally the HTTP status codes

journalctl -u prometheus --no-pager | grep 'remote' | grep -oE 'HTTP status [0-9]+ [A-Za-z ]+' | sort | uniq -c | sort -rn

429 vs 400 vs 401/413 immediately narrows the cause (backpressure vs malformed vs auth/size).
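The mapping from the dominant status code to a likely cause can be written down as a small lookup. A minimal sketch, assuming the tally is available as a dictionary of code to count; the cause strings paraphrase this guide, and dominant_cause is a hypothetical helper.

```python
CAUSES = {
    429: "backpressure: receiver is rate-limiting",
    400: "malformed batch or receiver limit violated",
    401: "auth: missing or invalid credentials",
    403: "auth: credentials lack permission",
    413: "size: request body too large",
}


def cause_for(status: int) -> str:
    """Map one HTTP status code to a likely cause."""
    if status in CAUSES:
        return CAUSES[status]
    if 500 <= status < 600:
        return "receiver-side error"
    return "unclassified: read the full error message"


def dominant_cause(counts: dict) -> tuple:
    """Return (status, cause) for the most frequent status code in the tally."""
    status = max(counts, key=counts.get)
    return status, cause_for(status)


if __name__ == "__main__":
    print(dominant_cause({429: 2841, 400: 12}))
```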

Step 2: Read the dropped-batch reason for 400s

journalctl -u prometheus --no-pager | grep -i 'remote' | grep -i 'Bad Request' | tail -5

The receiver’s message names the exact limit (label length, label count, out of order).

Step 3: Check queue health

prometheus_remote_storage_samples_pending
prometheus_remote_storage_shards / prometheus_remote_storage_shards_max
rate(prometheus_remote_storage_samples_failed_total[5m])

Rising pending samples with shards at max means a throughput/sharding problem; failures without backlog point at rejected (400) data.
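The interpretation rule above can be expressed as a small decision function. This is a sketch under stated assumptions: the inputs are two readings of samples_pending taken some minutes apart, the current shards and shards_max values, and the failed-samples rate. diagnose_queue and its return strings are hypothetical.

```python
def diagnose_queue(pending_now: float, pending_before: float,
                   shards: float, shards_max: float, failed_rate: float) -> str:
    """Classify remote-write queue health from a few metric readings."""
    backlog_growing = pending_now > pending_before
    if backlog_growing and shards >= shards_max:
        return "throughput: shards saturated, raise max_shards or batch size"
    if failed_rate > 0 and not backlog_growing:
        return "rejected data: read the 400 reasons in the log"
    if backlog_growing:
        return "backlog below shards_max: check receiver latency and status codes"
    return "healthy"


if __name__ == "__main__":
    print(diagnose_queue(148221, 90000, 200, 200, 0.0))
    print(diagnose_queue(1200, 1200, 12, 200, 35.0))
```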

Step 4: Probe the endpoint directly

curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' -X POST \
  -H 'Content-Type: application/x-protobuf' -H 'Authorization: Bearer <TOKEN>' \
  --data-binary @/dev/null https://mimir:8080/api/v1/push

Isolates auth (401/403), reachability, and latency from sample-content problems.

Step 5: Review queue_config and external_labels

grep -nE -A8 'remote_write|external_labels' /etc/prometheus/prometheus.yml

Confirm max_shards, max_samples_per_send, remote_timeout, and the replica/external_labels for HA.

Example Root Cause Analysis

After enabling long-term storage, a sender logs a steady stream of server returned HTTP status 429 Too Many Requests and the remote store is missing the most recent two hours.

Tallying codes and checking the queue:

journalctl -u prometheus --no-pager | grep remote | grep -oE 'HTTP status [0-9]+' | sort | uniq -c

   3922 HTTP status 429

prometheus_remote_storage_samples_pending{remote_name="mimir"}

Every batch is throttled, and the pending gauge shows about 1.2M samples backed up. The Mimir tenant’s ingestion_rate limit is set to 25,000 samples/sec, but this Prometheus is producing ~90,000 samples/sec. The sender is healthy; the receiver limit is too low.
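The size of the gap is worth quantifying before choosing a fix. The arithmetic below uses only the two rates from this example; overshoot is a hypothetical helper.

```python
def overshoot(produced_per_sec: float, limit_per_sec: float):
    """Return (ratio over the limit, fraction of samples that must be shed)."""
    ratio = produced_per_sec / limit_per_sec
    must_shed = max(0.0, 1 - limit_per_sec / produced_per_sec)
    return ratio, must_shed


if __name__ == "__main__":
    ratio, shed = overshoot(90_000, 25_000)
    print(f"{ratio:.1f}x over the limit; shed {shed:.0%} of samples or raise the limit")
```

At 3.6 times the limit, the sender would have to shed roughly 72% of its samples to fit, which is why raising the receiver limit is part of the fix rather than relying on drop rules alone.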

The fix raises the tenant ingestion limit on the receiver and reduces what is shipped by dropping noisy series with a write_relabel_configs drop rule:

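remote_write:
  - url: https://mimir:8080/api/v1/push
    write_relabel_configs:
      # Drop Go runtime internals before they reach the receiver.
      - source_labels: [__name__]
        regex: 'go_gc_.*|go_memstats_.*'
        action: drop
    queue_config:
      # Higher shard ceiling to drain the backlog once the tenant limit is raised;
      # more shards alone do not clear 429s.
      max_shards: 400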

With the tenant limit raised and Go runtime internals dropped from the write path, the 429s stop, samples_pending drains to near zero, and the recent window backfills.
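Before rolling out a drop rule, it helps to check which metric names the regex actually matches. Prometheus anchors relabel regexes at both ends, so the sketch below uses a full match to mimic that; is_dropped is a hypothetical helper and the metric names are illustrative.

```python
import re

# Same pattern as the write_relabel_configs drop rule in this example.
DROP_REGEX = re.compile(r"go_gc_.*|go_memstats_.*")


def is_dropped(metric_name: str) -> bool:
    """True if the drop rule would remove this metric from the write path."""
    # fullmatch mirrors the implicit ^...$ anchoring of relabel regexes.
    return DROP_REGEX.fullmatch(metric_name) is not None


if __name__ == "__main__":
    for name in ("go_gc_duration_seconds", "go_memstats_alloc_bytes",
                 "go_goroutines", "http_requests_total"):
        print(name, "dropped" if is_dropped(name) else "kept")
```

Note that go_goroutines is kept: the rule only removes names that begin with go_gc_ or go_memstats_.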

Prevention Best Practices

Right-size receiver ingestion limits (samples/sec, series, label length/count) to your actual emission rate, and alert on the receiver’s discarded-samples metric so 429/400 surfaces before backlogs grow.
Trim the write path with write_relabel_configs drop rules — ship only what the long-term store needs, not Go runtime and per-request high-cardinality internals.
Tune queue_config (max_shards, max_samples_per_send, min_backoff) for throughput, and watch prometheus_remote_storage_samples_pending and shards/shards_max.
For HA senders, set a replica external label and enable the receiver’s out-of-order window (or dedupe) to avoid 400 out-of-order rejections.
Alert on rate(prometheus_remote_storage_samples_failed_total[5m]) > 0 so any sustained send failure pages you.

Quick Command Reference

# Tally HTTP status codes from remote-write
journalctl -u prometheus --no-pager | grep remote \
  | grep -oE 'HTTP status [0-9]+ [A-Za-z ]+' | sort | uniq -c | sort -rn

# Read 400 dropped-batch reasons
journalctl -u prometheus --no-pager | grep remote | grep -i 'Bad Request' | tail -5

# Probe the receiver endpoint directly
curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' -X POST \
  -H 'Authorization: Bearer <TOKEN>' --data-binary @/dev/null <PUSH_URL>

# Review remote_write config and external_labels
grep -nE -A8 'remote_write|external_labels' /etc/prometheus/prometheus.yml

# Queue and failure health
rate(prometheus_remote_storage_samples_failed_total[5m])
prometheus_remote_storage_samples_pending
prometheus_remote_storage_shards / prometheus_remote_storage_shards_max

Conclusion

Remote-write server returned HTTP status errors mean the receiver refused a batch while your local Prometheus is fine. Triage by code:

Tally the status codes — 429 (backpressure) vs 400 (malformed/limit) vs 401/413 (auth/size).
For 400s, read the receiver’s reason: label length/count or out-of-order sample.
Check samples_pending and shards/shards_max for queue saturation.
Probe the endpoint directly to isolate auth and reachability.
Review queue_config and HA replica/external_labels.

The lasting fix pairs a correctly sized receiver limit with a trimmed write path and tuned queue — ship less, faster, within the receiver’s contract.
