Prometheus Error Guide: 'remote_write server returned HTTP status 500' Receiver Failure
Fix Prometheus remote_write 500 errors: the receiver (Mimir, Thanos Receive, Cortex) is broken — check ingesters, object storage, and proxy timeouts, not Prometheus.
Exact Error Message
A 500 Internal Server Error on the remote-write path is reported by the queue_manager component in the Prometheus log:
level=error ts=2026-06-27T09:41:18.204Z caller=queue_manager.go:1043 component=remote remote_name=mimir-prod url=https://mimir.example.com/api/v1/push msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 500 Internal Server Error: rpc error: code = Unavailable desc = ingester is unavailable"
Before that, you usually see retries because Prometheus treats 5xx as recoverable and backs off:
level=warn ts=2026-06-27T09:40:52.118Z caller=queue_manager.go:1126 component=remote remote_name=mimir-prod url=https://mimir.example.com/api/v1/push msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error"
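Prometheus retries with exponential backoff, waiting between min_backoff and max_backoff per attempt (current releases have no max_retries setting). Only when the receiver stays down so long that Prometheus gives up on a batch is the data dropped, which shows up as non-recoverable error and a rising prometheus_remote_storage_samples_failed_total. While retries pile up, the WAL and in-memory queue back up and prometheus_remote_storage_samples_pending climbs.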
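To make the backoff concrete, here is a minimal sketch that prints the wait before each retry. It assumes the delay starts at min_backoff, doubles after every failed attempt, and is capped at max_backoff. The helper name backoff_schedule is hypothetical, and the 250ms / 30s values come from the example queue_config later in this guide, not from Prometheus defaults.

```shell
#!/usr/bin/env bash
# backoff_schedule is a hypothetical helper, not part of Prometheus.
# Arguments: starting delay (ms), maximum delay (ms), number of retries.
# Assumption: the delay doubles after each failure and is capped at the maximum.
backoff_schedule() {
  local delay_ms="$1"
  local max_ms="$2"
  local retries="$3"
  local i=1
  while [ "$i" -le "$retries" ]; do
    echo "retry ${i}: wait ${delay_ms}ms"
    delay_ms=$((delay_ms * 2))
    if [ "$delay_ms" -gt "$max_ms" ]; then
      delay_ms="$max_ms"
    fi
    i=$((i + 1))
  done
}

# min_backoff=250ms, max_backoff=30s, first 9 retries
backoff_schedule 250 30000 9
```

Under these assumptions the delay reaches the 30s cap by the eighth retry, so a receiver that stays down is then probed roughly every 30 seconds per shard.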
What the Error Means
server returned HTTP status 500 Internal Server Error means the remote endpoint that Prometheus pushes to returned a 5xx — the failure is on the receiving side, not in your local Prometheus. The local Prometheus did its job: it serialised a batch of samples and POSTed them to the configured url. The 500 came back in the HTTP response from whatever lives at that URL.
That URL is almost never another vanilla Prometheus. It is a remote-write receiver: Grafana Mimir, Thanos Receive, Cortex, VictoriaMetrics, Grafana Cloud’s ingester, or a reverse proxy / auth gateway sitting in front of one of them. A 500 means that component (or something it depends on, like object storage) is broken or overloaded.
Crucially, this is not the same as a 429 / 400 remote-write rejection. A 429 is the receiver deliberately throttling you (back off, you’re sending too fast); a 400 is the receiver rejecting your data as malformed or out-of-limits. Those are client-side problems you fix on the sender. A 500 is the receiver saying “I broke” — you fix it on the receiver. Prometheus retries 5xx (recoverable) but does not retry most 4xx, which is why the operational response is completely different.
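The distinction above can be written down as a small triage helper. This is only a sketch: remote_write_action is a hypothetical function name, and the verdict strings are this guide's interpretation, not something Prometheus or the receiver prints.

```shell
#!/usr/bin/env bash
# remote_write_action is a hypothetical helper that maps the HTTP status
# returned by the receiver to the side that normally has to act on it.
remote_write_action() {
  case "$1" in
    2??) echo "ok: nothing to do" ;;
    429) echo "sender: receiver is throttling, send slower" ;;
    4??) echo "sender: data rejected, fix what is being sent" ;;
    5??) echo "receiver: broken or overloaded, fix the receiving side" ;;
    *)   echo "unknown: inspect the response manually" ;;
  esac
}

# Example: triage a handful of status codes.
for code in 204 400 429 500 503; do
  printf '%s -> %s\n' "$code" "$(remote_write_action "$code")"
done
```

The 429 branch is listed before the generic 4?? branch on purpose, because case takes the first matching pattern.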
Common Causes
- Receiver OOMKilled or crash-looping. A Mimir/Cortex ingester or Thanos Receive pod that is restarting returns 500 (or Unavailable) for in-flight pushes.
- Receiver TSDB / WAL problems. Corrupt or full local storage on the ingester, replay in progress, or a stuck compaction.
- Ingester replication failures. In a replicated write path, if too few ingester replicas accept the sample the distributor returns 500.
- Hash ring not ready. Right after a rollout, the ring is unhealthy or under-replicated and the distributor cannot route writes.
- Downstream object storage errors. S3/GCS/Azure Blob returning 5xx or timing out causes the receiver to fail flushes and propagate a 500.
- Reverse proxy returning 500 on upstream timeout. nginx/Envoy/Traefik in front of the receiver maps an upstream timeout or connection reset to a 500/502.
- TLS termination misconfig. A proxy terminating TLS incorrectly (bad cert chain, SNI mismatch) can surface as a 5xx to the client.
- Body too large hitting a receiver limit. A push exceeding the receiver’s max request size that the receiver mishandles as a 500 instead of a clean 413/400.
- Auth proxy bug. A custom auth/multi-tenant gateway throwing an unhandled exception returns 500 for otherwise valid writes.
How to Reproduce the Error
Point Prometheus at a receiver, then take the receiver down (or break its dependency) and watch the sender:
# Stop the receiver (example: a local Mimir/Thanos Receive container)
docker stop mimir-receive
# Watch Prometheus react to the now-failing endpoint
journalctl -u prometheus -f | grep -i remote
level=warn caller=queue_manager.go:1126 component=remote remote_name=mimir-prod msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error"
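You can also reproduce the proxy case by putting nginx in front with a 1s proxy_read_timeout and a deliberately slow upstream: the proxy answers with a 5xx of its own (nginx normally reports an upstream read timeout as 504 Gateway Timeout), and Prometheus logs the same "Failed to send batch, retrying" warning for as long as the proxy keeps failing.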
Diagnostic Commands
Confirm which URL is actually failing — read the configured remote-write targets:
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A6 remote_write
Tail the sender’s remote-write logs to confirm the 500 and which remote_name/url:
journalctl -u prometheus -n 50 --no-pager | grep -i remote
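Probe the receiver’s own health endpoints directly (read-only). This is the key step that proves the problem is downstream. The /-/ready and /-/healthy paths below are the ones Thanos and Prometheus expose; Grafana Mimir and Cortex serve readiness at /ready instead, so adjust the path to your receiver: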
curl -s -o /dev/null -w '%{http_code}\n' https://mimir.example.com/-/ready
curl -s -o /dev/null -w '%{http_code}\n' https://mimir.example.com/-/healthy
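If you probe several endpoints, a small wrapper keeps the output readable. The sketch below is an assumption-laden convenience, not a standard tool: probe and readiness_verdict are hypothetical names, the 5-second timeout is arbitrary, and the verdict text is an interpretation of the status code (curl prints 000 when it gets no HTTP response at all).

```shell
#!/usr/bin/env bash
# readiness_verdict is a hypothetical helper; each verdict is an
# interpretation of the status code, not something the receiver reports.
readiness_verdict() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "up but not ready" ;;
    000) echo "no HTTP response (DNS, TLS, network, or proxy down)" ;;
    5??) echo "error from receiver or proxy" ;;
    *)   echo "unexpected status, inspect manually" ;;
  esac
}

# probe URL: print the status code and a verdict for one endpoint.
# --max-time keeps a hung endpoint from blocking the check; "|| true"
# keeps the script going when curl exits non-zero on a failed connection.
probe() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || true
  echo "$1 -> ${code} ($(readiness_verdict "$code"))"
}

# Usage (uncomment and adjust the path to your receiver):
# probe https://mimir.example.com/-/ready
# probe https://mimir.example.com/-/healthy
```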
Send a minimal probe to the receiver’s push endpoint to confirm it returns 500 (read-only — an empty/garbage body is rejected, but the status code tells you if the endpoint is up):
curl -i -s -X POST https://mimir.example.com/api/v1/push \
-H 'Content-Type: application/x-protobuf' \
-H 'X-Scope-OrgID: prod' \
--data-binary @/dev/null | head -1
Then inspect the receiver’s logs (the actual root cause lives here, not in Prometheus):
kubectl logs -l app.kubernetes.io/component=ingester -n mimir --tail=100 | grep -iE 'error|panic|oom|ring|s3'
journalctl -u thanos-receive -n 100 --no-pager | grep -iE 'error|level=error'
Watch the sender-side metrics that quantify the damage:
rate(prometheus_remote_storage_samples_failed_total[5m])
prometheus_remote_storage_samples_pending
prometheus_remote_storage_shards
rate(prometheus_remote_storage_retried_samples_total[5m])
A rising retried_samples_total with stable failed_total means retries are absorbing a transient receiver blip; a rising failed_total with climbing samples_pending means the receiver is down long enough that data is being dropped and the queue is backing up.
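The reading of those metrics can be sketched as a tiny decision helper. Everything here is illustrative: remote_write_verdict is a hypothetical name, the inputs are integers you would read off the queries above (failed samples/s, retried samples/s, and the change in samples_pending over the same window), and "stable" is simplified to "zero".

```shell
#!/usr/bin/env bash
# remote_write_verdict is a hypothetical helper.
# Arguments (integers):
#   $1 failed samples per second
#   $2 retried samples per second
#   $3 change in prometheus_remote_storage_samples_pending over the window
remote_write_verdict() {
  local failed="$1"
  local retried="$2"
  local pending_delta="$3"
  if [ "$failed" -gt 0 ] && [ "$pending_delta" -gt 0 ]; then
    echo "receiver down: samples are being dropped and the queue is backing up"
  elif [ "$failed" -gt 0 ]; then
    echo "samples are failing: check the receiver even though the queue is not growing"
  elif [ "$retried" -gt 0 ]; then
    echo "transient blip: retries are absorbing the failures"
  else
    echo "healthy: no failed or retried samples"
  fi
}

# Made-up readings for illustration only.
remote_write_verdict 0 1200 0
remote_write_verdict 350 1200 40000
```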
Step-by-Step Resolution
1. Confirm the failure is downstream, then fix the receiver — not the sender. The 500 originates at the url. Resist the urge to touch Prometheus’s queue config first; that does not fix a broken receiver and can mask the real problem.
2. Check receiver health and restart/recover crash-looping instances. If ingesters are OOMKilled or restarting, give them more memory and bring the ring back to a healthy, fully-replicated state:
kubectl get pods -n mimir | grep -E 'ingester|distributor'
kubectl describe pod <ingester-pod> -n mimir | grep -iE 'reason|oomkilled|restart'
3. Check downstream object storage. If the receiver’s logs show S3/GCS errors or timeouts, the 500 is really an object-store failure. Verify credentials, bucket reachability, and provider status; flushes cannot complete without it.
4. Scale ingesters / verify the hash ring. Under-replicated or recently-rolled rings return 500 until enough replicas are healthy. Confirm the ring page reports all members ACTIVE and scale out if writes outpace ingester capacity.
5. Fix proxy timeouts and body limits. If a reverse proxy sits in front, raise proxy_read_timeout/upstream timeouts and the max body size (e.g. client_max_body_size in nginx, or -server.grpc-max-recv-msg-size-bytes on Mimir/Cortex for gRPC) so a slow-but-healthy receiver isn’t mapped to a 500. Re-check TLS chain/SNI if the proxy terminates TLS.
6. Only as a secondary step, tune the sender’s queue. Once the receiver is healthy, smooth delivery so a future blip doesn’t drop data. Reduce batch size or grow buffering — but treat this as resilience, not a fix:
remote_write:
  - url: https://mimir.example.com/api/v1/push
    queue_config:
      capacity: 10000
      max_samples_per_send: 2000
      max_shards: 50
      min_backoff: 250ms
      max_backoff: 30s
Reload with curl -X POST http://localhost:9090/-/reload. Larger capacity and max_backoff let the WAL ride out a brief receiver outage without dropping samples.
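As a rough sizing aid for the queue settings above, the sketch below estimates how many seconds of samples the in-memory queue can hold. It assumes capacity applies per shard and that all max_shards shards are in use. The helper name queue_buffer_seconds and the 100,000 samples/s send rate are hypothetical; substitute your own rate.

```shell
#!/usr/bin/env bash
# queue_buffer_seconds is a hypothetical helper.
# Arguments: capacity per shard, number of shards, samples sent per second.
# Prints the whole seconds of samples the in-memory queue can hold.
queue_buffer_seconds() {
  local capacity="$1"
  local shards="$2"
  local rate="$3"
  echo $(( capacity * shards / rate ))
}

# capacity=10000 and max_shards=50 from the config above;
# 100000 samples/s is an assumed send rate.
queue_buffer_seconds 10000 50 100000
```

With these numbers the in-memory queue covers only a few seconds of traffic, so riding out a longer outage depends on the WAL and on the receiver recovering.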
Prevention and Best Practices
- Alert on the receiver, not just the sender. Page on rate(prometheus_remote_storage_samples_failed_total[5m]) > 0 and on the receiver’s own error rate / readiness so you catch it before data is lost.
- Give ingesters resource headroom and PodDisruptionBudgets so rollouts and node drains don’t drop the ring below quorum.
- Monitor downstream object storage (S3/GCS error rate and latency) as a first-class dependency of your metrics pipeline.
- Set generous proxy timeouts and body limits in front of the receiver, and prefer clean 413/429 responses over upstream-timeout 500s.
- Size the remote-write queue (capacity, max_backoff) to survive your expected worst-case receiver outage without exhausting the WAL.
- Watch context deadline exceeded alongside 500s: a slow receiver often produces both before it fully fails.
Related Errors
- remote_write server returned HTTP status 429 / 400: 429 is the receiver throttling you (send slower); 400 is the receiver rejecting malformed or out-of-limit data. Both are sender-side fixes, unlike a 500.
- context deadline exceeded: the receiver accepted the connection but didn’t respond before the timeout; a frequent precursor to, or sibling of, a 500 from a slow/overloaded receiver.
- server returned HTTP status 503 Service Unavailable: a sibling 5xx, also retried, usually meaning the receiver is up but explicitly shedding load or not yet ready.
Frequently Asked Questions
Does a 500 mean my Prometheus is broken?
No. The 500 is the HTTP status returned to Prometheus by the remote-write endpoint. Your local Prometheus is functioning; the receiver (Mimir, Thanos Receive, Cortex, VictoriaMetrics, Grafana Cloud, or a proxy in front of one) is the thing that failed.
Will Prometheus retry a 500, or is the data lost immediately?
Prometheus treats 5xx as recoverable and retries with backoff up to your queue limits. Data is only dropped after retries are exhausted, at which point you’ll see non-recoverable error and prometheus_remote_storage_samples_failed_total increase.
How is a 500 different from a 429 or 400?
A 429 means slow down (throttling) and a 400 means bad/oversized data — both are fixed on the sender. A 500 means the receiver itself is broken, so you fix the receiver. See the 429/400 guide.
Why do retries make the WAL grow?
While batches are failing and being retried, unsent samples accumulate in the in-memory queue (prometheus_remote_storage_samples_pending) and the WAL is held longer to back them. If the receiver stays down past your retention/queue capacity, samples are dropped and disk usage climbs until delivery resumes.