Prometheus Error Guide: 'replaying WAL' Slow Startup and Not-Ready Failure
Fix slow Prometheus 'replaying WAL' startup: stop the restart loop, switch a killing livenessProbe to a startupProbe, add memory headroom, and shrink the head.
Exact Error Message
There is no level=error line here, which is what makes this confusing. Prometheus is working; it is just slow to become ready. On startup it logs that it is replaying the write-ahead log and rebuilding the in-memory head:
level=info ts=2026-06-27T08:41:02.118Z caller=head.go:610 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2026-06-27T08:41:09.402Z caller=head.go:681 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=7.284s
level=info ts=2026-06-27T08:41:09.402Z caller=head.go:687 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2026-06-27T08:42:51.776Z caller=head.go:723 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=482
level=info ts=2026-06-27T08:44:30.011Z caller=head.go:723 component=tsdb msg="WAL segment loaded" segment=240 maxSegment=482
level=info ts=2026-06-27T08:48:55.402Z caller=head.go:723 component=tsdb msg="WAL segment loaded" segment=482 maxSegment=482
level=info ts=2026-06-27T08:48:55.733Z caller=head.go:760 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=12.3s wal_replay_duration=7m41s total_replay_duration=7m53s
level=info ts=2026-06-27T08:48:55.901Z caller=main.go:1063 msg="TSDB started"
The whole time those segments are loading, the readiness endpoint returns 503:
$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready
503
$ curl -s http://localhost:9090/-/ready
Service Unavailable
Prometheus is not ready to serve traffic.
In Kubernetes this frequently turns into a crash loop, because the probe kills the pod before the replay finishes:
Warning Unhealthy kubelet Liveness probe failed: HTTP probe failed with statuscode: 503
Normal Killing kubelet Container prometheus failed liveness probe, will be restarted
Last State: Terminated Reason: OOMKilled / Reason: Error
State: Waiting Reason: CrashLoopBackOff
What the Error Means
Prometheus keeps recent samples in an in-memory head block, durably backed by a write-ahead log (WAL) and memory-mapped head chunks on disk. The head itself is not persisted as a structure — on every start Prometheus reconstructs it by replaying the m-mapped chunks and then every WAL segment in order. Until that replay finishes, the head does not exist, so Prometheus cannot scrape, cannot answer queries, and reports /-/ready as 503.
This is entirely normal and not corruption. Replay duration scales with two things: the number of active series in the head and the total size of the WAL, which grows with ingestion rate and with the time since the last checkpoint truncated it. A small head replays in seconds; a head with millions of active series and a multi-gigabyte WAL can take many minutes. The maxSegment value in the log is your progress bar: segment=240 maxSegment=482 means you are roughly halfway through the segments.
It becomes an incident only when (a) the replay simply takes too long and downstream alerting/dashboards are blind, (b) a too-aggressive liveness or startup probe kills the process mid-replay, producing an infinite crash loop where each restart begins the same long replay from scratch, or (c) the process OOMs because rebuilding a large head needs more memory than the limit allows. This guide is about a healthy but slow replay; if the log instead shows opening storage failed with an invalid checksum or a named segment offset, that is genuine corruption — see opening storage failed / WAL corruption.
Common Causes
- Very large head — millions of active series. Replay cost is dominated by series count. High-cardinality scrape targets inflate the head and the WAL together.
- Huge WAL. The WAL accumulates segments between checkpoints and is truncated only after head compaction, so a busy Prometheus with many series carries a large WAL to replay.
- Slow disk. Network-attached or burst-credit storage (EBS gp2, throttled IOPS, NFS) reads WAL segments slowly, stretching replay from seconds to minutes.
- A killing livenessProbe (the #1 cause of replay crash loops). A liveness probe with a short initialDelaySeconds/failureThreshold sees the 503, declares the container unhealthy, and kills it before replay completes, so it restarts and replays from the beginning, forever. A startupProbe is the correct fix.
- Insufficient memory. The reconstructed head must fit in RAM; a large head against a tight memory limit OOMs during replay, also producing a crash loop.
- Frequent restarts compounding the problem. Every restart pays the full replay cost again; flapping nodes or rolling restarts can keep Prometheus perpetually not-ready.
How to Reproduce the Error
Build a large head, then restart and watch the replay block readiness:
# Run a high-cardinality load so the head holds many active series,
# leave Prometheus up long enough to accumulate WAL segments, then:
systemctl restart prometheus
# Poll readiness during startup — it stays 503 for the whole replay:
while true; do
  curl -s -o /dev/null -w '%{http_code} ' http://localhost:9090/-/ready
  sleep 2
done
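The loop above runs until you interrupt it. For scripts, a bounded variant may be handier. This is only a sketch: wait_until_ready is a hypothetical helper name, and the 900-second timeout and 5-second interval are arbitrary defaults, not recommendations from Prometheus.

```shell
# Sketch: poll /-/ready until it returns 200 or a timeout expires.
# Arguments (all optional): URL, timeout in seconds, poll interval in seconds.
wait_until_ready() {
  local url=${1:-http://localhost:9090/-/ready}
  local timeout=${2:-900}
  local interval=${3:-5}
  local waited=0 code=""
  while [ "$waited" -le "$timeout" ]; do
    # curl prints 000 when it cannot connect at all; "|| true" keeps set -e scripts alive
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
    if [ "$code" = "200" ]; then
      echo "ready after ~${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "still not ready after ${timeout}s (last status: ${code})" >&2
  return 1
}

# Usage (not run here):
#   wait_until_ready http://localhost:9090/-/ready 1200 5
```

The function only reads the endpoint, so it is safe to run while the replay is in progress.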
To reproduce the Kubernetes crash loop, set a livenessProbe on /-/ready with initialDelaySeconds: 30 and failureThreshold: 3 on a Prometheus whose replay takes minutes. With the default periodSeconds of 10, the kubelet kills it at ~60s, it restarts, and kubectl get pod cycles between Running and CrashLoopBackOff while kubectl describe pod shows Unhealthy events.
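The ~60s figure is plain probe arithmetic. As a rough sketch (liveness_kill_window is a hypothetical helper; it assumes the Kubernetes default periodSeconds of 10 when you do not set one, and it ignores timeoutSeconds and kubelet scheduling jitter), the restart lands somewhere in this window after container start:

```shell
# Sketch: approximate window, in seconds after container start, in which a
# continuously failing liveness probe reaches failureThreshold.
# Arguments: initialDelaySeconds periodSeconds failureThreshold
liveness_kill_window() {
  local initial_delay=$1 period=$2 failure_threshold=$3
  # Earliest: first probe fires right at initialDelaySeconds
  local earliest=$((initial_delay + (failure_threshold - 1) * period))
  # Latest: first probe fires almost one full period later
  local latest=$((initial_delay + failure_threshold * period))
  echo "${earliest}-${latest}s"
}

liveness_kill_window 30 10 3   # prints 50-60s
```

Any replay longer than the upper bound cannot survive that probe, no matter how healthy it is.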
Diagnostic Commands
All of the following are read-only. First, confirm it is replaying (503) and watch progress:
# 503 while replaying, 200 once the head is rebuilt
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready
# Progress: current segment vs maxSegment is your "% done"
journalctl -u prometheus -n 120 --no-pager | grep -iE 'replay|WAL segment'
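To turn those log lines into a percentage, something like the following could work. It is a sketch: replay_progress is a hypothetical helper name, and it assumes the logfmt layout shown in the log excerpt earlier, with segment= and maxSegment= as separate whitespace-delimited fields.

```shell
# Sketch: read Prometheus log text on stdin and report the most recent
# "WAL segment loaded" position as a percentage.
replay_progress() {
  awk '
    /WAL segment loaded/ {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^segment=/)    { split($i, a, "="); seg = a[2] }
        if ($i ~ /^maxSegment=/) { split($i, b, "="); max = b[2] }
      }
      seen = 1
    }
    END {
      if (!seen) { print "no WAL segment lines found"; exit 1 }
      # Segments are numbered from 0, so segment N means N+1 of max+1 are loaded
      printf "segment=%d maxSegment=%d progress=%d%%\n", seg, max, int((seg + 1) * 100 / (max + 1))
    }'
}

# Usage (not run here):
#   journalctl -u prometheus --no-pager | replay_progress
```

The percentage counts segments, not time; later segments can take longer to load than earlier ones.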
Size the WAL to estimate how long replay will take:
du -sh /var/lib/prometheus/data/wal
ls /var/lib/prometheus/data/wal | wc -l # segment count
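Note that ls | wc -l also counts any checkpoint.* directory that lives alongside the segments. A minimal sketch that counts only segment files, assuming segments are the regular files with all-digit names; wal_segment_count is a hypothetical helper name:

```shell
# Sketch: count WAL segment files (all-digit names) in a directory,
# skipping checkpoint.* directories and anything else.
wal_segment_count() {
  local dir=$1 n=0 f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue          # skip directories and unmatched globs
    case "${f##*/}" in
      *[!0-9]*) ;;                   # name contains a non-digit: not a segment
      *) n=$((n + 1)) ;;
    esac
  done
  echo "$n"
}

# Usage (not run here):
#   wal_segment_count /var/lib/prometheus/data/wal
```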
In Kubernetes, confirm a probe is the killer and check for OOM:
# Look for "Liveness probe failed", OOMKilled, and a climbing restart count
kubectl describe pod <prom-pod> | grep -iE 'liveness|oomkilled|restart|state|reason'
# The previous container's log shows it was mid-replay when killed
kubectl logs <prom-pod> --previous | grep -i replay
Once Prometheus is actually up, confirm the head size that is driving the cost:
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
{ "numSeries": 4821044, "numLabelPairs": 18233, "chunkCount": 9120388,
"minTime": 1719388800000, "maxTime": 1719475200000 }
Nearly five million active series explains a multi-minute replay — that is the real lever to pull.
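To see which metrics contribute most of those series, the same endpoint also returns a seriesCountByMetricName list. The following is a sketch that ranks it; top_series_metrics is a hypothetical helper name, and it needs jq, like the command above.

```shell
# Sketch: read /api/v1/status/tsdb JSON on stdin and print
# "<series count><TAB><metric name>", largest first.
# Optional argument: how many rows to print (default 10).
top_series_metrics() {
  jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"' \
    | sort -rn \
    | head -n "${1:-10}"
}

# Usage (not run here):
#   curl -s http://localhost:9090/api/v1/status/tsdb | top_series_metrics 5
```

The metrics at the top of that list are the first candidates for relabeling or dropping.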
Step-by-Step Resolution
1. Stop restarting it and let the replay finish. This is the single most important step. The replay is making forward progress; killing it throws away the work and starts over. Confirm progress by comparing the current segment with maxSegment in the log, then simply wait. Pause any rolling restart, node drain, or manual systemctl restart until you see WAL replay completed and TSDB started.
2. In Kubernetes, switch the killing livenessProbe to a startupProbe. A startup probe suspends the liveness probe until the container has started, giving replay all the time it needs without ever being killed. Give it a generous budget (failureThreshold * periodSeconds should exceed your worst-case replay):
startupProbe:
  httpGet:
    path: /-/ready
    port: 9090
  # 60 * 10s = 10 minutes of grace before liveness takes over
  periodSeconds: 10
  failureThreshold: 60
livenessProbe:
  httpGet:
    path: /-/healthy   # /-/healthy, not /-/ready: it's up even while replaying
    port: 9090
  periodSeconds: 15
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  periodSeconds: 10
  failureThreshold: 3
Note the split: liveness uses /-/healthy (the process is alive throughout replay), readiness uses /-/ready (so traffic and the Service endpoint wait correctly), and the startupProbe absorbs the long replay so liveness never fires early.
3. Give it memory headroom for replay. Rebuilding a large head transiently needs more RAM than steady state. Raise the container memory limit (or node memory) so the process does not OOM mid-replay; pair this with OOMKilled / high memory tuning if the kill is memory-driven rather than probe-driven.
4. Use faster disk. Move the data directory off burst-credit/network storage onto provisioned-IOPS or local NVMe. WAL replay is read-heavy and disk latency directly sets replay time.
5. Shrink the head — the real long-term fix. Replay cost is dominated by active series count. Drop high-cardinality labels with metric_relabel_configs, scrape fewer/cheaper targets, and keep the head from ballooning. A smaller head means a smaller WAL and a fast replay every time.
6. Tune queue sizing only if needed and avoid unnecessary restarts. The default --storage.tsdb.head-chunks-write-queue-size is usually fine; change it only with evidence. Most importantly, eliminate the restart churn (node flapping, OOMs, aggressive rollouts) so you rarely pay the replay cost at all.
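The startupProbe budget from step 2 is simple arithmetic: failureThreshold * periodSeconds has to exceed the worst-case replay. A sketch of that calculation follows; startup_failure_threshold is a hypothetical helper, and the 50% default safety margin is an assumption for illustration, not a Prometheus or Kubernetes recommendation.

```shell
# Sketch: smallest failureThreshold such that failureThreshold * periodSeconds
# covers the worst observed replay plus a safety margin.
# Arguments: worst replay in seconds, periodSeconds, margin in percent (default 50)
startup_failure_threshold() {
  local replay_seconds=$1 period=$2 margin_pct=${3:-50}
  local budget=$(( replay_seconds * (100 + margin_pct) / 100 ))
  echo $(( (budget + period - 1) / period ))   # ceiling division
}

# The 7m53s total replay from the log above is 473 seconds:
startup_failure_threshold 473 10      # prints 71
startup_failure_threshold 473 10 0    # prints 48 (no margin)
```

With no margin the minimum is 48, so the failureThreshold: 60 used in step 2 covers that replay with about two minutes to spare.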
Prevention and Best Practices
- Always use a startupProbe for Prometheus in Kubernetes, with failureThreshold * periodSeconds comfortably above your worst observed replay time. Point liveness at /-/healthy, not /-/ready.
- Control cardinality. Alert on prometheus_tsdb_head_series; a runaway series count is the early warning that replay (and memory) is about to hurt.
- Right-size memory with headroom above steady-state RSS so replay never OOMs.
- Use low-latency storage for the data directory; avoid burst-credit volumes for busy Prometheus servers.
- Minimize restarts and ship to a remote-write long-term store, so a slow local restart never means a blind window for alerting.
- Track replay duration over time (the total_replay_duration value logged on each start, which Prometheus also exposes as the prometheus_tsdb_data_replay_duration_seconds gauge) so a creeping startup time surfaces before it becomes an outage.
Related Errors
- opening storage failed / WAL corruption. A fatal level=error at startup naming an invalid checksum or a corrupt segment/block. That is truncation/corruption and Prometheus refuses to start; this guide is a healthy replay that is merely slow.
- OOMKilled / high memory. If replay dies with OOMKilled rather than a probe failure, the head is too large for the memory limit; raise the limit and reduce active series.
- Liveness probe failed: statuscode 503. The Kubernetes symptom of a too-aggressive liveness probe killing Prometheus mid-replay; fixed with the startupProbe above.
Frequently Asked Questions
Is a slow WAL replay the same as WAL corruption?
No. Corruption logs a fatal opening storage failed with an invalid checksum or named segment offset and the process exits non-zero. A slow replay logs info-level WAL segment loaded lines making steady progress toward maxSegment and eventually prints TSDB started. Slow replay is healthy; it just takes time.
Why is /-/ready returning 503 if Prometheus is running?
Because the in-memory head does not exist yet. Prometheus is alive (it answers /-/healthy) but cannot scrape or serve queries until the WAL replay rebuilds the head, so it reports not-ready by design.
My Prometheus is in CrashLoopBackOff during replay — why?
Almost always a livenessProbe killing it before replay finishes (it sees the 503), or an OOM because the head is too big for the memory limit. Switch to a startupProbe with a generous failureThreshold, point liveness at /-/healthy, and add memory headroom.
How can I make replay faster?
Reduce active series (the dominant factor), put the data directory on faster/provisioned-IOPS disk, and avoid unnecessary restarts so you rarely pay the cost. A smaller head produces a smaller WAL and a quick replay.
Should I delete the WAL to skip the slow replay?
No. Deleting the WAL discards all un-persisted head data (potentially hours of recent samples) and is the corruption-recovery path, not a startup speed-up. Let the replay finish; it is restoring real data.