AI for Prometheus & Monitoring · By James Joyner IV · 9 min read

Prometheus Error Guide: 'replaying WAL' Slow Startup and Not-Ready Failure

Fix slow Prometheus 'replaying WAL' startup: stop the restart loop, switch a killing livenessProbe to a startupProbe, add memory headroom, and shrink the head.

  • #prometheus-monitoring
  • #troubleshooting
  • #errors
  • #tsdb

Exact Error Message

There is no level=error line here — that is what makes this confusing. Prometheus is working, it is just slow to become ready. On startup it logs that it is replaying the write-ahead log and rebuilding the in-memory head:

level=info ts=2026-06-27T08:41:02.118Z caller=head.go:610 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2026-06-27T08:41:09.402Z caller=head.go:681 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=7.284s
level=info ts=2026-06-27T08:41:09.402Z caller=head.go:687 component=tsdb msg="Replaying WAL, this may take a while"
level=info ts=2026-06-27T08:42:51.776Z caller=head.go:723 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=482
level=info ts=2026-06-27T08:44:30.011Z caller=head.go:723 component=tsdb msg="WAL segment loaded" segment=240 maxSegment=482
level=info ts=2026-06-27T08:48:55.402Z caller=head.go:723 component=tsdb msg="WAL segment loaded" segment=482 maxSegment=482
level=info ts=2026-06-27T08:48:55.733Z caller=head.go:760 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=12.3s wal_replay_duration=7m41s total_replay_duration=7m53s
level=info ts=2026-06-27T08:48:55.901Z caller=main.go:1063 msg="TSDB started"

The whole time those segments are loading, the readiness endpoint returns 503:

$ curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready
503

$ curl -s http://localhost:9090/-/ready
Service Unavailable
Prometheus is not ready to serve traffic.

In Kubernetes this frequently turns into a crash loop, because the probe kills the pod before the replay finishes:

Warning  Unhealthy  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
Normal   Killing    kubelet  Container prometheus failed liveness probe, will be restarted
Last State:  Terminated  Reason: OOMKilled  /  Reason: Error
State:       Waiting     Reason: CrashLoopBackOff

What the Error Means

Prometheus keeps recent samples in an in-memory head block, durably backed by a write-ahead log (WAL) and memory-mapped head chunks on disk. The head itself is not persisted as a structure — on every start Prometheus reconstructs it by replaying the m-mapped chunks and then every WAL segment in order. Until that replay finishes, the head does not exist, so Prometheus cannot scrape, cannot answer queries, and reports /-/ready as 503.

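This is entirely normal and not corruption. Replay duration scales with two things: the number of active series in the head and the total size of the WAL. The WAL is checkpointed and truncated periodically, so its size tracks ingestion rate and series churn over the last few hours rather than growing indefinitely with uptime. A small head replays in seconds; a head with millions of active series and a multi-gigabyte WAL can take many minutes. The maxSegment value in the log is your progress bar: segment=240 maxSegment=482 means you are roughly halfway by segment count, though not every segment takes the same time to load.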
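
The progress-bar reading can be scripted. The following is a minimal sketch, not part of Prometheus: replay_progress is a hypothetical helper name, and it assumes the log format shown in the sample above. It counts from the first segment logged in the most recent replay, on the assumption that the first replayed segment is not always 0 once checkpoints have run.

```shell
# Print replay progress from Prometheus log lines on stdin.
# Only the most recent replay is counted: a "Replaying WAL" line resets state.
replay_progress() {
  awk '
    /Replaying WAL/ { found = 0 }
    /WAL segment loaded/ {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^segment=/)    { split($i, s, "="); seg = s[2] + 0 }
        if ($i ~ /^maxSegment=/) { split($i, m, "="); max = m[2] + 0 }
      }
      # Remember the first segment of this replay as the starting point.
      if (!found) { first = seg; found = 1 }
    }
    END {
      if (!found) { print "no WAL segment lines seen"; exit 1 }
      n_done = seg - first + 1
      n_total = max - first + 1
      printf "%d/%d segments (%.0f%%)\n", n_done, n_total, 100 * n_done / n_total
    }'
}

# Usage (systemd install, as in this guide):
#   journalctl -u prometheus --no-pager | replay_progress
```

Treat the percentage as a rough guide to segment count only; it says nothing about how long the remaining segments will take.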

It becomes an incident only when (a) the replay simply takes too long and downstream alerting/dashboards are blind, (b) a too-aggressive liveness or startup probe kills the process mid-replay, producing an infinite crash loop where each restart begins the same long replay from scratch, or (c) the process OOMs because rebuilding a large head needs more memory than the limit allows. This guide is about a healthy but slow replay; if the log instead shows opening storage failed with an invalid checksum or a named segment offset, that is genuine corruption — see opening storage failed / WAL corruption.

Common Causes

  • Very large head — millions of active series. Replay cost is dominated by series count. High-cardinality scrape targets inflate the head and the WAL together.
  • Huge WAL from heavy ingestion or series churn. The WAL accumulates segments between checkpoints; a Prometheus that ingests many samples across many series carries a large WAL to replay, regardless of how long it has been up.
  • Slow disk. Network-attached or burst-credit storage (EBS gp2, throttled IOPS, NFS) reads WAL segments slowly, stretching replay from seconds to minutes.
  • A killing livenessProbe (the #1 cause of replay crash loops). A liveness probe with a short initialDelaySeconds / failureThreshold sees the 503, declares the container unhealthy, and kills it before replay completes — so it restarts and replays from the beginning, forever. A startupProbe is the correct fix.
  • Insufficient memory. The reconstructed head must fit in RAM; a large head against a tight memory limit OOMs during replay, also producing a crash loop.
  • Frequent restarts compounding the problem. Every restart pays the full replay cost again; flapping nodes or rolling restarts can keep Prometheus perpetually not-ready.

How to Reproduce the Error

Build a large head, then restart and watch the replay block readiness:

# Run a high-cardinality load so the head holds many active series,
# leave Prometheus up long enough to accumulate WAL segments, then:
systemctl restart prometheus

# Poll readiness during startup — it stays 503 for the whole replay:
while true; do
  curl -s -o /dev/null -w '%{http_code} ' http://localhost:9090/-/ready
  sleep 2
done

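To reproduce the Kubernetes crash loop, set a livenessProbe on /-/ready with initialDelaySeconds: 30 and failureThreshold: 3 on a Prometheus whose replay takes minutes. With the default periodSeconds of 10, the kubelet kills the container within about a minute of start. It then restarts, and kubectl get pod shows the restart count climbing while the status alternates between Running and CrashLoopBackOff; kubectl describe pod shows the matching Unhealthy and Killing events.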

Diagnostic Commands

All of the following are read-only. First, confirm it is replaying (503) and watch progress:

# 503 while replaying, 200 once the head is rebuilt
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready

# Progress: current segment vs maxSegment is your "% done"
journalctl -u prometheus -n 120 --no-pager | grep -iE 'replay|WAL segment'

Size the WAL to estimate how long replay will take:

du -sh /var/lib/prometheus/data/wal
ls /var/lib/prometheus/data/wal | wc -l   # segment count

In Kubernetes, confirm a probe is the killer and check for OOM:

# Look for "Liveness probe failed", OOMKilled, and a climbing restart count
kubectl describe pod <prom-pod> | grep -iE 'liveness|oomkilled|restart|state|reason'

# The previous container's log shows it was mid-replay when killed
kubectl logs <prom-pod> --previous | grep -i replay

Once Prometheus is actually up, confirm the head size that is driving the cost:

curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
{ "numSeries": 4821044, "numLabelPairs": 18233, "chunkCount": 9120388,
  "minTime": 1719388800000, "maxTime": 1719475200000 }

Nearly five million active series explains a multi-minute replay — that is the real lever to pull.
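
To see which metrics account for that series count, the same endpoint also reports per-metric series counts. A small sketch, assuming jq is installed (this guide already uses it); top_series_metrics is a hypothetical helper name, and the response lists only the top entries rather than every metric.

```shell
# Print "series<TAB>metric" for the metrics holding the most head series.
# Reads the JSON body of /api/v1/status/tsdb on stdin.
top_series_metrics() {
  jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"'
}

# Usage:
#   curl -s http://localhost:9090/api/v1/status/tsdb | top_series_metrics
```

The metrics at the top of this list are the first candidates for dropping or relabeling when you shrink the head.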

Step-by-Step Resolution

1. Stop restarting it and let the replay finish. This is the single most important step. The replay is making forward progress; killing it throws away the work and starts over. Confirm progress by comparing the current segment with maxSegment in the log, then simply wait. Pause any rolling restart, node drain, or manual systemctl restart until you see WAL replay completed and TSDB started.
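
If you would rather script the wait than watch the log, a loop like the following can block until readiness flips. It is a sketch under assumptions: wait_ready is a hypothetical helper, the default URL matches the examples in this guide, and the 900-second default timeout is arbitrary, so set it above your own worst-case replay.

```shell
# Block until /-/ready returns 200, or give up after a timeout.
# Usage: wait_ready [url] [timeout_seconds] [interval_seconds]
wait_ready() {
  url=${1:-http://localhost:9090/-/ready}
  timeout=${2:-900}
  interval=${3:-5}
  waited=0
  while :; do
    # curl prints 000 if the port is not accepting connections yet.
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$code" = "200" ]; then
      echo "ready after ~${waited}s"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "still not ready after ${waited}s (last status: ${code})" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Putting a call such as wait_ready "" 1200 in front of any follow-up automation keeps that automation from acting on a Prometheus that is still replaying.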

2. In Kubernetes, switch the killing livenessProbe to a startupProbe. While a startup probe is still running, the kubelet holds off the liveness and readiness probes, so liveness cannot kill the container mid-replay. The startup probe has its own budget, though: if failureThreshold * periodSeconds elapses without a success, the container is restarted anyway, so set that product comfortably above your worst-case replay:

startupProbe:
  httpGet:
    path: /-/ready
    port: 9090
  # 60 * 10s = 10 minutes of grace before liveness takes over
  periodSeconds: 10
  failureThreshold: 60
livenessProbe:
  httpGet:
    path: /-/healthy   # /-/healthy, not /-/ready — it's up even while replaying
    port: 9090
  periodSeconds: 15
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /-/ready
    port: 9090
  periodSeconds: 10
  failureThreshold: 3

Note the split: liveness uses /-/healthy (the process is alive throughout replay), readiness uses /-/ready (so traffic and the Service endpoint wait correctly), and the startupProbe absorbs the long replay so liveness never fires early.
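
To size failureThreshold from a measured replay rather than guessing, the arithmetic is a ceiling division with a safety margin. The helper below is a hypothetical sketch; the 50% default margin is an assumption, not a Prometheus or Kubernetes recommendation.

```shell
# Smallest failureThreshold such that failureThreshold * periodSeconds
# covers the worst observed replay plus a safety margin.
# Usage: startup_failure_threshold <worst_replay_seconds> [period_seconds] [margin_percent]
startup_failure_threshold() {
  worst=$1
  period=${2:-10}
  margin=${3:-50}
  # Integer arithmetic: add the margin, then round up to whole probe periods.
  budget=$(( worst * (100 + margin) / 100 ))
  echo $(( (budget + period - 1) / period ))
}

# The sample log reported total_replay_duration=7m53s, i.e. 473 seconds:
startup_failure_threshold 473 10 50   # prints 71
```

Re-run the calculation whenever the replay time reported in the log grows, since a budget that was generous last quarter may no longer cover it.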

3. Give it memory headroom for replay. Rebuilding a large head transiently needs more RAM than steady state. Raise the container memory limit (or node memory) so the process does not OOM mid-replay; pair this with OOMKilled / high memory tuning if the kill is memory-driven rather than probe-driven.

4. Use faster disk. Move the data directory off burst-credit/network storage onto provisioned-IOPS or local NVMe. WAL replay is read-heavy and disk latency directly sets replay time.

5. Shrink the head — the real long-term fix. Replay cost is dominated by active series count. Drop high-cardinality labels with metric_relabel_configs, scrape fewer/cheaper targets, and keep the head from ballooning. A smaller head means a smaller WAL and a fast replay every time.

6. Tune queue sizing only if needed and avoid unnecessary restarts. The default --storage.tsdb.head-chunks-write-queue-size is usually fine; change it only with evidence. Most importantly, eliminate the restart churn (node flapping, OOMs, aggressive rollouts) so you rarely pay the replay cost at all.

Prevention and Best Practices

  • Always use a startupProbe for Prometheus in Kubernetes, with failureThreshold * periodSeconds comfortably above your worst observed replay time. Point liveness at /-/healthy, not /-/ready.
  • Control cardinality. Alert on prometheus_tsdb_head_series; a runaway series count is the early warning that replay (and memory) is about to hurt.
  • Right-size memory with headroom above steady-state RSS so replay never OOMs.
  • Use low-latency storage for the data directory; avoid burst-credit volumes for busy Prometheus servers.
  • Minimize restarts and ship to a remote-write long-term store, so a slow local restart never means a blind window for alerting.
  • Track replay duration over time so a creeping startup time surfaces before it becomes an outage. The WAL replay completed log line reports total_replay_duration on each start, and the prometheus_tsdb_data_replay_duration_seconds gauge exposes the figure as a metric.

Related Errors

  • opening storage failed / WAL corruption: a fatal level=error at startup naming an invalid checksum or a corrupt segment/block. That is truncation or corruption and Prometheus refuses to start; this guide covers a healthy replay that is merely slow.
  • OOMKilled / high memory: if replay dies with OOMKilled rather than a probe failure, the head is too large for the memory limit; raise the limit and reduce active series.
  • Liveness probe failed: statuscode 503: the Kubernetes symptom of a too-aggressive liveness probe killing Prometheus mid-replay; fixed with the startupProbe above.

Frequently Asked Questions

Is a slow WAL replay the same as WAL corruption? No. Corruption logs a fatal opening storage failed with an invalid checksum or named segment offset and the process exits non-zero. A slow replay logs info-level WAL segment loaded lines making steady progress toward maxSegment and eventually prints TSDB started. Slow replay is healthy; it just takes time.

Why is /-/ready returning 503 if Prometheus is running? Because the in-memory head does not exist yet. Prometheus is alive (it answers /-/healthy) but cannot scrape or serve queries until the WAL replay rebuilds the head, so it reports not-ready by design.

My Prometheus is in CrashLoopBackOff during replay — why? Almost always a livenessProbe killing it before replay finishes (it sees the 503), or an OOM because the head is too big for the memory limit. Switch to a startupProbe with a generous failureThreshold, point liveness at /-/healthy, and add memory headroom.

How can I make replay faster? Reduce active series (the dominant factor), put the data directory on faster/provisioned-IOPS disk, and avoid unnecessary restarts so you rarely pay the cost. A smaller head produces a smaller WAL and a quick replay.

Should I delete the WAL to skip the slow replay? No. Deleting the WAL discards all un-persisted head data (potentially hours of recent samples) and is the corruption-recovery path, not a startup speed-up. Let the replay finish; it is restoring real data.
