Prometheus Error Guide: 'compaction failed' TSDB Block Corruption
Fix Prometheus 'compaction failed' errors: remove corrupt blocks, free disk space, recover from unclean shutdowns, and restore from snapshots without losing your TSDB.
Exact Error Message
compaction failed is a TSDB storage error. Prometheus periodically compacts the in-memory head and small persistent blocks into larger blocks under the data directory. When that process hits a corrupt chunk or a malformed block, or runs out of disk space mid-write, the compaction loop logs a failure and the source blocks stay uncompacted.
ts=2026-06-27T03:14:21.882Z caller=db.go:885 level=error component=tsdb msg="compaction failed" err="populate block: chunk 8 out of bounds"
You may also see any of these variants:
err="open block /var/lib/prometheus/data/01J9X...: invalid magic number"
err="found unsequential head chunks"
err="opening block /var/lib/prometheus/data/01J9X... failed: read meta.json: unexpected end of JSON input"
err="populate block: write chunks: write /var/lib/prometheus/data/.tmp/.../chunks/000001: no space left on device"
What the Error Means
Compaction is the background job that merges the head block (recent in-memory data plus the WAL) and existing on-disk blocks into fewer, larger immutable blocks. Each block lives in a ULID-named directory containing chunks/, an index, a meta.json, and a tombstones file.
compaction failed means the compactor could not read a source block or write the destination block. Prometheus keeps serving queries and ingesting samples, but the failing compaction is retried every cycle and never succeeds. The head block keeps growing, memory climbs, and disk usage rises because old blocks are never cleaned up. Left alone, a stuck compaction eventually causes an out-of-memory crash or a full disk.
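Because every block is expected to carry the same four entries, a quick structural check can flag an obviously damaged block before promtool is involved. The following is a minimal sketch, not a Prometheus tool: check_block_layout is a hypothetical helper name, and it only tests for presence and non-emptiness, so it cannot detect a bad magic number or a truncated-but-non-empty file.

```shell
# Hypothetical helper: print each expected entry that is missing from a
# block directory. Prints nothing when the layout looks complete.
check_block_layout() {
  cbl_dir=$1
  [ -d "$cbl_dir/chunks" ] || echo "chunks/"
  # -s is true only for files that exist and are non-empty, so a
  # meta.json or index truncated to zero bytes is reported as missing.
  for cbl_f in index meta.json; do
    [ -s "$cbl_dir/$cbl_f" ] || echo "$cbl_f"
  done
  # tombstones can be very small, so only require that it exists.
  [ -f "$cbl_dir/tombstones" ] || echo "tombstones"
}
```

Calling check_block_layout on a block directory and getting no output means the layout is complete; any printed name is a reason to inspect that block more closely.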
Common Causes
- A corrupt persistent block: a meta.json, index, or chunk file with a bad magic number or truncated content, usually after a crash or bad disk write.
- Disk filled mid-compaction: compaction writes the new block into data/.tmp/ before renaming it; if the disk fills, the partial block is left behind and corrupt.
- Process killed during compaction: an OOM kill or SIGKILL mid-write leaves an incomplete block or unsequential head chunks.
- mmap chunk corruption: chunks_head/ mmapped files damaged by an unclean shutdown trigger found unsequential head chunks.
- Underlying filesystem issues: bit rot, a failing disk, or an NFS/overlay filesystem that does not honor fsync ordering.
- Too many head series: extreme cardinality makes each compaction enormous and slow, increasing the window in which a crash leaves a half-written block.
- Bug surfacing after an unclean shutdown: replaying a damaged WAL into the head produces chunks that later fail to compact.
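The err= variants shown earlier map fairly directly onto these causes. As a rough triage sketch, the function below sorts an error string into one of three buckets. The function name and the bucket labels are made up for this example, and the matching is plain substring matching against the messages quoted in this guide, so unfamiliar wording falls through to unknown.

```shell
# Hypothetical triage helper: classify an err= string by substring.
# Bucket names (disk-full, head-chunks, corrupt-block) are illustrative.
classify_compaction_err() {
  case $1 in
    *"no space left on device"*)
      echo "disk-full" ;;
    *"unsequential head chunks"*)
      echo "head-chunks" ;;
    *"invalid magic number"*|*"unexpected end of JSON input"*|*"out of bounds"*)
      echo "corrupt-block" ;;
    *)
      echo "unknown" ;;
  esac
}
```

The disk-full pattern is checked first because a write that fails for lack of space can also mention a chunk path, and freeing space is the step that has to happen before anything else.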
How to Reproduce the Error
The cleanest reproduction is filling the disk during a compaction, or truncating a block file:
# Truncate a chunk file in an existing block to simulate corruption (TEST DATA ONLY)
BLOCK=$(ls -dt /var/lib/prometheus/data/01* | head -1)
truncate -s -1k "$BLOCK/chunks/000001"
# Restart so Prometheus reopens the damaged block; the next compaction that reads it fails
sudo systemctl restart prometheus
To reproduce the disk-full variant, point Prometheus at a small tmpfs and ingest a high-cardinality load until data/.tmp/ cannot be written.
Diagnostic Commands
Check the TSDB status and whether compactions are failing:
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data | {headStats, seriesCountByMetricName: .seriesCountByMetricName[0:3]}'
# Compaction failures climbing means a block is stuck
increase(prometheus_tsdb_compactions_failed_total[1h])
# Reloads/compactions not advancing while head grows
prometheus_tsdb_head_series
prometheus_tsdb_lowest_timestamp_seconds
Read the log to find the offending block ULID and the exact err=:
journalctl -u prometheus --no-pager | grep -iE 'compaction failed|out of bounds|unsequential|invalid magic|no space' | tail -20
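Block directory names are ULIDs: 26 characters from Crockford's base32 alphabet (digits and uppercase letters without I, L, O, and U). A small filter can pull the distinct ULIDs out of the matching log lines. This is a sketch under that assumption; extract_block_ulids is a hypothetical name, and any other 26-character run from the same alphabet would also match.

```shell
# Hypothetical helper: read log lines on stdin and print each distinct
# 26-character ULID found in them, one per line.
extract_block_ulids() {
  # LC_ALL=C keeps the bracket ranges strictly ASCII uppercase and digits.
  LC_ALL=C grep -oE '[0-9A-HJKMNP-TV-Z]{26}' | sort -u
}
```

Piping the journalctl output above into extract_block_ulids leaves a short list of candidate blocks to inspect.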
List the blocks and verify each one with promtool:
ls -lt /var/lib/prometheus/data/
for b in /var/lib/prometheus/data/01*; do
echo "== $b =="
promtool tsdb analyze /var/lib/prometheus/data "$(basename "$b")" 2>&1 | head -5
done
== /var/lib/prometheus/data/01J9X7YQ8 ==
Block ID: 01J9X7YQ8...
Duration: 2h0m0s
Series: 184302
Label names: 51
...
== /var/lib/prometheus/data/01J9X8FB2 ==
open block: read meta.json: unexpected end of JSON input
Check disk space (the disk-full variant is the most common):
df -h /var/lib/prometheus/data
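# Recent releases name in-progress blocks <ULID>.tmp-for-creation, so check both patterns
du -sh /var/lib/prometheus/data/.tmp /var/lib/prometheus/data/*.tmp-for-creation 2>/dev/null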
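For scripting, the same check can be reduced to a number and a threshold test. A minimal sketch, assuming a POSIX df and a mount point without spaces in its device name; both function names are hypothetical.

```shell
# Hypothetical helper: print the used-capacity percentage (no % sign)
# of the filesystem that holds the given path.
disk_used_pct() {
  # df -P prints one line per filesystem in a fixed column layout;
  # column 5 is the capacity, for example "42%".
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage is at or above the threshold.
disk_over_threshold() {
  [ "$(disk_used_pct "$1")" -ge "$2" ]
}
```

For example, disk_over_threshold /var/lib/prometheus/data 80 succeeds once the volume is at least 80% full, which is a reasonable point to stop and free space before touching any blocks.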
Step-by-Step Resolution
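1. Stop Prometheus and back up the data directory. Never delete blocks from a running server, and always keep a copy before surgery. If the data volume is already full, write the copy to a different volume instead of the path shown here.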
sudo systemctl stop prometheus
sudo cp -a /var/lib/prometheus/data /var/lib/prometheus/data.bak.$(date +%s)
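2. If the disk is full, free space first. Remove the leftover .tmp directory and any <ULID>.tmp-for-creation directories (the name recent releases give to in-progress blocks), then expand the volume or reduce retention.
sudo rm -rf /var/lib/prometheus/data/.tmp /var/lib/prometheus/data/*.tmp-for-creation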
df -h /var/lib/prometheus/data
3. Identify the corrupt block. The block whose ULID appears in the err= line, or the one that fails promtool tsdb analyze, is the culprit.
4. Move the corrupt block out of the data directory entirely. Renaming it in place is not enough, because Prometheus may still try to load anything left under data/.
sudo mkdir -p /var/lib/prometheus/corrupt-blocks
sudo mv /var/lib/prometheus/data/01J9X8FB2... /var/lib/prometheus/corrupt-blocks/
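When several blocks need to be set aside, wrapping the move in a guarded function avoids the two easy mistakes: a mistyped ULID and a quarantine directory that still sits inside data/. This is a sketch only; quarantine_block is a hypothetical helper, and it should be run with the same privileges as the mv above while Prometheus is stopped.

```shell
# Hypothetical helper: move one block out of the data directory.
# Usage: quarantine_block DATA_DIR BLOCK_ULID DEST_DIR
quarantine_block() {
  q_data_dir=$1 q_ulid=$2 q_dest=$3
  if [ ! -d "$q_data_dir/$q_ulid" ]; then
    echo "no such block: $q_data_dir/$q_ulid" >&2
    return 1
  fi
  # Refuse a destination at or below the data directory, since the
  # block must leave data/ entirely.
  case "$q_dest/" in
    "$q_data_dir"/*)
      echo "destination must be outside $q_data_dir" >&2
      return 1 ;;
  esac
  mkdir -p "$q_dest" && mv "$q_data_dir/$q_ulid" "$q_dest/"
}
```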
5. For found unsequential head chunks, clear the damaged head mmap files. The head will rebuild from the WAL on start; if the WAL itself is damaged, see the WAL-corruption guide below.
sudo mv /var/lib/prometheus/data/chunks_head /var/lib/prometheus/chunks_head.bak
6. Start Prometheus and let the head re-compact. Watch the log; the next compaction cycle should now succeed.
sudo systemctl start prometheus
journalctl -u prometheus -f | grep -iE 'compact|tsdb'
7. Confirm recovery. prometheus_tsdb_compactions_failed_total should stop increasing and the block count should shrink as compaction catches up.
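To watch the block count from the shell rather than from metrics, counting the directories that contain a meta.json is usually enough. A minimal sketch; count_blocks is a hypothetical name, and it assumes that only block directories carry a meta.json directly under the data directory.

```shell
# Hypothetical helper: count block directories under a data directory.
count_blocks() {
  cb_n=0
  for cb_d in "$1"/*/; do
    # wal/, chunks_head/, and temporary directories have no meta.json
    # at their top level, so they are not counted.
    if [ -f "${cb_d}meta.json" ]; then
      cb_n=$((cb_n + 1))
    fi
  done
  echo "$cb_n"
}
```

Running count_blocks /var/lib/prometheus/data every few minutes after the restart gives a simple trend: the number should fall as small blocks are merged.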
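If you cannot afford to lose the corrupt block’s data, restore that time range from a TSDB snapshot taken via the admin API (POST /api/v1/admin/tsdb/snapshot, available only when Prometheus runs with --web.enable-admin-api) or from object-storage backups. Either source helps only if it was captured before the block was damaged.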
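Taking a snapshot is a single POST request. The sketch below wraps it in two small functions; both names are hypothetical, the default base URL is the localhost address used elsewhere in this guide, and the call fails unless the admin API is enabled with --web.enable-admin-api.

```shell
# Hypothetical helper: build the snapshot endpoint from a base URL,
# tolerating a trailing slash on the base.
snapshot_endpoint() {
  echo "${1%/}/api/v1/admin/tsdb/snapshot"
}

# Hypothetical helper: request a snapshot and print the JSON response.
# -f makes curl return non-zero on HTTP errors such as a disabled admin API.
take_snapshot() {
  curl -fsS -XPOST "$(snapshot_endpoint "${1:-http://localhost:9090}")"
}
```

On success the response contains the snapshot name, and the snapshot itself is written under snapshots/ inside the data directory, so it needs to be copied elsewhere to serve as a backup.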
Prevention and Best Practices
- Provision the data volume with comfortable headroom (2–3x peak block size) and alert on increase(prometheus_tsdb_compactions_failed_total[1h]) > 0, since the raw counter never resets while the server is running.
- Take periodic snapshots with the admin TSDB snapshot API so you can restore a clean block instead of discarding data.
- Keep cardinality under control so each compaction is bounded; see the OOMKilled / high memory guide.
- Use a real local filesystem (ext4/xfs) that honors fsync; avoid NFS and thin-provisioned volumes that can silently fill.
- Shut down cleanly (systemctl stop, not kill -9) so the WAL and head chunks are flushed in order.
- Alert on disk usage at 80% so you never fill the volume mid-compaction.
Related Errors
- Prometheus ‘no space left on device’ / TSDB disk full — the disk-pressure variant that often triggers a failed compaction.
- Prometheus ‘opening storage failed’ WAL corruption — when the WAL itself is damaged and the head cannot replay.
- Prometheus OOMKilled / high memory — huge head blocks from cardinality make compactions fragile.
Frequently Asked Questions
Will I lose data if I delete the corrupt block? You lose only the samples inside that single block (typically a 2h or 24h window). Everything in other blocks and the current head is untouched. Move the block aside rather than rm it so you can attempt a snapshot restore later.
Can I repair a corrupt block instead of deleting it? There is no official promtool repair for a block. promtool tsdb analyze only inspects. If the index is intact but a chunk is bad, you can sometimes recover with a snapshot/restore, but in practice removing the block and accepting the gap is the reliable fix.
Why does compaction keep failing every few minutes? Compaction runs on a cycle. As long as the corrupt block or the unsequential head chunk is present, every cycle retries and fails. The error stops only once you remove the bad block or clear the damaged head.
Is found unsequential head chunks the same as a corrupt block? No. That message points at the in-memory head’s mmapped chunk files (chunks_head/), not a persisted block. Clearing chunks_head and replaying the WAL usually fixes it, whereas a corrupt persistent block must be removed from data/.
How do I prevent this after an OOM kill? Stop the OOM kills. High cardinality is the usual root cause; cap series with metric_relabel_configs, add memory headroom, and alert on prometheus_tsdb_head_series. A clean shutdown almost never leaves a failed compaction.