Prometheus Error Guide: 'no space left on device' TSDB Disk Full
Fix Prometheus 'no space left on device' TSDB errors: set retention size and time caps, free the data dir, cut cardinality, grow the disk, and offload long-term to remote write.
- #prometheus-monitoring
- #troubleshooting
- #errors
- #tsdb
Exact Error Message
A full TSDB data directory surfaces as no space left on device on the write path, compaction, or WAL append:
ts=2026-06-27T11:58:03.661Z caller=db.go:933 level=error component=tsdb msg="compaction failed" err="write /var/lib/prometheus/data/01J...XYZ/chunks/000001: no space left on device"
level=error msg="Error on ingesting samples" err="write /var/lib/prometheus/data/wal/00004210: no space left on device"
level=error caller=db.go:885 component=tsdb msg="WAL checkpoint failed" err="no space left on device"
When this happens, head compaction stops, the WAL grows unbounded, and eventually Prometheus cannot ingest at all — it may also fail to restart cleanly.
What the Error Means
Prometheus writes everything under one data directory, set by --storage.tsdb.path (the flag defaults to data/ relative to the working directory; the official Docker image uses /prometheus, and many installs use /var/lib/prometheus/data): the WAL, the in-memory head flushed to chunk files, and immutable persistent blocks produced by compaction. no space left on device means the filesystem backing that directory has no free space (or no free inodes) to write the next chunk, WAL segment, or compacted block.
Unlike the ingestion-level errors (out of order, out of bounds, duplicate sample), this is an OS-level write failure: the TSDB is healthy in principle, it simply has nowhere to put bytes. It is also distinct from WAL corruption — there the WAL files are damaged; here they are fine but the volume is full.
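Because the same error covers both exhausted blocks and exhausted inodes, it can help to check the two together. The following is a minimal Python sketch, assuming a Unix host (os.statvfs is not available on Windows); the function names and the 1% threshold are illustrative choices, not anything Prometheus itself defines.

```python
import os


def volume_headroom(path):
    """Return the free-space and free-inode fractions for the filesystem holding path."""
    st = os.statvfs(path)
    # f_bavail / f_favail count what an unprivileged process can still use.
    space_free = st.f_bavail / st.f_blocks if st.f_blocks else 1.0
    # Some filesystems report 0 total inodes; treat that as "not inode-limited".
    inodes_free = st.f_favail / st.f_files if st.f_files else 1.0
    return {"space_free": space_free, "inodes_free": inodes_free}


def is_exhausted(headroom, threshold=0.01):
    """True when either space or inodes fall below the (hypothetical) threshold."""
    return headroom["space_free"] < threshold or headroom["inodes_free"] < threshold
```

Calling volume_headroom("/var/lib/prometheus/data") and passing the result to is_exhausted gives a single yes/no answer that covers both failure modes.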
Common Causes
- Retention too long with no size cap. --storage.tsdb.retention.time set high (or left at the 15d default) on a volume too small for the actual ingestion rate, and --storage.tsdb.retention.size unset.
- Cardinality growth. A new label (pod, request ID, customer ID) explodes series count; head and blocks grow far faster than planned.
- WAL growth. Stalled compaction (often itself caused by an earlier disk-full event) leaves the WAL unable to checkpoint, so it keeps growing.
- No size-based retention. Only time retention is set, so a spike in ingestion fills the disk before the time window expires.
- Other consumers on the same volume. Logs, snapshots (/api/v1/admin/tsdb/snapshot), or another service sharing the partition.
- Snapshots left behind. data/snapshots/ directories from admin snapshot calls never cleaned up.
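To see why a long retention window on a small volume ends in a full disk, a rough capacity estimate helps. The sketch below uses the rule of thumb from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample). The function name and the example ingestion rate are assumptions for illustration, and the result covers persisted blocks only, not the WAL or compaction scratch space.

```python
def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough block-storage estimate: retention window x ingestion rate x bytes per sample."""
    retention_seconds = retention_days * 86_400
    return int(retention_seconds * samples_per_second * bytes_per_sample)


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes for display."""
    return num_bytes / 1e9
```

For example, 15 days at a hypothetical 100,000 samples per second and 2 bytes per sample comes to about 259 GB, well beyond a 100 GB volume. The real ingestion rate can be read from rate(prometheus_tsdb_head_samples_appended_total[5m]).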
How to Reproduce the Error
On a small test volume, ingest aggressively with a long retention and no size cap:
# Tiny tmpfs to simulate a small data volume
sudo mkdir -p /tmp/promdata
sudo mount -t tmpfs -o size=64m,mode=1777 tmpfs /tmp/promdata
# Started by hand (not via systemd), so capture logs in a file rather than journald
prometheus --storage.tsdb.path=/tmp/promdata \
  --storage.tsdb.retention.time=90d \
  --config.file=/etc/prometheus/prometheus.yml > /tmp/prometheus.log 2>&1 &
# Generate high-cardinality load against it, then watch:
tail -f /tmp/prometheus.log | grep -i 'no space left'
level=error component=tsdb msg="compaction failed" err="... no space left on device"
The 64 MB tmpfs fills within minutes and reproduces the exact failure.
Diagnostic Commands
Confirm the data volume is actually full (space and inodes):
df -h /var/lib/prometheus/data
df -i /var/lib/prometheus/data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme1n1 100G 100G 12K 100% /var/lib/prometheus
See where the space went — blocks vs WAL vs snapshots:
du -sh /var/lib/prometheus/data/wal
du -sh /var/lib/prometheus/data/snapshots 2>/dev/null
du -sh /var/lib/prometheus/data/* | sort -h | tail
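The same breakdown can be scripted when a single summary is more convenient than several du calls. This is a minimal sketch, assuming the usual data directory layout (wal/, chunks_head/, snapshots/, and one directory per block named by a 26-character ULID); the function names are illustrative. It sums apparent file sizes and does not de-duplicate hard links, so files shared between a snapshot and a live block are counted under both.

```python
import os
import re
from collections import defaultdict

# ULIDs use Crockford base32: digits and upper-case letters except I, L, O, U.
ULID_RE = re.compile(r"^[0-9A-HJKMNP-TV-Z]{26}$")


def classify(name):
    """Map a top-level entry of the data dir to a coarse category."""
    if name in ("wal", "chunks_head", "snapshots"):
        return name
    if ULID_RE.match(name):
        return "blocks"
    return "other"


def dir_size(path):
    """Sum apparent file sizes below path, skipping files that vanish mid-walk."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for fname in files:
            try:
                total += os.lstat(os.path.join(root, fname)).st_size
            except OSError:
                pass
    return total


def usage_by_category(data_dir):
    """Return {category: bytes} for the top-level entries of data_dir."""
    totals = defaultdict(int)
    for entry in os.scandir(data_dir):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
        else:
            size = entry.stat(follow_symlinks=False).st_size
        totals[classify(entry.name)] += size
    return dict(totals)
```

A large "wal" share points at stalled checkpointing, a large "snapshots" share at leftover admin snapshots, and a large "blocks" share at retention or cardinality.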
Read TSDB head and block stats from the API:
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
{ "numSeries": 4821330, "chunkCount": 19204110, "minTime": ..., "maxTime": ... }
Check on-disk block size and series count via metrics:
prometheus_tsdb_storage_blocks_bytes
prometheus_tsdb_head_series
topk(10, count by (__name__) ({__name__=~".+"}))
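The /api/v1/status/tsdb response used above also carries a per-metric series count, which avoids running an expensive topk over every series. A small Python sketch of ranking it follows; it assumes the response body has already been fetched (for example with the curl call shown earlier) and that it contains the seriesCountByMetricName list. The function name is illustrative.

```python
import json


def top_series_by_metric(status_json, n=10):
    """Return [(metric_name, series_count), ...] from a /api/v1/status/tsdb response body."""
    payload = json.loads(status_json)
    entries = payload["data"].get("seriesCountByMetricName", [])
    # Sort defensively; do not rely on the order in which the server lists entries.
    ranked = sorted(entries, key=lambda e: int(e["value"]), reverse=True)
    return [(e["name"], int(e["value"])) for e in ranked[:n]]
```

The metric names at the top of this ranking are the first candidates for the relabeling rules in the resolution steps.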
Find the configured retention flags actually in use:
ps aux | grep -oE 'storage.tsdb.retention[^ ]*'
journalctl -u prometheus --no-pager | grep -iE 'retention|no space' | tail
Step-by-Step Resolution
1. Free enough space to let Prometheus run. Remove leftover snapshots and any non-Prometheus files on the volume; do not hand-delete block directories while Prometheus is running:
ls -lt /var/lib/prometheus/data/snapshots
rm -rf /var/lib/prometheus/data/snapshots/<old-snapshot-id>
df -h /var/lib/prometheus/data
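2. Set a size-based retention cap. This is the durable fix: Prometheus deletes the oldest blocks once the data dir exceeds the ceiling, so growth in ingestion no longer translates directly into a full disk. Only persisted blocks are deleted to honor the cap; the WAL and head chunks count toward the total but are not trimmed by it, which is why headroom still matters. Edit the unit’s flags (e.g. /etc/default/prometheus or the systemd ExecStart):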
--storage.tsdb.retention.size=80GB
--storage.tsdb.retention.time=30d
Leave ~15–20% headroom below the volume size for compaction scratch space. Whichever cap (size or time) is hit first triggers deletion.
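The headroom arithmetic in step 2 is simple enough to script when sizing many hosts. The sketch below is illustrative only: the function name and the integer-percent interface are assumptions, and it takes the volume size in the same unit that is passed to the flag.

```python
def retention_size_flag(volume_gb, headroom_pct=20):
    """Build a retention.size flag that leaves headroom_pct of the volume free."""
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    # Integer math avoids float rounding surprises in the emitted flag value.
    cap_gb = volume_gb * (100 - headroom_pct) // 100
    return f"--storage.tsdb.retention.size={cap_gb}GB"
```

For the 100G volume in the df output above, retention_size_flag(100) yields the 80GB value used in this step, and retention_size_flag(100, 15) yields 85GB.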
3. Reload/restart. After editing flags, restart so they take effect (flag changes require a restart, not a reload):
sudo systemctl daemon-reload
sudo systemctl restart prometheus
journalctl -u prometheus -f | grep -iE 'retention|compaction'
Older blocks beyond the new caps are deleted on the next compaction cycle.
4. Reduce cardinality if series count is the driver. Use topk(... count by (__name__) ...) to find the worst offenders and drop high-cardinality labels at scrape time:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'apiserver_request_duration_seconds_bucket'
    action: drop
  - regex: 'request_id|trace_id|pod_uid'
    action: labeldrop
5. Grow the disk as a stopgap if data is genuinely needed at full resolution:
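# Filesystem directly on the device (as in the df output above), ext4 assumed:
sudo resize2fs /dev/nvme1n1
# If the filesystem sits on partition 1 instead:
# sudo growpart /dev/nvme1n1 1 && sudo resize2fs /dev/nvme1n1p1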
df -h /var/lib/prometheus/data
6. Offload long-term storage. For multi-month history, keep local retention short and remote_write to a long-term store (Thanos, Mimir, Cortex) instead of growing local disk forever:
remote_write:
  - url: https://mimir.internal/api/v1/push
Prevention and Best Practices
- Always set --storage.tsdb.retention.size in addition to (or instead of) .time; it is the only cap that protects against ingestion spikes.
- Provision the data volume for ~120% of expected steady-state block size and alert on node_filesystem_avail_bytes under 20%.
- Track prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes on a dashboard; alert on cardinality growth before it fills the disk.
- Give Prometheus a dedicated volume; never share it with logs or other services.
- Drop high-cardinality labels at scrape time with metric_relabel_configs rather than after they have filled the disk.
- Use remote_write for long-term retention; keep local retention to days, not months.
- Clean up data/snapshots/ after any admin/tsdb/snapshot call.
Related Errors
- opening storage failed / WAL corruption: damaged WAL segments on startup; distinct from a full disk (though a disk-full event can lead to truncated WAL files).
- compaction failed: the most common log line carrying no space left on device; compaction is where the failure usually first appears.
- mmap: cannot allocate memory: a related resource exhaustion (virtual memory / vm.max_map_count) rather than disk, seen with high block counts.
Frequently Asked Questions
Should I delete block directories by hand to free space?
Not while Prometheus is running, because you risk inconsistent state. Free space from snapshots and non-Prometheus files instead, then let size/time retention delete old blocks automatically. If you must delete blocks manually, stop Prometheus first and remove whole block ULID directories, never individual files.
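If manual deletion really is unavoidable, it helps to know which ULID directory covers which time range before removing anything. The following read-only sketch assumes each block directory holds a meta.json with minTime and maxTime in milliseconds since the epoch; the function names are illustrative, and nothing here deletes data.

```python
import json
import os
from datetime import datetime, timezone


def list_blocks(data_dir):
    """Return block directories with their time ranges, oldest first."""
    blocks = []
    for entry in os.scandir(data_dir):
        meta_path = os.path.join(entry.path, "meta.json")
        if not entry.is_dir() or not os.path.isfile(meta_path):
            continue  # wal/, chunks_head/, snapshots/, lock files, etc.
        with open(meta_path) as fh:
            meta = json.load(fh)
        blocks.append({
            "dir": entry.name,
            "min_time": meta["minTime"],
            "max_time": meta["maxTime"],
        })
    return sorted(blocks, key=lambda b: b["min_time"])


def fmt_ms(ms):
    """Format a millisecond epoch timestamp as a UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
```

The first entries returned by list_blocks are the oldest blocks, which are the ones retention would remove first anyway.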
What’s the difference between retention.time and retention.size?
retention.time deletes data older than a duration; retention.size deletes the oldest blocks once the data dir exceeds a byte ceiling. Set both — whichever limit is reached first applies. Only retention.size protects against a sudden ingestion or cardinality spike.
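How the two limits interact can be illustrated with a simplified model. The sketch below is not Prometheus's implementation; it assumes blocks are considered newest first, that time retention is measured against the newest block's end time, and that size retention counts WAL bytes toward the total. All names are hypothetical.

```python
def blocks_to_delete(blocks, retention_ms=0, max_bytes=0, wal_bytes=0):
    """Return the set of block ids a time cap and/or size cap would remove.

    blocks: list of dicts with 'ulid', 'max_time' (ms) and 'size' (bytes).
    A cap of 0 means that limit is disabled.
    """
    ordered = sorted(blocks, key=lambda b: b["max_time"], reverse=True)  # newest first
    doomed = set()

    if retention_ms > 0 and ordered:
        newest = ordered[0]["max_time"]
        for i, block in enumerate(ordered):
            if newest - block["max_time"] >= retention_ms:
                # This block and everything older falls outside the time window.
                doomed.update(b["ulid"] for b in ordered[i:])
                break

    if max_bytes > 0:
        total = wal_bytes
        for i, block in enumerate(ordered):
            total += block["size"]
            if total > max_bytes:
                # This block pushes the total over the cap; drop it and all older ones.
                doomed.update(b["ulid"] for b in ordered[i:])
                break

    return doomed
```

With both caps set, the result is the union of the two sets, which is what "whichever limit is reached first applies" amounts to in practice.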
Do retention flag changes take effect on reload?
No. --storage.tsdb.retention.* are command-line flags, so you must restart Prometheus (not just POST /-/reload) for new values to apply.
Why did the WAL keep growing after the disk filled?
A full disk stalls compaction and WAL checkpointing, so the WAL cannot be truncated and keeps accumulating segments, making the problem worse. Freeing space and restarting lets checkpointing resume and the WAL shrink.
Is remote write a replacement for local disk?
It is for long-term storage. You still need local disk for the WAL, head, and short local retention, but offloading history to a remote store lets you keep local retention (and disk) small.