AI for Prometheus & Monitoring · By James Joyner IV · 9 min read

Prometheus Error Guide: 'no space left on device' TSDB Disk Full

Fix Prometheus 'no space left on device' TSDB errors: set retention size and time caps, free the data dir, cut cardinality, grow the disk, and offload long-term to remote write.

Exact Error Message

A full TSDB data directory surfaces as no space left on device on the write path, compaction, or WAL append:

ts=2026-06-27T11:58:03.661Z caller=db.go:933 level=error component=tsdb msg="compaction failed" err="write /var/lib/prometheus/data/01J...XYZ/chunks/000001: no space left on device"
level=error msg="Error on ingesting samples" err="write /var/lib/prometheus/data/wal/00004210: no space left on device"
level=error caller=db.go:885 component=tsdb msg="WAL checkpoint failed" err="no space left on device"

When this happens, head compaction stops and the WAL can no longer be checkpointed or truncated. Eventually Prometheus cannot ingest at all, and it may also fail to restart cleanly.

What the Error Means

Prometheus writes everything under one data directory, set by --storage.tsdb.path (default data/ relative to the working directory; /prometheus in the official container image, often /var/lib/prometheus/data on packaged installs): the WAL, head chunks flushed from memory to chunk files, and immutable persistent blocks produced by compaction. no space left on device means the filesystem backing that directory has no free space (or no free inodes) to write the next chunk, WAL segment, or compacted block.

Unlike the ingestion-level errors (out of order, out of bounds, duplicate sample), this is an OS-level write failure: the TSDB is healthy in principle, it simply has nowhere to put bytes. It is also distinct from WAL corruption — there the WAL files are damaged; here they are fine but the volume is full.

Common Causes

  • Retention too long with no size cap. --storage.tsdb.retention.time set high (or left at the 15d default) on a volume too small for the actual ingestion rate, and --storage.tsdb.retention.size unset.
  • Cardinality growth. A new label (pod, request ID, customer ID) explodes series count; head and blocks grow far faster than planned.
  • WAL growth. Stalled compaction (often itself caused by an earlier disk-full event) leaves the WAL unable to checkpoint, so it keeps growing.
  • Ingestion spike inside the retention window. With only time retention set, a burst of new targets or samples fills the disk before any block is old enough to expire.
  • Other consumers on the same volume. Logs, snapshots (/api/v1/admin/tsdb/snapshot), or another service sharing the partition.
  • Snapshots left behind. data/snapshots/ directories from admin snapshot calls never cleaned up.
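
Whether a volume is "too small for the actual ingestion rate" can be estimated up front. The sketch below applies the rough sizing rule from the Prometheus storage documentation (retention seconds × samples per second × bytes per sample). The function name is hypothetical, the 2 bytes/sample default is a cautious assumption (compressed samples typically average 1 to 2 bytes), and the samples-per-second figure is assumed to come from rate(prometheus_tsdb_head_samples_appended_total[5m]) on your own server.

```shell
#!/bin/sh
# Rough capacity estimate:
#   needed_bytes = retention_seconds * samples_per_second * bytes_per_sample
# estimate_tsdb_bytes is a hypothetical helper, not part of Prometheus.
estimate_tsdb_bytes() {
  local retention_days=$1 samples_per_sec=$2 bytes_per_sample=${3:-2}
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# Example: 15 days of retention at 100,000 samples/s, 2 bytes per sample.
bytes=$(estimate_tsdb_bytes 15 100000 2)
echo "estimated: $(( bytes / 1024 / 1024 / 1024 )) GiB"   # prints: estimated: 241 GiB
```

The result covers persistent blocks only; the WAL, head chunks, and compaction scratch space come on top, so treat it as a lower bound when sizing the volume.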

How to Reproduce the Error

On a small test volume, ingest aggressively with a long retention and no size cap:

# Tiny tmpfs to simulate a small data volume
sudo mkdir -p /tmp/promdata
sudo mount -t tmpfs -o size=64m tmpfs /tmp/promdata
# Started by hand (not via systemd), so send logs to a file instead of the journal
prometheus --storage.tsdb.path=/tmp/promdata \
  --storage.tsdb.retention.time=90d \
  --config.file=/etc/prometheus/prometheus.yml > /tmp/prometheus.log 2>&1 &
# Generate high-cardinality load against it, then watch:
tail -f /tmp/prometheus.log | grep -i 'no space left'
level=error component=tsdb msg="compaction failed" err="... no space left on device"

Under sustained high-cardinality load the 64 MB tmpfs fills quickly and reproduces the same failure.

Diagnostic Commands

Confirm the data volume is actually full (space and inodes):

df -h /var/lib/prometheus/data
df -i /var/lib/prometheus/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1    100G  100G   12K 100% /var/lib/prometheus
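
Because the same error appears for exhausted space and exhausted inodes, it can help to check both in one step. This is a minimal sketch: classify_usage and check_volume are hypothetical helper names, the 95% threshold is an arbitrary assumption, and check_volume relies on the --output option of GNU df.

```shell
#!/bin/sh
# Classify two usage percentages against a threshold.
# Prints one of: ok, space, inodes, both.
classify_usage() {
  local space_pct=$1 inode_pct=$2 threshold=${3:-95}
  local result=ok
  if [ "$space_pct" -ge "$threshold" ]; then result=space; fi
  if [ "$inode_pct" -ge "$threshold" ]; then
    if [ "$result" = space ]; then result=both; else result=inodes; fi
  fi
  echo "$result"
}

# Read space and inode usage for a path with GNU df, then classify them.
check_volume() {
  set -- $(df --output=pcent,ipcent "$1" | tail -n 1 | tr -d '%')
  # Some filesystems report "-" for inode usage; treat that as 0.
  case "$2" in ''|*[!0-9]*) set -- "$1" 0 ;; esac
  classify_usage "$1" "$2"
}
```

Calling check_volume /var/lib/prometheus/data on the volume shown above would be expected to print space. A result of inodes points at many small files rather than large blocks.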

See where the space went — blocks vs WAL vs snapshots:

du -sh /var/lib/prometheus/data/wal
du -sh /var/lib/prometheus/data/snapshots 2>/dev/null
du -sh /var/lib/prometheus/data/* | sort -h | tail

Read TSDB head and block stats from the API:

curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats'
{ "numSeries": 4821330, "chunkCount": 19204110, "minTime": ..., "maxTime": ... }

Check on-disk block size and series count via metrics:

prometheus_tsdb_storage_blocks_bytes
prometheus_tsdb_head_series
topk(10, count by (__name__) ({__name__=~".+"}))

Find the configured retention flags actually in use:

ps aux | grep -oE 'storage.tsdb.retention[^ ]*'
journalctl -u prometheus --no-pager | grep -iE 'retention|no space' | tail

Step-by-Step Resolution

1. Free enough space to let Prometheus run. Remove leftover snapshots and any non-Prometheus files on the volume; do not hand-delete block directories while Prometheus is running:

ls -lt /var/lib/prometheus/data/snapshots
rm -rf /var/lib/prometheus/data/snapshots/<old-snapshot-id>
df -h /var/lib/prometheus/data
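
To decide which snapshots are safe to remove, it may be easier to list only the ones past a certain age first. The sketch below only prints candidates and deletes nothing. stale_snapshots is a hypothetical helper, the 7-day default is an assumption, and it relies on the -mindepth/-maxdepth options available in GNU and BSD find.

```shell
#!/bin/sh
# Print snapshot directories under <data_dir>/snapshots whose mtime is
# more than <days> days old. Nothing is deleted.
stale_snapshots() {
  local data_dir=$1 days=${2:-7}
  # No snapshots directory means nothing to report.
  [ -d "$data_dir/snapshots" ] || return 0
  find "$data_dir/snapshots" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" | sort
}
```

For example, stale_snapshots /var/lib/prometheus/data 14 lists snapshots untouched for more than two weeks. Review the list before passing any path to rm -rf.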

2. Set a size-based retention cap. This is the durable fix: once the data dir exceeds the cap, Prometheus deletes the oldest blocks, so an ingestion spike cannot grow block storage without bound. Edit the unit’s flags (e.g. /etc/default/prometheus or the systemd ExecStart):

--storage.tsdb.retention.size=80GB
--storage.tsdb.retention.time=30d

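Leave roughly 15 to 20% headroom below the volume size for compaction scratch space, the WAL, and head chunks; these count toward the total, but size retention only ever deletes persistent blocks. Size units are powers of two, so 80GB means 80 × 1024³ bytes. Whichever cap (size or time) is hit first triggers deletion.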
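
The headroom arithmetic can be sketched as a small helper. retention_size_flag is a hypothetical name, the 20% default headroom is an assumption taken from the range above, and the result is rounded down to whole GB in Prometheus's power-of-two units.

```shell
#!/bin/sh
# Suggest a --storage.tsdb.retention.size value that leaves headroom
# on the volume. Input is the volume size in bytes.
retention_size_flag() {
  local volume_bytes=$1 headroom_pct=${2:-20}
  # 1073741824 = 1024^3, since Prometheus treats GB as a power-of-two unit.
  local cap_gb=$(( volume_bytes * (100 - headroom_pct) / 100 / 1073741824 ))
  echo "--storage.tsdb.retention.size=${cap_gb}GB"
}

# Example: a 100 GiB volume with the default 20% headroom.
retention_size_flag 107374182400   # prints: --storage.tsdb.retention.size=80GB
```

On a live host, the volume size could be taken from GNU df, for instance retention_size_flag "$(df -B1 --output=size /var/lib/prometheus/data | tail -n 1 | tr -d ' ')".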

3. Restart Prometheus. Retention settings are command-line flags, so a restart is required; a reload does not apply them:

sudo systemctl daemon-reload
sudo systemctl restart prometheus
journalctl -u prometheus -f | grep -iE 'retention|compaction'

Older blocks beyond the new caps are deleted on the next compaction cycle.

4. Reduce cardinality if series count is the driver. Use topk(... count by (__name__) ...) to find the worst offenders, then drop expensive metrics or high-cardinality labels at scrape time:

metric_relabel_configs:
  # Drop an entire metric by name
  - source_labels: [__name__]
    regex: 'apiserver_request_duration_seconds_bucket'
    action: drop
  # Remove labels by name; series that differed only in these labels will collide
  - regex: 'request_id|trace_id|pod_uid'
    action: labeldrop

5. Grow the disk as a stopgap if data is genuinely needed at full resolution:

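# After enlarging the underlying volume. Filesystem sits directly on the
# device (as in the df output above), so grow it in place (ext4; use xfs_growfs for XFS):
sudo resize2fs /dev/nvme1n1
# If the filesystem is on a partition instead, grow the partition first:
# sudo growpart /dev/nvme1n1 1 && sudo resize2fs /dev/nvme1n1p1
df -h /var/lib/prometheus/data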

6. Offload long-term storage. For multi-month history, keep local retention short and remote_write to a long-term store (Thanos Receive, Mimir, Cortex) instead of growing local disk forever:

remote_write:
  - url: https://mimir.internal/api/v1/push

Prevention and Best Practices

  • Always set --storage.tsdb.retention.size in addition to (or instead of) .time; it is the only cap that protects against ingestion spikes.
  • Provision the data volume for ~120% of expected steady-state block size and alert on node_filesystem_avail_bytes under 20%.
  • Track prometheus_tsdb_head_series and prometheus_tsdb_storage_blocks_bytes on a dashboard; alert on cardinality growth before it fills the disk.
  • Give Prometheus a dedicated volume — never share with logs or other services.
  • Drop high-cardinality labels at scrape time with metric_relabel_configs rather than after they have filled the disk.
  • Use remote_write for long-term retention; keep local retention to days, not months.
  • Clean up data/snapshots/ after any admin/tsdb/snapshot call.

Related Errors

  • opening storage failed / WAL corruption — damaged WAL segments on startup; distinct from a full disk (though a disk-full event can lead to truncated WAL files).
  • compaction failed — the most common log line carrying no space left on device; compaction is where the failure usually first appears.
  • mmap: cannot allocate memory — a related resource exhaustion (virtual memory / vm.max_map_count) rather than disk, with high block counts.

Frequently Asked Questions

Should I delete block directories by hand to free space? Not while Prometheus is running — you risk inconsistent state. Free space from snapshots and non-Prometheus files instead, then let size/time retention delete old blocks automatically. If you must delete blocks manually, stop Prometheus first and remove whole block ULID directories, never individual files.
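
If manual deletion is unavoidable, it helps to see which directories are actually blocks before touching anything. The sketch below only lists them. list_blocks is a hypothetical helper; it assumes block directories are named with 26-character ULIDs and skips everything else (wal, chunks_head, snapshots). ULID order reflects when a block was written, which usually but not always matches the age of the data, so check minTime and maxTime in each block's meta.json before removing anything.

```shell
#!/bin/sh
# Print "size_in_KiB<TAB>ULID" for each TSDB block directory, sorted by ULID.
# Nothing is deleted.
list_blocks() {
  local data_dir=$1 d name
  for d in "$data_dir"/*/; do
    name=$(basename "$d")
    # ULIDs use Crockford base32: digits and capitals without I, L, O, U.
    if printf '%s\n' "$name" | LC_ALL=C grep -Eq '^[0-9A-HJKMNP-TV-Z]{26}$'; then
      printf '%s\t%s\n' "$(du -sk "$d" | cut -f1)" "$name"
    fi
  done | LC_ALL=C sort -k2
}
```

With Prometheus stopped, list_blocks /var/lib/prometheus/data shows the candidates, and whole directories from the top of the list are the ones to consider first.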

What’s the difference between retention.time and retention.size? retention.time deletes data older than a duration; retention.size deletes the oldest blocks once the data dir exceeds a byte ceiling. Set both — whichever limit is reached first applies. Only retention.size protects against a sudden ingestion or cardinality spike.

Do retention flag changes take effect on reload? No. --storage.tsdb.retention.* are command-line flags, so you must restart Prometheus (not just POST /-/reload) for new values to apply.

Why did the WAL keep growing after the disk filled? A full disk stalls compaction and WAL checkpointing, so old WAL segments are never truncated and any space that frees up is taken by new ones, making the problem worse. Freeing space and restarting lets checkpointing resume and the WAL shrink.

Is remote write a replacement for local disk? It is for long-term storage. You still need local disk for the WAL, head, and short local retention, but offloading history to a remote store lets you keep local retention (and disk) small.
