Prometheus TSDB Internals: Head Block, WAL, Compaction & Retention Explained

The first time a Prometheus server fell over on my watch, it wasn’t CPU and it wasn’t a bad query. It was disk — a quiet, slow-motion fill until the WAL couldn’t flush and ingestion stalled. I’d treated the TSDB as a black box for years, and that outage was the bill coming due. So I sat down and actually learned how the thing stores data. This post is the map I wish I’d had: the head block, the WAL, on-disk blocks, compaction, and retention — plus the prometheus_tsdb_* metrics that tell you what’s really happening before disk usage tells you the hard way.

The head block: where every sample lands first

Every fresh sample Prometheus scrapes goes into the head block — an in-memory structure holding the most recent ~2-3 hours of data. The head is where active series live, where queries for “right now” are served, and where your memory footprint mostly comes from.

The single most important head metric is the active series count:

prometheus_tsdb_head_series

This is your real-time cardinality. If it climbs without bound, you have a label explosion somewhere, and your memory will follow it off a cliff. Pair it with the rate of new series being created:

rate(prometheus_tsdb_head_series_created_total[5m])

A persistent high churn rate here — series constantly created and never seen again — is the classic signature of unbounded labels like request IDs or timestamps in label values. If that sounds familiar, I wrote a whole field guide on taming Prometheus metric cardinality.
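To make the churn pattern concrete, here is a minimal Python sketch. It is not Prometheus code. It assumes a series is identified by its metric name plus its full label set, and the metric name, label names, and count_series helper are made up for illustration.

```python
def count_series(samples):
    """Count distinct series, treating metric name + label set as the identity."""
    seen = set()
    for metric, labels in samples:
        # frozenset makes the label set hashable and order-independent
        seen.add((metric, frozenset(labels.items())))
    return len(seen)


# Hypothetical traffic: 1,000 requests for each of 3 HTTP methods.
bounded = [("http_requests_total", {"method": m})
           for m in ("GET", "POST", "PUT") for _ in range(1000)]
# 1,000 requests with a per-request ID leaked into a label.
unbounded = [("http_requests_total", {"method": "GET", "request_id": str(i)})
             for i in range(1000)]

bounded_count = count_series(bounded)      # one series per method
unbounded_count = count_series(unbounded)  # one series per request
```

The bounded label yields three series no matter how much traffic arrives; the leaked request ID yields a new series for every request, which is the pattern the churn rate exposes.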

Pro Tip: prometheus_tsdb_head_series is a gauge of currently-active series, while prometheus_tsdb_head_series_created_total is a cumulative counter. Don’t confuse the two on a dashboard — graphing the counter raw makes a runaway look like steady growth.

The WAL and chunks_head: durability without flushing constantly

The head lives in memory, but memory is volatile, so Prometheus protects it with a Write-Ahead Log (WAL) under data/wal/. Every sample appended to the head is also written to the WAL, so a crash or restart can replay the log and rebuild the head with no data loss.

Periodically the WAL is checkpointed and truncated so it doesn’t grow forever:

rate(prometheus_tsdb_wal_truncations_total[1h])
rate(prometheus_tsdb_wal_corruptions_total[1h])

Any non-zero value on the corruptions counter is a page-worthy event — usually disk problems underneath.
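The write-then-replay idea behind the WAL fits in a few lines of Python. This is a toy under loud assumptions: the JSON-lines record format, the ToyHead class, and the file name are all invented here and look nothing like the real on-disk WAL format.

```python
import json
import os
import tempfile


class ToyHead:
    """In-memory store that logs every append to a file before applying it."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.samples = {}  # series name -> list of (timestamp, value)
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _apply(self, series, ts, value):
        self.samples.setdefault(series, []).append((ts, value))

    def _replay(self):
        # Rebuild in-memory state from whatever the log already contains.
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self._apply(record["s"], record["t"], record["v"])

    def append(self, series, ts, value):
        # Write-ahead: the record reaches the log before memory is updated.
        self._wal.write(json.dumps({"s": series, "t": ts, "v": value}) + "\n")
        self._wal.flush()
        self._apply(series, ts, value)

    def close(self):
        self._wal.close()


wal_file = os.path.join(tempfile.mkdtemp(), "toy.wal")
head = ToyHead(wal_file)
head.append("up", 1000, 1.0)
head.append("up", 2000, 0.0)
head.close()  # simulate a restart: the in-memory state is discarded

recovered = ToyHead(wal_file)
recovered_samples = recovered.samples["up"]
recovered.close()
```

The second ToyHead starts with empty memory and ends up with both samples purely from the log. It also shows why replay time grows with the size of the log: every record is read back one by one.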

Alongside the WAL, you’ll see a chunks_head/ directory. As head chunks fill up (each chunk holds up to 120 samples per series), Prometheus flushes complete-but-still-recent chunks to these memory-mapped files. This keeps RAM down while keeping the data fast to query, and it dramatically shortens WAL replay on restart — replay time is a real availability factor for big servers.
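A quick bit of arithmetic shows how long one chunk lasts. The sketch below only multiplies the 120-sample capacity by the scrape interval, so treat the result as an upper bound: it assumes the sample count is the only thing that closes a chunk, and the helper name is mine.

```python
SAMPLES_PER_CHUNK = 120  # per-series chunk capacity quoted above


def chunk_span_minutes(scrape_interval_seconds):
    """Upper bound on the wall-clock time one full chunk covers for a series."""
    return SAMPLES_PER_CHUNK * scrape_interval_seconds / 60


span_15s = chunk_span_minutes(15)  # 30.0 minutes
span_60s = chunk_span_minutes(60)  # 120.0 minutes
```

At a 15-second scrape interval a series fills a chunk about every half hour, so most of the head's recent history can sit in memory-mapped files rather than on the heap.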

On-disk blocks: the immutable two-hour units

Every two hours, the head is cut and written to disk as an immutable block — a self-contained directory containing chunk files, a tombstones file for deletions, a meta.json, and crucially an index. The index is an inverted index mapping label name/value pairs to the series and chunks that contain them; it’s what makes {job="api"} resolve quickly instead of scanning everything.
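The inverted index idea is easy to see in miniature. This is a toy, not the block index format: ToyIndex, its methods, and the sample labels are hypothetical, and only exact name=value matchers are handled.

```python
from collections import defaultdict


class ToyIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # (label name, value) -> series IDs

    def add(self, series_id, labels):
        for pair in labels.items():
            self.postings[pair].add(series_id)

    def select(self, **matchers):
        """Return sorted IDs of series matching every name=value pair."""
        sets = [self.postings.get(pair, set()) for pair in matchers.items()]
        return sorted(set.intersection(*sets)) if sets else []


index = ToyIndex()
index.add(1, {"job": "api", "instance": "a:9090"})
index.add(2, {"job": "api", "instance": "b:9090"})
index.add(3, {"job": "db", "instance": "c:9090"})

api_series = index.select(job="api")
one_series = index.select(job="api", instance="b:9090")
no_series = index.select(job="cache")
```

Each selector is a lookup plus a set intersection, so the cost depends on how many series carry the matched labels, not on how many series exist in total.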

Each block covers a fixed time range and never changes after it’s written. Queries that span time simply fan out across the relevant blocks plus the head. You can watch how many blocks are currently loaded:

prometheus_tsdb_blocks_loaded

The number of loaded blocks rising steadily is normal as data accumulates; it should drop after compaction merges them.

Compaction: merging small blocks into big ones

If Prometheus kept thousands of tiny two-hour blocks, query planning and the file count would become miserable. Compaction solves this by merging adjacent blocks into larger ones (by default each level is three times the previous one: 2h → 6h → 18h and beyond), deduplicating, applying tombstones, and rebuilding a single compact index per merged block.
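The growth of block ranges is easy to sketch. This assumes each compaction level is three times the previous one, which is my understanding of the default rather than something to rely on for your version. The function and parameter names are my own, and the cap Prometheus places on the largest block range is ignored.

```python
def exponential_block_ranges(min_hours, steps, factor):
    """Return block durations in hours, each level `factor` times the previous."""
    ranges = []
    current = min_hours
    for _ in range(steps):
        ranges.append(current)
        current *= factor
    return ranges


# Assumed defaults: 2-hour base blocks, multiplier of 3.
levels = exponential_block_ranges(min_hours=2, steps=5, factor=3)
```

Under those assumptions the levels come out as 2, 6, 18, 54, and 162 hours, so a handful of merges takes you from two-hour blocks to blocks spanning several days.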

Track compactions and their duration:

rate(prometheus_tsdb_compactions_total[1h])
histogram_quantile(0.99, rate(prometheus_tsdb_compaction_duration_seconds_bucket[1h]))
rate(prometheus_tsdb_compactions_failed_total[1h])

Failed compactions are serious — they mean blocks aren’t being merged or cleaned, so disk creeps up and queries slow down. Long compaction durations on the p99 often point to disk I/O contention; compaction reads and rewrites large chunk files, so cheap network volumes can choke here.

prometheus_tsdb_compaction_chunk_range_seconds_sum
prometheus_tsdb_compaction_populating_block

If prometheus_tsdb_compaction_populating_block sits at 1 for far longer than your usual compaction duration, a compaction is either still writing its block or wedged.

Pro Tip: Compaction temporarily needs headroom — it writes the new merged block before deleting the source blocks. Budget extra disk so a compaction of your largest block range never runs you out of space mid-merge. A stuck compaction on a full disk is a genuinely bad afternoon.
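That headroom rule can be turned into a small check. This is a sketch under a deliberately pessimistic assumption, namely that the merged block is as large as its source blocks combined; the function name and the example sizes are hypothetical.

```python
GIB = 1024 ** 3


def has_compaction_headroom(volume_bytes, used_bytes, largest_merge_input_bytes):
    """True if the merged block could be written in full before sources are deleted.

    Worst case assumed here: the merged block is as large as its inputs combined.
    """
    free_bytes = volume_bytes - used_bytes
    return free_bytes >= largest_merge_input_bytes


# 200 GiB free against a 40 GiB merge: plenty of room.
roomy = has_compaction_headroom(500 * GIB, 300 * GIB, 40 * GIB)
# 20 GiB free against the same merge: the compaction cannot finish.
tight = has_compaction_headroom(220 * GIB, 200 * GIB, 40 * GIB)
```

In practice the merged block is usually somewhat smaller than its inputs, so this check errs on the safe side.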

Retention: time vs size, and which wins

Old blocks are deleted by retention, and you get two knobs — they are evaluated together, and whichever limit is hit first triggers deletion:

# Time-based retention: drop blocks older than 30 days
--storage.tsdb.retention.time=30d

# Size-based retention: keep total TSDB under 200 GiB
--storage.tsdb.retention.size=200GB

# You can set both — first one reached wins
prometheus \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=200GB

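A few things that bite people. First, size-based retention only ever deletes persistent blocks; it never trims the WAL or the head. Recent Prometheus versions do count the WAL and memory-mapped head chunks toward the total, so check the storage docs for your version, and either way provision the disk with headroom above the size limit. Second, retention deletes whole blocks, and only once a block falls entirely outside the window, so you usually hold somewhat more than the time target and usage moves in block-sized steps rather than tracking a limit exactly. And if you need months or years of history, local retention is the wrong tool: that’s a job for remote storage like Thanos or Mimir, which I compared in long-term Prometheus storage.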
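Here is a simplified Python model of how the two limits interact at block granularity. It is a sketch, not Prometheus’s implementation: the Block class, the blocks_to_delete function, and the example sizes are invented, and block age is measured against a plain `now` value, which may not be the reference point Prometheus itself uses.

```python
from dataclasses import dataclass

DAY = 24 * 3600
GIB = 1024 ** 3


@dataclass
class Block:
    name: str
    max_time: int    # timestamp of the newest sample, in seconds
    size_bytes: int


def blocks_to_delete(blocks, now, retention_seconds, max_bytes):
    """Return the blocks a time limit or a size limit would remove.

    A max_bytes of 0 disables the size limit in this sketch.
    """
    doomed = []
    cumulative_bytes = 0
    # Walk newest to oldest so the oldest data is the first to go.
    for block in sorted(blocks, key=lambda b: b.max_time, reverse=True):
        cumulative_bytes += block.size_bytes
        too_old = now - block.max_time > retention_seconds
        too_big = max_bytes > 0 and cumulative_bytes > max_bytes
        if too_old or too_big:
            doomed.append(block)
    return doomed


blocks = [
    Block("A", max_time=5 * DAY, size_bytes=50 * GIB),
    Block("B", max_time=15 * DAY, size_bytes=80 * GIB),
    Block("C", max_time=30 * DAY, size_bytes=90 * GIB),
    Block("D", max_time=40 * DAY, size_bytes=60 * GIB),
]

both_limits = [b.name for b in blocks_to_delete(blocks, 40 * DAY, 30 * DAY, 200 * GIB)]
time_only = [b.name for b in blocks_to_delete(blocks, 40 * DAY, 30 * DAY, 0)]
```

With only the 30-day limit, just block A goes. Add the 200 GiB limit and block B goes too, even though its newest sample is only 25 days old: the size limit was reached first.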

Confirm what retention is actually doing:

prometheus_tsdb_size_retentions_total
prometheus_tsdb_time_retentions_total
prometheus_tsdb_storage_blocks_bytes

prometheus_tsdb_storage_blocks_bytes is your single best gauge of real on-disk block size — graph it against your retention.size limit and you’ll see retention working at the edge.

Sizing disk and watching the right metrics

Rough capacity planning starts from one number: Prometheus stores roughly 1-2 bytes per sample after compression. Multiply your active series by samples per series per second, by retention in seconds, and by bytes per sample:

bytes ≈ active_series × (1 / scrape_interval_seconds) × retention_seconds × bytes_per_sample
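Plugging numbers into that formula is the quickest way to sanity-check it. The workload below (1 million active series, 15-second scrapes, 30 days, 1.5 bytes per sample) is hypothetical, and the function name is mine; swap in your own values from prometheus_tsdb_head_series and your scrape config.

```python
def estimate_block_bytes(active_series, scrape_interval_seconds,
                         retention_seconds, bytes_per_sample):
    """Apply the sizing formula: series x samples/sec x retention x bytes/sample."""
    samples_per_second = active_series / scrape_interval_seconds
    return samples_per_second * retention_seconds * bytes_per_sample


# Hypothetical workload: 1M active series, 15s scrapes, 30 days, 1.5 B/sample.
estimate = estimate_block_bytes(
    active_series=1_000_000,
    scrape_interval_seconds=15,
    retention_seconds=30 * 24 * 3600,
    bytes_per_sample=1.5,
)
estimate_gib = estimate / 1024 ** 3  # roughly 241 GiB
```

That is the estimate for compacted blocks only. The WAL, the head, and compaction working space come on top of it.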

Then add generous headroom for the WAL, head, and compaction working space — I aim to keep steady-state usage under ~60-70% of the volume. The metrics I keep on every Prometheus dashboard:

# How fast samples are landing — your ingestion firehose
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Out-of-order / rejected appends that signal clock or scrape issues
rate(prometheus_tsdb_out_of_order_samples_total[5m])

# Failed WAL truncations: if these pile up, the WAL keeps growing on disk
prometheus_tsdb_wal_truncations_failed_total

Letting AI read the internals — but reviewing first

Here’s where I lean on AI, and where I’m careful about it. When prometheus_tsdb_compactions_failed_total ticks up at 2 a.m., a model is a fantastic fast junior engineer: paste the metric names, the flag values, and a snippet of logs, and it’ll explain the head/WAL/block relationship and propose a likely cause in seconds. That’s genuinely faster than me re-deriving the storage model from memory.

But “fast junior engineer” is exactly the right mental model — every suggestion gets reviewed before it ships. AI will confidently propose --storage.tsdb.retention.size=200GB on a 220 GB volume and not flag that compaction headroom just vanished. So I treat its output as a draft, not a command. If you want repeatable, vetted prompts for this kind of work, the prompt library and the deeper prompt packs are where I keep mine, and tools like Claude are strong at explaining these internals when you give them the real metric values.

The pattern that’s worked best: have AI draft the alert rules on these TSDB metrics, then review every threshold against your actual disk and ingestion numbers. Our free Alert Rule Generator does exactly that — it produces explainable, reviewable PromQL alerts (compaction failures, retention pressure, series growth) so you ship something you understand rather than a copy-pasted black box. For more on the monitoring stack, the full Prometheus monitoring category collects the rest.

Conclusion

The TSDB isn’t a black box once you’ve seen its shape: samples land in the head, the WAL keeps them safe, blocks make them durable and queryable, compaction keeps them efficient, and retention keeps the disk from filling. Watch prometheus_tsdb_head_series, the compaction counters, and prometheus_tsdb_storage_blocks_bytes, size your disk with real headroom, and let AI accelerate the diagnosis — as long as a human reviews the conclusion before it becomes a flag in production.
