Taming Postgres Bloat and Autovacuum With AI

The page that finally got my attention said “disk 92% full” on a database that should have been a third that size. The data hadn’t grown — the dead tuples had. A high-churn table was getting hammered with updates, autovacuum was running with cluster-default settings that were tuned for a 2010-era small database, and dead rows were piling up faster than they were being cleaned. Bloat is the quiet Postgres failure mode: nothing errors, queries just get slower and disk creeps up until something breaks. AI is useful here because autovacuum has a lot of knobs and the interactions are non-obvious, but it cannot see your churn rate or your dead-tuple counts. Those come from the catalog. Pair them.

This is how I diagnose bloat and tune autovacuum without turning it into guesswork.

Understand why bloat happens

Postgres uses MVCC: an UPDATE doesn’t overwrite a row, it writes a new version and marks the old one dead. DELETE just marks rows dead. Autovacuum reclaims that dead space for reuse. If updates outrun vacuuming, dead tuples accumulate, the table physically grows, and scans read more pages than they need to. The fix is rarely “vacuum harder once” — it’s tuning autovacuum to keep up continuously on the tables that actually churn.

Measure the dead tuples

Start with the per-table picture. The ratio of dead to live tuples tells you where the problem is.

SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / nullif(n_live_tup, 0), 3) AS dead_ratio,
  last_autovacuum,
  autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

A table with a high dead_ratio and a stale or null last_autovacuum is your culprit. I paste this to AI and ask it to rank tables by likely bloat impact, weighting by both ratio and absolute table size — a 40% dead ratio on a tiny lookup table doesn’t matter; the same ratio on your largest table does.

Estimate the actual bloat

Dead tuples are a proxy; physical bloat is the thing that fills disk. Estimating it precisely is fiddly, so I use the pgstattuple extension on suspect tables for ground truth.

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT * FROM pgstattuple('orders');
SELECT * FROM pgstatindex('idx_orders_created');

free_percent and dead_tuple_percent tell you how much space is recoverable. Run this on a replica — it scans the whole relation and isn’t free. I give the output to AI to interpret, asking specifically whether the bloat justifies a rewrite or whether better autovacuum settings will let the space get reused naturally.

Tune autovacuum where it matters

The cluster defaults scale autovacuum by a percentage of the table. On a huge table, autovacuum_vacuum_scale_factor = 0.2 means 20% of the rows must be dead before vacuum even triggers — on a billion-row table that’s 200M dead tuples, which is far too lax. The fix is per-table overrides, not a global change.

ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.01,   -- trigger at 1% dead
  autovacuum_vacuum_threshold = 1000,
  autovacuum_vacuum_cost_limit = 2000,     -- let it work faster
  autovacuum_analyze_scale_factor = 0.02
);

I have a 600M-row orders table taking ~30k updates/minute. Dead tuples hit 40M before autovacuum fired under cluster defaults. Propose per-table autovacuum settings to keep dead tuples under ~1%, and explain the tradeoff of lowering the cost limit against I/O on the primary. Give me the exact ALTER TABLE statement.

The model reasons well about the scale-factor math and the cost-limit tradeoff. I verify by watching n_dead_tup and last_autovacuum over the next few hours to confirm autovacuum now keeps pace. If it doesn’t, I lower the scale factor again — but I make one change at a time so I know what moved the needle.

Reclaiming space you’ve already lost

Tuning stops future bloat; it doesn’t shrink a table that’s already bloated. A plain VACUUM makes the space reusable but doesn’t return it to the OS. To actually reclaim disk without an exclusive lock, use pg_repack rather than VACUUM FULL, which rewrites the table under an ACCESS EXCLUSIVE lock and takes the table offline.

# pg_repack rebuilds the table online, swapping it in at the end
pg_repack --no-order --table orders -d proddb

I ask AI to walk through the repack plan and the rollback, but I run it during a low-traffic window on a replica first to time it and confirm disk headroom — pg_repack needs space for a full copy of the table while it works.

Keep it boring

Bloat management is maintenance, not heroics. The loop: measure dead tuples and physical bloat, tune autovacuum per-table for the hot tables, reclaim space with pg_repack when needed, then keep watching the catalog. AI turns autovacuum’s pile of interacting knobs into a clear recommendation in minutes, but the verification is always the catalog telling you the dead-tuple count came down. For the broader tuning picture see the Postgres guides, and I keep my autovacuum-audit prompts in the prompt library so I’m not rewriting them every incident.