Triaging a Full Disk on Linux: df, du, inodes, and AI

“No space left on device” at 2 a.m. is a rite of passage. The application is throwing write errors, the database won’t accept connections, maybe you can’t even log a line because the journal is full too. The pressure is to delete something fast — and that’s how people rm -rf a directory the running service still needs and turn a full disk into a full outage.

After enough of these, the move that consistently works is to triage methodically instead of frantically. AI fits this perfectly as a fast junior engineer: it reads the du output and spots the obvious offender quicker than I scan it, and it suggests what’s safe to remove. But it never gets to run the rm. On a full disk, the difference between deleting a stale log and deleting live data is the difference between a five-minute fix and a restore.

Confirm it’s actually space — or inodes

df answers the first question, but you have to check two things:

df -h        # bytes free
df -i        # inodes free

Inode exhaustion fools people constantly. df -h shows plenty of space, but df -i shows 100% used, usually from millions of tiny files (a runaway mail queue, session files, or a cache directory). The error is identical (ENOSPC), but the fix is different: you have to reduce the number of files, not their total size. Pro Tip: If df -h says you have room but writes still fail with ENOSPC, check df -i immediately. The “mystery full disk” is very often inode exhaustion in /var/spool, a session store, or a Maildir.
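df -i tells you which filesystem ran out of inodes, not which directory is hoarding them. Here is a minimal sketch for narrowing that down, assuming GNU du recent enough to support --inodes (coreutils 8.22 or later). The name inode_hogs is made up here for illustration.

```shell
# inode_hogs DIR [N]: show the N entries (default 10) with the most inodes
# among DIR and its immediate subdirectories, largest count first.
# Assumes GNU du with --inodes; inode_hogs is a hypothetical helper name.
inode_hogs() {
  du --inodes -x --max-depth=1 -- "$1" 2>/dev/null \
    | sort -rn \
    | head -n "${2:-10}"
}
```

Called as inode_hogs /var, the first row is the total for /var itself and the rows beneath it are the subdirectories worth descending into. Repeat on the worst one until the pile of tiny files turns up.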

Hand both outputs to your assistant. I keep this prompt with my other linux admin prompts:

Here’s df -h and df -i. Which filesystem is the problem, is this a space or inode issue, and what directories should I investigate first given a typical Ubuntu server layout?

Find the heavy directories without flailing

The classic mistake is du -sh /* from root, which crawls everything including network mounts and /proc. Be targeted:

du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -20

The -x keeps it on one filesystem so you don’t wander into mounts. That one-liner gives you the top-level offenders sorted biggest-first. Drill into the worst one and repeat. For interactive hunting, ncdu is unbeatable:

sudo ncdu -x /var

It builds a navigable tree sorted by size — you walk into the biggest directory, then the biggest subdirectory, until you find the culprit. Capture the du output and let the AI summarize it: “Here’s the top 20 directories by size under /var. What’s likely safe to clean and what’s load-bearing?” The model is good at recognizing that /var/lib/docker or /var/log/journal is the offender. You still verify before deleting.
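When ncdu isn't installed, the drill-down loop described above (take the largest directory, descend, repeat) can be approximated with du alone. This is a sketch, assuming GNU du; biggest_child is a hypothetical helper name, and paths containing backslashes are not handled.

```shell
# biggest_child DIR: print "<KiB><TAB><path>" for the largest immediate
# subdirectory of DIR, staying on DIR's filesystem (-x).
# Hypothetical helper; assumes GNU du (--max-depth).
biggest_child() {
  du -xk --max-depth=1 -- "$1" 2>/dev/null \
    | sort -rn \
    | awk -F '\t' -v top="$1" '$2 != top { print; exit }'  # skip DIR's own total
}
```

Run biggest_child /var, then run it again on the path it prints, and keep going until the output is a directory you recognize as the culprit. It prints nothing when DIR has no subdirectories.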

The usual suspects

Most full-disk incidents are one of a handful of culprits, and knowing them saves time:

Logs that never rotated: /var/log ballooning because logrotate is misconfigured or a service logs to a file logrotate doesn’t know about.
The journal: journalctl --disk-usage to measure it, then sudo journalctl --vacuum-size=500M to trim archived journal files down to roughly that size. This is a one-time trim, not a lasting cap.
Docker: docker system df to see what is reclaimable, then docker system prune for stopped containers, unused networks, dangling images, and build cache.
Old package caches: apt clean or dnf clean all.
Deleted-but-open files: a process holding a deleted log; the space won’t free until the process closes the file, which usually means restarting it.

That last one is sneaky. df shows the disk full but du can’t find the space, because a process is still holding a file you already deleted:

sudo lsof +L1 | grep deleted

Find the PID, and the fix is restarting (not killing -9) that service so it releases the handle. The incident response helper is good at turning “df and du disagree” into the lsof investigation path — feed it the symptoms and it points you at the deleted-file case.
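On a box where lsof isn't installed, the same information is available from /proc on Linux, where each open descriptor is a symlink whose target ends in " (deleted)" once the file has been unlinked. This swaps lsof for a direct /proc scan and is only a sketch: list_deleted_open is a hypothetical name, it assumes GNU stat, and without root it only sees your own processes.

```shell
# list_deleted_open: print "<pid><TAB><bytes><TAB><target>" for every open
# file descriptor whose file has been deleted. Linux-only (/proc).
# Hypothetical helper; the ( ) body runs in a subshell so loop variables
# do not leak into the caller.
list_deleted_open() (
  for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue   # fd vanished or unreadable
    case "$target" in
      *" (deleted)")
        pid=${fd#/proc/}
        pid=${pid%%/*}
        # -L follows the fd link to the still-open file to get its size.
        size=$(stat -Lc %s "$fd" 2>/dev/null || echo "?")
        printf '%s\t%s\t%s\n' "$pid" "$size" "$target"
        ;;
    esac
  done
)
```

Piping the output through sort -t "$(printf '\t')" -k2,2rn puts the biggest space holders first. Entries such as memfd or shared-memory segments also appear here and are usually not what you are looking for.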

Buy yourself headroom safely

Sometimes you just need a few hundred megabytes to get the service breathing again before you do real cleanup. The safe quick wins, in order of how confidently I’ll run them:

sudo journalctl --vacuum-size=200M     # one-time trim of archived journal files to ~200M
sudo apt clean                          # or: dnf clean all
docker system prune -f                  # stopped containers, unused networks, dangling images, build cache

These are reversible-ish and rarely touch anything live. What I won’t do under pressure is delete from a directory I haven’t confirmed is safe. If the AI suggests removing /var/lib/something, I check what owns it (dpkg -S or rpm -qf) and whether a running process has it open before anything happens. The model drafts the cleanup; a human confirms each deletion against reality.
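The "does a running process have it open" check can be sketched without extra tools by reading /proc on Linux. The helper name open_under is hypothetical, and the sketch has known blind spots: it only looks at open file descriptors (not memory-mapped files or working directories), and without root it only sees your own processes.

```shell
# open_under DIR: print "<pid><TAB><path>" for every open file descriptor
# that points at a file below DIR. No output means nothing was found.
# Hypothetical helper; Linux-only (/proc), assumes GNU readlink -f.
# The ( ) body runs in a subshell, so "exit" only leaves the function.
open_under() (
  dir=$(readlink -f -- "$1") || exit 2     # /proc targets are resolved paths
  for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      "$dir"/*)
        pid=${fd#/proc/}
        printf '%s\t%s\n' "${pid%%/*}" "$target"
        ;;
    esac
  done
)
```

If sudo sh -c '. ./helpers.sh; open_under /var/lib/something' prints anything, the directory is live and stays where it is (helpers.sh is a placeholder for wherever you keep the function). Empty output is a reason to keep checking, not permission to delete.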

Prevent the next one

Once the fire’s out, the real work is making sure it doesn’t recur. Set a disk-usage alert so you find out at 80%, not 100%. As a starting point, this one-liner prints every filesystem that is more than 80% used:

df -h --output=pcent,target | awk 'NR>1 && $1+0 > 80'

Wire that into your monitoring, or let the monitoring alerts helper draft an alert rule for both filesystem usage and inode usage. Fix logrotate properly, cap the journal in /etc/systemd/journald.conf with SystemMaxUse=, and put a docker system df check on your Docker hosts. I keep these prevention snippets in the prompt packs and prompts library so the post-incident hardening is a checklist, not a memory test.
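Until a real alert rule exists, a cron-friendly stopgap can cover bytes and inodes in one pass. The sketch below assumes GNU df with --output (which provides the pcent and ipcent columns). over_threshold is a hypothetical name, and mount points containing spaces are printed only up to the first space.

```shell
# over_threshold LIMIT: read `df --output=pcent,ipcent,target` on stdin and
# print one line per filesystem at or above LIMIT percent, for space and
# for inodes separately. Exit status is 1 if anything was printed, else 0.
# Hypothetical helper; filesystems reporting "-" for inodes count as 0.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      if ($1 + 0 >= limit + 0) { print "space", $1, $3; bad = 1 }
      if ($2 + 0 >= limit + 0) { print "inodes", $2, $3; bad = 1 }
    }
    END { exit bad + 0 }'
}

# Example wiring (commented out; replace the echo with your own notifier):
# df --output=pcent,ipcent,target | over_threshold 80 || echo "disk alert on $(hostname)"
```

The non-zero exit status is what makes it usable from cron or a monitoring agent's script check: a quiet run means every filesystem is under the limit on both counts.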

Conclusion

A full disk is an emergency that rewards calm. Confirm space versus inodes, find the heavy directories with du -x and ncdu, check for deleted-but-open files when df and du disagree, and reclaim headroom with the safe, reversible cleanups first. AI earns its place as a fast reader of the du output and a triage partner, but every rm stays under human control, because on a full disk the cost of deleting the wrong thing is a restore, not a retry.