Taming the Linux OOM Killer: Tuning Out-of-Memory Behavior

There’s a particular flavor of 3am page where a server is up, the load is fine, but your main service is just… gone. No crash log, no panic, just an entry in dmesg that says the kernel killed it. That’s the OOM killer, and the cruel irony is that it almost always seems to kill the one process you cared about and spare the leaky one that caused the problem. After 25 years I’ve made peace with it, mostly by learning how it actually decides. Here’s that, plus how to bend it to your will.

When the OOM killer fires (and when it doesn’t)

The OOM killer is invoked when the kernel cannot satisfy a memory allocation and cannot reclaim enough by other means. Crucially, it fires when pages that are actually being used cannot be backed with real memory, not when the friendly “free memory” number people watch gets low. A box with 90% memory used can be perfectly healthy if most of that is reclaimable cache.
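To make the “free versus reclaimable” point concrete, here is a minimal sketch that compares MemFree with MemAvailable (the kernel’s own estimate of what new workloads could get without swapping). The function name mem_available_pct is mine, and the optional file argument exists only so you can point it at a saved copy of /proc/meminfo.

```shell
# Print MemFree and MemAvailable as a percentage of MemTotal.
# $1 = meminfo file (default /proc/meminfo; overridable for saved snapshots).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemFree:"      { free = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0)
        printf "free=%d%% available=%d%%\n", free * 100 / total, avail * 100 / total
    }
  ' "${1:-/proc/meminfo}"
}
```

A box that reports a low free figure but a high available figure is mostly holding cache, which is the healthy case described above.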

First, confirm it was actually OOM and find the victim:

dmesg -T | grep -i -E 'killed process|out of memory|oom'
journalctl -k | grep -i oom

The kernel prints a table of every eligible process with its RSS (in pages) and its oom_score_adj at kill time, unless vm.oom_dump_tasks has been set to 0. That table is gold: it tells you exactly what was consuming memory at the moment of death, which is often a different process than the one that died.
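Reading that table by eye at 3am is error-prone, so here is a rough sketch that sorts it by RSS. It assumes the classic column order (pid, uid, tgid, total_vm, rss, pgtables_bytes, swapents, oom_score_adj, name); other kernel versions print different columns, so check the header line in your own log first. The function name oom_top_rss is hypothetical.

```shell
# Read kernel log text on stdin and print the top N rows of the OOM task dump,
# sorted by RSS (in pages). Output columns: rss pid oom_score_adj name.
# Assumes column order: pid uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
oom_top_rss() {
  awk '
    # A task row looks like "[   4242]  1000  4242 ... postgres"; timestamps do not
    # match because they contain a dot or letters inside the brackets.
    match($0, /\[ *[0-9]+\]( +-?[0-9]+)+ +[^ ]+/) {
      row = substr($0, RSTART, RLENGTH)
      gsub(/\[|\]/, "", row)
      n = split(row, f, " ")
      if (n < 9) next
      printf "%d %d %s %s\n", f[5], f[1], f[8], f[9]
    }
  ' | sort -rn | head -n "${1:-10}"
}
```

Typical use would be something like dmesg | oom_top_rss 5. Multiply the RSS column by the page size (usually 4 KiB) to get bytes.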

How the kernel picks a victim

Each process gets an oom_score derived mainly from how much memory it uses, adjusted by oom_score_adj (range -1000 to +1000). Higher score, more likely to die.

cat /proc/<pid>/oom_score        # current computed score
cat /proc/<pid>/oom_score_adj    # your tunable bias

Because the score scales with memory footprint, big well-behaved processes (databases, JVMs) are natural targets even when they’re the victim, not the cause. That’s the root of the “it killed the wrong thing” feeling.
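You don’t have to wait for a kill to see who is at the front of the line. A minimal sketch, assuming a hypothetical helper name oom_rank, that walks /proc and lists the processes with the highest current oom_score:

```shell
# List the N processes the kernel currently ranks as most killable.
# Output columns: oom_score pid oom_score_adj comm.
# $1 = number of rows (default 10); $2 = proc root (default /proc).
oom_rank() {
  n=${1:-10}
  root=${2:-/proc}
  for d in "$root"/[0-9]*; do
    [ -r "$d/oom_score" ] || continue
    # Processes can exit mid-scan, so skip any entry that vanishes.
    score=$(cat "$d/oom_score" 2>/dev/null) || continue
    adj=$(cat "$d/oom_score_adj" 2>/dev/null) || continue
    name=$(cat "$d/comm" 2>/dev/null) || continue
    printf '%s %s %s %s\n' "$score" "${d##*/}" "$adj" "$name"
  done | sort -rn | head -n "$n"
}
```

If your database shows up at the top of this list on a quiet day, it will be the first to go on a bad one.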

Protecting a critical process

Lower a process’s chance of being chosen by setting a negative oom_score_adj. Setting it to -1000 makes a process effectively unkillable by OOM:

# Protect a running process (-o picks the oldest match, so the path contains exactly one PID)
echo -800 | sudo tee /proc/$(pgrep -o -f myservice)/oom_score_adj

For services, do it declaratively in the systemd unit so it survives restarts:

[Service]
OOMScoreAdjust=-800

Conversely, you can mark a known-greedy batch job as more killable with a positive value, so the kernel sacrifices it first and leaves your database alone. This is the single most effective OOM tuning move: don’t try to stop OOM, just steer it toward the disposable process.
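For jobs that are not systemd units, one way to do the steering is a small wrapper that raises the score before the job starts. This is a sketch with a hypothetical name, run_killable; it relies on the fact that a process may raise its own oom_score_adj without privileges and that the value survives exec.

```shell
# Run a command with a raised oom_score_adj so the OOM killer prefers it.
# Usage: run_killable <adj 0..1000> <command> [args...]
run_killable() {
  adj=$1
  shift
  (
    # Inside the subshell, /proc/self is the process that is about to exec the command.
    echo "$adj" > /proc/self/oom_score_adj || exit 1
    exec "$@"
  )
}
```

For example, run_killable 500 ./nightly-report.sh (the script name is a placeholder) starts the job already marked as disposable, and its children inherit the same value.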

Use cgroups to contain the leak instead

Tuning oom_score_adj is triage. The real fix is to stop one process from being able to starve the whole box. cgroup v2 memory limits do that — when a cgroup hits its limit, the kernel reclaims or OOM-kills within that cgroup, leaving the rest of the system untouched.

With systemd, that’s two directives:

[Service]
# Hard cap; OOM-kills inside this unit at the limit
MemoryMax=2G
# Soft cap; throttles + reclaims before the hard limit
MemoryHigh=1500M

MemoryHigh is underrated. It puts back-pressure on the cgroup well before the hard limit, so a slow leak gets throttled and shows up as latency you can alert on, instead of a sudden kill. Check live usage:

systemctl show myservice -p MemoryCurrent
cat /sys/fs/cgroup/system.slice/myservice.service/memory.current
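Beyond the current usage, the cgroup directory also records how often the limits were hit. A small sketch, assuming cgroup v2 and a hypothetical function name cg_mem_report, that prints usage against the hard limit and pulls the two most useful counters from memory.events:

```shell
# Summarize memory usage and limit events for one cgroup v2 directory.
# $1 = cgroup path, e.g. /sys/fs/cgroup/system.slice/myservice.service
cg_mem_report() {
  cg=$1
  cur=$(cat "$cg/memory.current")
  max=$(cat "$cg/memory.max")
  if [ "$max" = "max" ]; then
    # The literal string "max" means no hard limit is configured.
    echo "usage=${cur}B limit=none"
  else
    echo "usage=${cur}B limit=${max}B pct=$(( cur * 100 / max ))"
  fi
  # high     = times the cgroup was throttled for exceeding memory.high
  # oom_kill = processes in this cgroup killed by the OOM killer
  awk '$1 == "high" || $1 == "oom_kill" { printf "%s=%s\n", $1, $2 }' "$cg/memory.events"
}
```

A climbing high counter with oom_kill still at zero is the back-pressure doing its job; a nonzero oom_kill means the hard cap was reached.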

The overcommit knob

Linux lets processes allocate more virtual memory than physically exists, betting they won’t touch it all. That’s vm.overcommit_memory:

0 (default): heuristic; allows reasonable overcommit and refuses only obviously excessive requests
1: always overcommit; never refuse an allocation (risky, used by some in-memory DBs)
2: strict; refuse allocations beyond swap + RAM * overcommit_ratio / 100 (overcommit_ratio is a percentage, default 50)

sysctl vm.overcommit_memory
sysctl vm.overcommit_ratio

Mode 2 trades “random late OOM kill” for “malloc fails early and predictably.” For a single-purpose box running one critical service, that predictability can be worth it — the app gets a clean allocation error instead of a surprise execution. Test it hard before committing; many applications handle malloc failure poorly.
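Before flipping to mode 2, it helps to know how close the box already is to the limit it would start enforcing. The kernel reports both numbers in /proc/meminfo regardless of mode. A minimal sketch, with commit_headroom as a hypothetical name:

```shell
# Compare Committed_AS (memory promised to processes) with CommitLimit
# (the ceiling that vm.overcommit_memory=2 would enforce).
# $1 = meminfo file (default /proc/meminfo).
commit_headroom() {
  awk '
    $1 == "CommitLimit:"  { limit = $2 }
    $1 == "Committed_AS:" { used = $2 }
    END {
      if (limit == 0) { print "CommitLimit not found"; exit 1 }
      printf "committed=%d kB limit=%d kB (%.0f%%)\n", used, limit, used * 100 / limit
    }
  ' "${1:-/proc/meminfo}"
}
```

If the percentage is already near or above 100 under normal load, switching to strict mode without raising overcommit_ratio or adding swap will start failing allocations immediately.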

Don’t forget swap and the early-OOM idea

A little swap gives the kernel somewhere to push cold pages so it isn’t forced to kill on a transient spike. But a system that’s thrashing in swap is arguably worse than a clean OOM kill, because it goes unresponsive for minutes. Tools like earlyoom and systemd-oomd act before the kernel’s last-resort killer does: earlyoom polls available memory and free swap, while systemd-oomd watches memory pressure (PSI) and swap usage per cgroup. Both kill earlier and more selectively, keeping the box responsive:

systemctl status systemd-oomd
cat /proc/pressure/memory      # PSI: how stalled the system is on memory

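/proc/pressure/memory is the metric to alert on. The “some” line is the share of time at least one task was stalled waiting on memory; the “full” line is the share of time all non-idle tasks were stalled at once. Rising averages on either line mean memory pressure before anything dies.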
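As a starting point for that alert, here is a sketch of a check that exits nonzero when the 10-second “some” average crosses a threshold. The name psi_mem_alert and the default threshold of 10 are my assumptions, not recommended values; tune against your own baseline.

```shell
# Exit 1 if the "some" avg10 memory pressure exceeds a threshold (percent).
# $1 = threshold (default 10); $2 = PSI file (default /proc/pressure/memory).
psi_mem_alert() {
  awk -v limit="${1:-10}" '
    BEGIN { bad = 0 }
    $1 == "some" {
      # Second field looks like "avg10=1.23"
      split($2, kv, "=")
      printf "some avg10=%s\n", kv[2]
      if (kv[2] + 0 > limit + 0) bad = 1
    }
    END { exit bad }
  ' "${2:-/proc/pressure/memory}"
}
```

Wired into cron or a monitoring agent, something like psi_mem_alert 10 || notify-oncall (the second command is a placeholder) gives you a page while the service is still alive.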

A practical playbook

Confirm OOM and read the kill-time table in dmesg.
Identify the cause process vs the victim process from that table.
Cap the cause with cgroup MemoryMax/MemoryHigh so it can’t starve the box.
Protect the critical service with OOMScoreAdjust.
Alert on /proc/pressure/memory, not on free memory.

Where AI helps

The kill-time process table in dmesg is dense and easy to misread under pressure. Pasting it into a model and asking “which process actually caused this and which was collateral, ranked by RSS and oom_score_adj” turns a wall of numbers into a ranked answer fast. I keep a few Linux admin prompts for exactly this kind of log triage.

The OOM killer isn’t your enemy; it’s a last-resort safety valve doing its best with bad information. Give it better information — cgroup limits and score adjustments — and it’ll start killing the right thing.

Generated commands and configs are assistive, not authoritative. Always verify against your own systems before applying changes in production.