You are a senior Linux performance engineer who can read an OOM-killer banner in `dmesg` like a book: process scores, RSS columns, slab stats, cgroup boundaries. You know which "memory pressure" alerts are real and which are just page cache doing its job.

I will provide:
- The symptom (OOM kill of a specific process, swap thrashing, app reports allocation failures, latency spikes traced to memory pressure)
- System info: total RAM, swap config, whether running in a container/VM, cgroup v1 vs v2
- `dmesg | grep -A30 -i "out of memory"` or `journalctl -k --since "1h ago" | grep -A30 -i oom`
- `free -h` and `cat /proc/meminfo`
- `ps auxf --sort=-rss | head -20` (top RSS consumers)
- For container/k8s: `cat /sys/fs/cgroup/memory.max` (cgroup v2) and `memory.current` for the affected cgroup
- App-side context: GC logs (if JVM/.NET), language runtime (Python, Go, Node)

Your job:

1. **Decode the OOM banner** if present:
   - Which cgroup triggered the OOM (`oom-kill: ... oom_memcg=...`)?
   - Was it system-wide OOM or container/cgroup OOM?
   - Which process was killed and what was its `oom_score` / `oom_score_adj`?
   - Was the kill due to RSS exceeding limit, or due to host RAM exhaustion?
2. **Account for memory honestly** using `/proc/meminfo`:
   - `MemTotal - MemAvailable` = real "in use" memory
   - `Buffers + Cached + SReclaimable` = page cache and reclaimable slab (will shrink under pressure, NOT lost); note that `Cached` includes `Shmem`, which cannot be dropped without swap
   - `Slab - SReclaimable` = SUnreclaim (kernel data structures, won't free)
   - `AnonPages` = process anonymous memory (the actual scary number); `Mapped` is file-backed mmap and is already counted in `Cached`
3. **Distinguish "low MemFree" from "low MemAvailable"**: page cache is good, not bad. Alerting on `MemFree` is almost always wrong.
4. **Check for slab bloat** (`SUnreclaim` growth): usually a kernel object leak (dentry cache, inode cache, network connections).
5. **Check swap behavior**: `swappiness`, `pswpin/pswpout` rates. Thrashing (constant in+out) is worse than no swap.
6. **For cgroup v2**: check `memory.events` (`high`, `max`, `oom`, `oom_kill` counters) and `memory.pressure` (PSI).
7. **Identify the leak source**: process-level RSS growth, kernel slab growth, anonymous huge pages, transparent hugepage compaction overhead, or orphaned memory that no live process accounts for (e.g., tmpfs or SysV shared memory).
8. **Recommend the fix**: cgroup limit adjust, swappiness tuning, THP off, oom_score_adj, application memory tuning. Mark anything DESTRUCTIVE.

Common failure classes to surface:
- Container OOM killed but host has free RAM → cgroup limit set too low
- Host OOM under "plenty of free memory" → page cache being reclaimed too slowly (zone reclaim, NUMA imbalance)
- Slow memory leak in production app → RSS grows steadily, GC unable to recover
- Slab cache bloat from container churn → millions of dentries from short-lived files
- THP compaction storms → high `%sy` CPU during memory pressure
- The OOM killer picked the wrong process → `oom_score_adj` not set on critical service

---

System type: [bare metal / VM / container]
Total RAM: [N GB]
Swap config: [N GB / none / zram]
Cgroup version: [v1 / v2]
Distro + kernel: [e.g., Ubuntu 22.04, 5.15.0-...]
Symptom: [DESCRIBE]

OOM banner (`dmesg` or `journalctl -k`):
```
[PASTE]
```

`free -h`:
```
[PASTE]
```

`cat /proc/meminfo`:
```
[PASTE first 30 lines]
```

Top RSS processes:
```
[PASTE `ps auxf --sort=-rss | head -20`]
```

Cgroup limits (if applicable):
```
[PASTE memory.max, memory.current, memory.events]
```

Why this prompt works

OOM kills are usually misdiagnosed. The most common wrong answer is “we need more RAM” when the actual problem is a cgroup limit too low, a slab leak, or page cache being mistaken for “used” memory. This prompt forces honest memory accounting via /proc/meminfo and decodes the OOM banner properly.

How to use it

Always include the OOM banner. It tells you which cgroup, what limit, which process, and the per-task table (RSS and oom_score_adj for every candidate). Without it you’re guessing.
free -h alone is not enough. Include cat /proc/meminfo | head -30. The page cache vs anon split matters.
For containers: include the cgroup files. Without them, you can’t distinguish container OOM from host OOM.
If you suspect a slab leak, capture slabtop -o | head -30 — top kernel slab consumers tell you what’s leaking (often dentry or kmalloc-*).
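
To make the container-versus-host distinction concrete, here is a minimal sketch that reads the three cgroup v2 files from one directory and prints a rough verdict. The function name `cgroup_oom_summary` and the verdict labels are illustrative, not part of any tool. It assumes cgroup v2 and treats the `oom` counter as "a limit was hit in this cgroup or a child", since `memory.events` is hierarchical; on newer kernels `memory.events.local` may be the better file for a strictly local answer.

```bash
# Hypothetical helper: summarize OOM evidence for one cgroup v2 directory.
# Usage: cgroup_oom_summary /sys/fs/cgroup/<slice>
cgroup_oom_summary() {
  local dir="$1" limit current oom oom_kill
  limit=$(cat "$dir/memory.max")        # the literal string "max" means unlimited
  current=$(cat "$dir/memory.current")
  # memory.events is a flat "key value" file; pick out the two OOM counters.
  oom=$(awk '$1 == "oom" { print $2 }' "$dir/memory.events")
  oom_kill=$(awk '$1 == "oom_kill" { print $2 }' "$dir/memory.events")
  oom=${oom:-0}
  oom_kill=${oom_kill:-0}
  echo "limit=$limit current=$current oom=$oom oom_kill=$oom_kill"
  if [ "$oom" -gt 0 ]; then
    # Usage reached a limit here (or in a child cgroup).
    echo "verdict=cgroup-limit"
  elif [ "$oom_kill" -gt 0 ]; then
    # A task was killed, but this cgroup never hit its own limit.
    echo "verdict=host-or-parent"
  else
    echo "verdict=none"
  fi
}
```

Treat the verdict as a starting hypothesis and confirm it against the `oom_memcg=` field in the kernel banner.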

Useful commands

# OOM evidence
sudo dmesg -T | grep -A30 -i "out of memory"
sudo journalctl -k --since "1 hour ago" | grep -A30 -i oom
sudo journalctl _TRANSPORT=kernel | grep -A20 oom

# Honest memory accounting
free -h
cat /proc/meminfo | head -30
cat /proc/vmstat | grep -E "pgscan|pgsteal|pgfault|pswp|oom"

# Top RSS consumers
ps auxf --sort=-rss | head -20
# Sum RSS (KiB) by command name; the empty "=" headers suppress the ps header line
ps -e -o rss= -o comm= | awk '{a[$2]+=$1} END {for (i in a) print a[i], i}' | sort -n | tail

# Process-detailed memory
cat /proc/<pid>/status | grep -E "Vm|Rss"
cat /proc/<pid>/smaps_rollup
pmap -X <pid> | tail -5

# Slab cache
sudo slabtop -o | head -30
sudo cat /proc/slabinfo | sort -k2 -n -r | head -20

# Cgroup v2 (modern systemd / k8s)
cat /sys/fs/cgroup/<slice>/memory.max
cat /sys/fs/cgroup/<slice>/memory.current
cat /sys/fs/cgroup/<slice>/memory.events
cat /sys/fs/cgroup/<slice>/memory.pressure

# Cgroup v1 (legacy)
cat /sys/fs/cgroup/memory/<slice>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/<slice>/memory.usage_in_bytes

# Swap activity
vmstat 1 5    # si/so columns
sar -B 1 5    # paging stats
sar -W 1 5    # swap rate

# THP
cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i AnonHugePages /proc/meminfo

# Per-process OOM score
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj
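
The per-process OOM score files can also be ranked across every process, which shows who the kernel would pick next. The sketch below is one way to do that. `rank_oom_scores` is a made-up name, and the optional first argument (an alternate proc root) exists only so the function can be exercised against a test directory.

```bash
# Hypothetical helper: list processes by oom_score, highest first.
# Output columns: oom_score oom_score_adj pid comm
# Usage: rank_oom_scores            (top 10 from /proc)
#        rank_oom_scores /proc 25   (top 25)
rank_oom_scores() {
  local proc_root="${1:-/proc}" count="${2:-10}" dir pid
  for dir in "$proc_root"/[0-9]*; do
    # Skip entries that vanished or that we cannot read.
    [ -r "$dir/oom_score" ] || continue
    pid=${dir##*/}
    printf '%s %s %s %s\n' \
      "$(cat "$dir/oom_score" 2>/dev/null)" \
      "$(cat "$dir/oom_score_adj" 2>/dev/null)" \
      "$pid" \
      "$(cat "$dir/comm" 2>/dev/null)"
  done | sort -k1,1nr | head -n "$count"
}
```

A critical service that appears near the top of this list with an `oom_score_adj` of 0 is a candidate for the `OOMScoreAdjust` setting described under common findings.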

Common findings this catches

Container OOM but host has 20G free → memory.max set too low. Either raise the limit or fix the app’s working set.
Slow leak in kmalloc-128 slab → kernel object leak; often a driver bug. Check slabtop deltas over time.
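“OOM” but Cached: 80% of RAM → page cache wasn’t reclaimed fast enough, or much of it is Shmem (tmpfs/shared memory), which is counted in Cached but cannot be dropped. Often a NUMA-zone issue; check numastat -m and the Shmem line.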
OOM killed sshd → oom_score_adj not set on critical services. Add to systemd unit:
```
[Service]
OOMScoreAdjust=-900
```
THP compaction storms → %sy CPU spikes during memory pressure. Disable for databases (this is a runtime change; reapply it at boot to persist):
```
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```
Swap thrashing (si and so both sustained above 1000/s in vmstat 1, which reports KiB/s by default) → set vm.swappiness=1 and consider zram for ephemeral nodes.
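
When `vmstat` is unavailable, the swap-thrashing check can be approximated from two copies of `/proc/vmstat` taken a known number of seconds apart. This is a rough sketch: `swap_rates` is a hypothetical name, and the result is in pages per second (the unit of the `pswpin`/`pswpout` counters), not the KiB/s that `vmstat` prints.

```bash
# Hypothetical helper: swap-in/out rate between two /proc/vmstat snapshots.
# Usage:
#   cat /proc/vmstat > /tmp/vm.before; sleep 10; cat /proc/vmstat > /tmp/vm.after
#   swap_rates /tmp/vm.before /tmp/vm.after 10
swap_rates() {
  local before="$1" after="$2" seconds="$3"
  awk -v secs="$seconds" '
    # First file: remember every counter value by name.
    NR == FNR { old[$1] = $2; next }
    # Second file: report the per-second delta for the two swap counters.
    $1 == "pswpin" || $1 == "pswpout" {
      printf "%s_per_sec=%d\n", $1, ($2 - old[$1]) / secs
    }
  ' "$before" "$after"
}
```

Both rates staying high at the same time is the thrashing signature; a large one-way `pswpout` burst on its own usually just means cold pages being evicted.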

Memory accounting cheatsheet

MemTotal          - total RAM (excludes a few reserved regions)
MemFree           - completely unused (NOT what you should alert on)
MemAvailable      - free + reclaimable cache (THIS is "free for new allocs")
Buffers           - block device cache
Cached            - file page cache (good thing!)
SReclaimable      - reclaimable slab (dentry, inode caches)
SUnreclaim        - non-reclaimable slab (kernel objects; leaks live here)
AnonPages         - anonymous (process heap/stack); the real "memory in use"
Mapped            - mmap'd files (in Cached but charged here too)
Shmem             - tmpfs / shared memory (counts as "used")
PageTables        - kernel page-table overhead (grows with process count and mapped address space)
Slab              - SReclaimable + SUnreclaim
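
The cheatsheet can be turned into a quick breakdown with a few lines of awk. The sketch below is illustrative only. `meminfo_breakdown` and its output keys are invented names, and it makes one stated assumption: `Shmem` is subtracted from `Cached` so that tmpfs and shared memory are not reported as reclaimable cache.

```bash
# Hypothetical helper: bucket a meminfo-format file the way the cheatsheet does.
# Usage: meminfo_breakdown            (reads /proc/meminfo)
#        meminfo_breakdown saved.txt  (reads a captured copy)
meminfo_breakdown() {
  local file="${1:-/proc/meminfo}"
  awk '
    # Each line looks like "MemTotal:  16384000 kB"; drop the colon, keep the kB value.
    { gsub(":", "", $1); kb[$1] = $2 }
    END {
      used  = kb["MemTotal"] - kb["MemAvailable"]
      # Assumption: Shmem lives inside Cached but is not droppable, so remove it.
      cache = kb["Buffers"] + kb["Cached"] - kb["Shmem"] + kb["SReclaimable"]
      printf "in_use_mib=%d\n",            used / 1024
      printf "reclaimable_cache_mib=%d\n", cache / 1024
      printf "anon_mib=%d\n",              kb["AnonPages"] / 1024
      printf "shmem_mib=%d\n",             kb["Shmem"] / 1024
      printf "sunreclaim_mib=%d\n",        kb["SUnreclaim"] / 1024
      printf "page_tables_mib=%d\n",       kb["PageTables"] / 1024
    }
  ' "$file"
}
```

If `anon_mib` plus `shmem_mib` plus `sunreclaim_mib` roughly explains `in_use_mib`, the accounting is consistent; a large unexplained gap is worth a closer look at slab and page tables.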

Permanent fixes worth applying after recovery

# systemd unit hardening for a critical service
[Service]
OOMScoreAdjust=-900
# soft pressure throttle (cgroup v2)
MemoryHigh=4G
# hard limit (cgroup v2)
MemoryMax=6G

# sysctl baseline (review per workload)
vm.swappiness = 10
# 512 MB reserve on a 16+ GB box
vm.min_free_kbytes = 524288
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

When to escalate

Suspected kernel slab leak (SUnreclaim growing for days with no userspace correlation) — engage kernel team; reproduce, capture slabinfo deltas.
Repeated OOMs on a container with stable working set — application change or limit is wrong; coordinate with app owner.
OOM killing system-critical processes (sshd, systemd, kubelet) — fix OOMScoreAdjust urgently and root-cause the leaker.
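
For the slab-leak escalation, the "slabinfo deltas" can be captured by diffing two saved copies of `/proc/slabinfo`. The following is a sketch under simple assumptions: `slab_delta` is a hypothetical name, and cache size is approximated as `num_objs * objsize`, which ignores per-slab overhead.

```bash
# Hypothetical helper: per-cache growth (KiB) between two /proc/slabinfo snapshots.
# Usage:
#   sudo cat /proc/slabinfo > /tmp/slab.before
#   ... wait while the suspected leak progresses ...
#   sudo cat /proc/slabinfo > /tmp/slab.after
#   slab_delta /tmp/slab.before /tmp/slab.after 20
slab_delta() {
  local before="$1" after="$2" count="${3:-10}"
  awk '
    # Skip the two header lines ("slabinfo - version ..." and "# name ...").
    /^slabinfo/ || /^#/ { next }
    # First file: approximate bytes per cache as num_objs ($3) * objsize ($4).
    NR == FNR { old[$1] = $3 * $4; next }
    # Second file: print caches whose approximate size changed.
    {
      delta = $3 * $4 - old[$1]
      if (delta != 0) printf "%d %s\n", delta / 1024, $1
    }
  ' "$before" "$after" | sort -k1,1nr | head -n "$count"
}
```

Attaching this output from several intervals gives the kernel team a trend per cache instead of a single snapshot.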
