Linux OOM Kill & Memory Pressure Investigation Prompt

Diagnose OOM kills, memory pressure, swap thrashing, slab bloat, and cgroup memory limit failures on Linux servers from dmesg OOM banners and /proc data.

Target user: Linux sysadmins, SREs, and on-call engineers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Linux performance engineer who can read an OOM-killer banner in `dmesg` like a book — process scores, RSS columns, slab stats, cgroup boundaries. You know which "memory pressure" alerts are real and which are just page cache doing its job.

I will provide:
- The symptom (OOM kill of a specific process, swap thrashing, app reports allocation failures, latency spikes traced to memory pressure)
- System info: total RAM, swap config, whether running in a container/VM, cgroup v1 vs v2
- `dmesg | grep -A30 -i "out of memory"` or `journalctl -k --since "1h ago" | grep -A30 -i oom`
- `free -h` and `cat /proc/meminfo`
- `ps auxf --sort=-rss | head -20` (top RSS consumers)
- For container/k8s: `cat /sys/fs/cgroup/memory.max` (cgroup v2) and `memory.current` for the affected cgroup
- App-side context: GC logs (if JVM/.NET), language runtime (Python, Go, Node)

Your job:

1. **Decode the OOM banner** if present:
   - Which cgroup triggered the OOM (`oom-kill: ... oom_memcg=...`)?
   - Was it system-wide OOM or container/cgroup OOM?
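   - Which process was killed, and what were its `oom_score_adj` and RSS in the banner's task table? (`oom_score` itself only appears in `/proc/<pid>/oom_score`.)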
   - Was the kill due to RSS exceeding limit, or due to host RAM exhaustion?
2. **Account for memory honestly** using `/proc/meminfo`:
   - `MemTotal - MemAvailable` = real "in use" memory
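   - `Buffers + Cached + SReclaimable` ≈ reclaimable cache (page cache plus reclaimable slab; will shrink under pressure, NOT lost). Subtract `Shmem`, which is counted in `Cached` but cannot be dropped without swap
   - `Slab - SReclaimable` = SUnreclaim (kernel data structures, won't free)
   - `AnonPages` + `Shmem` = memory that cannot be reclaimed without swap (the actual scary number). `Mapped` is file-backed and already counted in `Cached`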
3. **Distinguish "low MemFree" from "low MemAvailable"** — page cache is good, not bad. Alerting on `MemFree` is almost always wrong.
4. **Check for slab bloat** (`SUnreclaim` growth) — usually kernel object leak (dentry cache, inode cache, network connections).
5. **Check swap behavior** — `swappiness`, `pswpin/pswpout` rates. Thrashing (constant in+out) is worse than no swap.
6. **For cgroup v2**: check `memory.events` (`high`, `max`, `oom`, `oom_kill` counters) and `memory.pressure` (PSI).
7. **Identify the leak source**: process-level RSS growth, kernel slab growth, anonymous huge pages, transparent hugepage compaction overhead, or orphaned shared memory (tmpfs files or SysV segments that nothing cleans up).
8. **Recommend the fix**: cgroup limit adjust, swappiness tuning, THP off, oom_score_adj, application memory tuning. Mark anything DESTRUCTIVE.

Common failure classes to surface:
- Container OOM killed but host has free RAM → cgroup limit set too low
- Host OOM under "plenty of free memory" → page cache being reclaimed too slowly (zone reclaim, NUMA imbalance)
- Slow memory leak in production app → RSS grows steadily, GC unable to recover
- Slab cache bloat from container churn → millions of dentries from short-lived files
- THP compaction storms → high `%sy` CPU during memory pressure
- The OOM killer picked the wrong process → `oom_score_adj` not set on critical service

---

System type: [bare metal / VM / container]
Total RAM: [N GB]
Swap config: [N GB / none / zram]
Cgroup version: [v1 / v2]
Distro + kernel: [e.g., Ubuntu 22.04, 5.15.0-...]
Symptom: [DESCRIBE]
OOM banner (`dmesg` or `journalctl -k`):
```
[PASTE]
```
`free -h`:
```
[PASTE]
```
`cat /proc/meminfo`:
```
[PASTE first 30 lines]
```
Top RSS processes:
```
[PASTE `ps auxf --sort=-rss | head -20`]
```
Cgroup limits (if applicable):
```
[PASTE memory.max, memory.current, memory.events]
```

Why this prompt works

OOM kills are usually misdiagnosed. The most common wrong answer is “we need more RAM” when the actual problem is a cgroup limit too low, a slab leak, or page cache being mistaken for “used” memory. This prompt forces honest memory accounting via /proc/meminfo and decodes the OOM banner properly.

How to use it

  1. Always include the OOM banner. It tells you which cgroup, what limit, which process, and the oom_score table. Without it you’re guessing.
  2. free -h alone is not enough. Include cat /proc/meminfo | head -30. The page cache vs anon split matters.
  3. For containers: include the cgroup files. Without them, you can’t distinguish container OOM from host OOM.
  4. If you suspect a slab leak, capture slabtop -o | head -30 — top kernel slab consumers tell you what’s leaking (often dentry or kmalloc-*).
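
As a rough illustration of the first point, the sketch below labels each kill line in a kernel log as a cgroup-limit kill or a host-wide kill. It assumes the message wording used by recent kernels ("Memory cgroup out of memory: Killed process ..." versus "Out of memory: Killed process ..."); older kernels word these lines differently, and the function name `classify_oom_kills` is made up for this example.

```shell
# classify_oom_kills: read kernel log text on stdin, print one line per kill.
# Assumption: recent-kernel wording, where cgroup kills carry the
# "Memory cgroup out of memory" prefix and host-wide kills do not.
classify_oom_kills() {
    awk '
        /Killed process [0-9]+/ {
            scope = ($0 ~ /Memory cgroup out of memory/) ? "cgroup" : "host"
            # Extract "<pid> (<comm>)"; "Killed process " is 15 characters long.
            if (match($0, /Killed process [0-9]+ \([^)]*\)/)) {
                victim = substr($0, RSTART + 15, RLENGTH - 15)
                print scope, victim
            }
        }
    '
}
```

Typical use would be `sudo dmesg -T | classify_oom_kills`. A `cgroup` result points at the container limit; a `host` result points at RAM exhaustion on the machine itself.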

Useful commands

# OOM evidence
sudo dmesg -T | grep -A30 -i "out of memory"
sudo journalctl -k --since "1 hour ago" | grep -A30 -i oom
sudo journalctl _TRANSPORT=kernel | grep -A20 oom

# Honest memory accounting
free -h
cat /proc/meminfo | head -30
cat /proc/vmstat | grep -E "pgscan|pgsteal|pgfault|pswp|oom"

# Top RSS consumers
ps auxf --sort=-rss | head -20
# Sum RSS by command name
ps -eo rss=,comm= | awk '{a[$2]+=$1} END {for (i in a) print a[i], i}' | sort -n | tail

# Process-detailed memory
cat /proc/<pid>/status | grep -E "Vm|Rss"
cat /proc/<pid>/smaps_rollup
pmap -X <pid> | tail -5

# Slab cache
sudo slabtop -o | head -30
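# Rank caches by footprint (num_objs * objsize), skipping the two header lines
sudo awk 'NR > 2 {printf "%d KiB %s\n", $3 * $4 / 1024, $1}' /proc/slabinfo | sort -rn | head -20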

# Cgroup v2 (modern systemd / k8s)
cat /sys/fs/cgroup/<slice>/memory.max
cat /sys/fs/cgroup/<slice>/memory.current
cat /sys/fs/cgroup/<slice>/memory.events
cat /sys/fs/cgroup/<slice>/memory.pressure

# Cgroup v1 (legacy)
cat /sys/fs/cgroup/memory/<slice>/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/<slice>/memory.usage_in_bytes

# Swap activity
vmstat 1 5    # si/so columns
sar -B 1 5    # paging stats
sar -W 1 5    # swap rate

# THP
cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i AnonHugePages /proc/meminfo

# Per-process OOM score
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj
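
Where `vmstat` or `sar` is not installed, swap rates can be derived from two snapshots of `/proc/vmstat`. The following is a minimal sketch; the function name `swap_rates` and the snapshot paths are illustrative, and the result is in pages per second (not the KiB/s that `vmstat` reports).

```shell
# swap_rates BEFORE AFTER SECONDS
# BEFORE and AFTER are copies of /proc/vmstat taken SECONDS apart.
# Prints pages swapped in and out per second over that window.
swap_rates() {
    awk -v secs="$3" '
        $1 == "pswpin" || $1 == "pswpout" {
            # FNR == NR only holds while the first file is being read.
            if (FNR == NR) { before[$1] = $2 } else { after[$1] = $2 }
        }
        END {
            printf "pswpin/s=%d pswpout/s=%d\n",
                (after["pswpin"] - before["pswpin"]) / secs,
                (after["pswpout"] - before["pswpout"]) / secs
        }
    ' "$1" "$2"
}

# Example (paths are placeholders):
#   cat /proc/vmstat > /tmp/vmstat.before; sleep 5; cat /proc/vmstat > /tmp/vmstat.after
#   swap_rates /tmp/vmstat.before /tmp/vmstat.after 5
```

Both rates staying high at the same time is the thrashing signature; a one-sided burst of swap-out is usually just cold pages being evicted.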

Common findings this catches

  • Container OOM but host has 20G free → memory.max set too low. Either raise the limit or fix the app’s working set.
  • Slow leak in kmalloc-128 slab → kernel object leak; often a driver bug. Check slabtop deltas over time.
  • “OOM” but Cached: 80% of RAM → page cache wasn’t reclaimed fast enough, or much of Cached is really Shmem (tmpfs), which can’t be dropped. Check Shmem in /proc/meminfo; for a NUMA-zone issue, check numastat -m.
  • OOM killed sshd → oom_score_adj not set on critical services. Add to systemd unit:
    [Service]
    OOMScoreAdjust=-900
  • THP compaction storms → %sy CPU spikes during memory pressure. Disable for databases:
    echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
  • Swap thrashing (si/so both > 1000/sec) → set vm.swappiness=1 and consider zram for ephemeral nodes.
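
For the first finding, a quick way to see how close a container sits to its limit is to compare `memory.current` with `memory.max` and read the `oom_kill` counter from `memory.events`. This is a sketch for cgroup v2 only; the function name `cgroup_mem_usage` and the output format are invented for illustration.

```shell
# cgroup_mem_usage DIR
# DIR is a cgroup v2 directory holding memory.current, memory.max and
# (optionally) memory.events. Prints usage as a percentage of the limit,
# or "unlimited" when memory.max contains the literal string "max".
cgroup_mem_usage() {
    cg_dir=$1
    cg_current=$(cat "$cg_dir/memory.current") || return 1
    cg_limit=$(cat "$cg_dir/memory.max") || return 1
    cg_kills=$(awk '$1 == "oom_kill" { print $2 }' "$cg_dir/memory.events" 2>/dev/null)
    if [ "$cg_limit" = "max" ]; then
        cg_pct="unlimited"
    else
        cg_pct=$(awk -v c="$cg_current" -v m="$cg_limit" \
            'BEGIN { printf "%.1f%%", 100 * c / m }')
    fi
    echo "usage=$cg_pct oom_kill=${cg_kills:-0}"
}
```

It would be called as `cgroup_mem_usage /sys/fs/cgroup/<slice>`. A non-zero `oom_kill` alongside usage near 100% while the host still has `MemAvailable` to spare is the "limit too low" pattern.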

Memory accounting cheatsheet

MemTotal          - total RAM (excludes a few reserved regions)
MemFree           - completely unused (NOT what you should alert on)
MemAvailable      - free + reclaimable cache (THIS is "free for new allocs")
Buffers           - block device cache
Cached            - file page cache (good thing!); also includes Shmem
SReclaimable      - reclaimable slab (dentry, inode caches)
SUnreclaim        - non-reclaimable slab (kernel objects; leaks live here)
AnonPages         - anonymous (process heap/stack); the real "memory in use"
Mapped            - mmap'd files (in Cached but charged here too)
Shmem             - tmpfs / shared memory (counts as "used"; only swap can evict it)
PageTables        - kernel page-table overhead (grows with process count × mapped memory)
Slab              - SReclaimable + SUnreclaim
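
The cheatsheet can be turned into a small summary script. The sketch below is one possible grouping, not a canonical formula: it treats `Buffers + Cached - Shmem + SReclaimable` as the reclaimable estimate and `AnonPages + Shmem` as memory that only swap can relieve. The function name `meminfo_summary` and the output keys are illustrative.

```shell
# meminfo_summary [FILE]
# Summarise a meminfo-format file (default: /proc/meminfo) in MiB.
meminfo_summary() {
    awk '
        { sub(":", "", $1); kb[$1] = $2 }   # meminfo values are in kB
        END {
            printf "in_use_mib=%d\n", (kb["MemTotal"] - kb["MemAvailable"]) / 1024
            printf "reclaimable_cache_mib=%d\n", (kb["Buffers"] + kb["Cached"] - kb["Shmem"] + kb["SReclaimable"]) / 1024
            printf "anon_plus_shmem_mib=%d\n", (kb["AnonPages"] + kb["Shmem"]) / 1024
            printf "slab_unreclaim_mib=%d\n", kb["SUnreclaim"] / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

Running `meminfo_summary` with no argument reads the live system; passing a saved copy of `/proc/meminfo` lets you compare snapshots from before and after an incident.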

Permanent fixes worth applying after recovery

# systemd unit hardening for a critical service
# (unit files do not allow trailing comments, so notes sit on their own lines)
[Service]
OOMScoreAdjust=-900
# soft pressure throttle (cgroup v2)
MemoryHigh=4G
# hard limit (cgroup v2)
MemoryMax=6G

# sysctl baseline (review per workload)
vm.swappiness = 10
# 512 MB reserve on a 16+ GB box
vm.min_free_kbytes = 524288
vm.dirty_ratio = 10
vm.dirty_background_ratio = 5

When to escalate

  • Suspected kernel slab leak (SUnreclaim growing for days with no userspace correlation) — engage kernel team; reproduce, capture slabinfo deltas.
  • Repeated OOMs on a container with stable working set — application change or limit is wrong; coordinate with app owner.
  • OOM killing system-critical processes (sshd, systemd, kubelet) — fix OOMScoreAdjust urgently and root-cause the leaker.
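
For the slab-leak case, the deltas are easier to hand over if they are already computed. The sketch below compares two saved copies of `/proc/slabinfo` and lists the caches whose footprint grew. It assumes the slabinfo 2.x column order (name, active_objs, num_objs, objsize); the function name `slab_growth` and the snapshot paths are illustrative.

```shell
# slab_growth BEFORE AFTER
# BEFORE and AFTER are copies of /proc/slabinfo (reading it needs root).
# Prints "<KiB grown> <cache name>" for the ten fastest-growing caches,
# where footprint is estimated as num_objs * objsize.
slab_growth() {
    awk '
        /^(slabinfo|#)/ { next }                   # skip header lines
        FNR == NR { before[$1] = $3 * $4; next }   # first file: bytes per cache
        {
            delta = $3 * $4 - before[$1]
            if (delta > 0) printf "%d %s\n", delta / 1024, $1
        }
    ' "$1" "$2" | sort -rn | head -10
}

# Example (paths are placeholders):
#   sudo cat /proc/slabinfo > /tmp/slab.before
#   ... wait long enough for the growth to show ...
#   sudo cat /proc/slabinfo > /tmp/slab.after
#   slab_growth /tmp/slab.before /tmp/slab.after
```

A cache that tops this list across several intervals, with no matching userspace activity, is the evidence worth attaching to the escalation.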
