AI for Linux Admins

Linux Slab & Kernel Memory Leak Investigation Prompt

Diagnose growing kernel/unreclaimable memory — slab caches, dentry/inode bloat, kmalloc leaks, and page-cache vs SReclaimable confusion — when free RAM shrinks but no process owns it.

Target user: Linux admins chasing memory growth that doesn't show up in top/RSS
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Linux memory forensics expert who can explain where every megabyte went when `top` shows low process RSS but free memory keeps dropping.

I will provide:
- `cat /proc/meminfo` (ideally two snapshots over time)
- `slabtop -o` (or `/proc/slabinfo`), ideally captured at the same two points in time as the meminfo snapshots, plus `cat /proc/zoneinfo` and `vmstat -s` if available
- The symptom (slow memory growth, reclaim pressure, eventual OOM despite "idle" processes)
- Workload context (heavy file churn, many short-lived processes, network sockets, a custom kernel module)

Your job:

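1. **Account for memory first**: from `/proc/meminfo`, partition used memory (`MemTotal` minus `MemFree`) into: process anonymous memory (`AnonPages`, roughly what shows up as RSS), page cache (`Cached` + `Buffers`; `Cached` already includes `Shmem`, so do not count it twice), reclaimable slab (`SReclaimable`), unreclaimable slab (`SUnreclaim`), `KernelStack`, `PageTables`, and `Shmem`. Flag any remainder these fields do not explain, since memory that drivers take straight from the page allocator is not itemized in meminfo. Tell me which bucket is actually growing; most "leaks" are just page cache (benign) or reclaimable dentry/inode slab.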

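2. **Slab breakdown**: from `slabtop` (comparing snapshots where I gave more than one), identify the fastest-growing caches. Decode the usual suspects: `dentry` + `inode_cache` and filesystem-specific inode caches such as `ext4_inode_cache` or `xfs_inode` (filesystem metadata churn), `kmalloc-*` (generic allocations, often a driver or module), `kmem_cache`, `buffer_head`, `radix_tree_node`, socket/`TCP` caches.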

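3. **Reclaimable vs real leak**: prove whether memory is reclaimable by triggering reclaim (`sync; echo 2 > /proc/sys/vm/drop_caches` in a SAFE test window; `2` frees reclaimable slab objects such as dentries and inodes, `3` also drops page cache) and re-reading meminfo. If `SReclaimable` falls back, it was cache, not a leak. If `SUnreclaim` keeps climbing across snapshots and never drops, treat THAT as a probable genuine kernel leak, after ruling out legitimate growth in live kernel objects (for example a rising number of open sockets).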

4. **Attribute a real leak**: for unreclaimable growth, point at the module/subsystem (recent `kmalloc` cache growth tied to a driver), reference kmemleak (`/sys/kernel/debug/kmemleak`, available only when the kernel is built with `CONFIG_DEBUG_KMEMLEAK` and debugfs is mounted), and check for a known regression in this kernel version.

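5. **Remediate**: for dentry/inode bloat, raise `vm.vfs_cache_pressure` above its default of 100 so the kernel reclaims those caches more aggressively, and fix the churn source; for a real leak, update or blacklist the leaking module, or schedule a reboot with monitoring. Never just run `drop_caches` from cron as a "fix."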
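
As a rough illustration of the accounting in step 1, the sketch below splits two `/proc/meminfo` snapshots into the buckets listed there and ranks the buckets by growth. The helper names (`parse_meminfo`, `buckets`, `growth`) and the snapshot values are invented for this example, and the bucket formula is one reasonable partition, not the only one.

```python
import re

def parse_meminfo(text):
    """Return {field: value} from the text of /proc/meminfo (values in kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields

def buckets(info):
    """Partition used memory into the step 1 buckets (kB)."""
    def get(key):
        return info.get(key, 0)

    shmem = get("Shmem")
    result = {
        "anon": get("AnonPages"),
        # Cached already includes Shmem, so subtract it to avoid double counting.
        "page_cache": get("Cached") + get("Buffers") - shmem,
        "shmem": shmem,
        "slab_reclaimable": get("SReclaimable"),
        "slab_unreclaimable": get("SUnreclaim"),
        "kernel_stack": get("KernelStack"),
        "page_tables": get("PageTables"),
    }
    used = get("MemTotal") - get("MemFree")
    # Whatever the named buckets do not explain (e.g. direct page allocations).
    result["unaccounted"] = used - sum(result.values())
    return result

def growth(before_text, after_text):
    """Return (bucket, delta_kB) pairs, largest growth first."""
    before = buckets(parse_meminfo(before_text))
    after = buckets(parse_meminfo(after_text))
    deltas = {name: after[name] - before[name] for name in after}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)

# Invented sample snapshots, trimmed to the fields used above.
SNAPSHOT_1 = """\
MemTotal:       16000000 kB
MemFree:        10000000 kB
Buffers:          100000 kB
Cached:          3000000 kB
AnonPages:       2000000 kB
Shmem:            200000 kB
KernelStack:       10000 kB
PageTables:        40000 kB
SReclaimable:     500000 kB
SUnreclaim:       300000 kB
"""

SNAPSHOT_2 = """\
MemTotal:       16000000 kB
MemFree:         9050000 kB
Buffers:          100000 kB
Cached:          3050000 kB
AnonPages:       2000000 kB
Shmem:            200000 kB
KernelStack:       10000 kB
PageTables:        40000 kB
SReclaimable:     500000 kB
SUnreclaim:      1200000 kB
"""

if __name__ == "__main__":
    for name, delta in growth(SNAPSHOT_1, SNAPSHOT_2):
        print(f"{name:20s} {delta:+10d} kB")
```

In this invented data the unreclaimable slab bucket accounts for almost all of the growth, which is the pattern that would justify moving on to steps 3 and 4 instead of stopping at "it's just cache."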

Output as: (a) memory accounting table showing the growing bucket, (b) slab top offenders, (c) reclaimable-vs-leak verdict with the drop_caches evidence, (d) attribution, (e) remediation + the meminfo metric to watch.

Anti-patterns to avoid: calling page cache a "leak," running `drop_caches` in production without a reason, blaming RSS when the growth is in slab, ignoring `vfs_cache_pressure`, recommending a reboot before identifying the bucket.
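
For the slab breakdown in step 2, a minimal sketch of ranking caches by growth between two `/proc/slabinfo` snapshots might look like the following. It assumes the version 2.1 column layout, a 4 KiB page size, and estimates each cache's footprint as `num_slabs * pagesperslab * page size`; the function names and the sample rows are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; adjust for other architectures

def parse_slabinfo(text):
    """Return {cache_name: approx_bytes} from /proc/slabinfo (version 2.1) text."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        # Data rows look like: "<stats> : tunables ... : slabdata ..."
        sections = line.split(":")
        if len(sections) != 3:
            continue
        head = sections[0].split()      # name, active_objs, num_objs, objsize, objperslab, pagesperslab
        slabdata = sections[2].split()  # "slabdata", active_slabs, num_slabs, sharedavail
        name = head[0]
        pages_per_slab = int(head[5])
        num_slabs = int(slabdata[2])
        sizes[name] = num_slabs * pages_per_slab * PAGE_SIZE
    return sizes

def top_growers(before_text, after_text, limit=5):
    """Return up to `limit` (cache, delta_bytes) pairs, largest growth first."""
    before = parse_slabinfo(before_text)
    after = parse_slabinfo(after_text)
    deltas = {name: size - before.get(name, 0) for name, size in after.items()}
    ranked = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

HEADER = (
    "slabinfo - version: 2.1\n"
    "# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>"
    " : tunables <limit> <batchcount> <sharedfactor>"
    " : slabdata <active_slabs> <num_slabs> <sharedavail>\n"
)

# Invented sample rows for illustration only.
BEFORE = HEADER + (
    "dentry       100000 120015 192 21 1 : tunables 0 0 0 : slabdata 5715 5715 0\n"
    "kmalloc-256    3000   3200 256 16 1 : tunables 0 0 0 : slabdata  200  200 0\n"
)
AFTER = HEADER + (
    "dentry       101000 121800 192 21 1 : tunables 0 0 0 : slabdata  5800  5800 0\n"
    "kmalloc-256  803000 803200 256 16 1 : tunables 0 0 0 : slabdata 50200 50200 0\n"
)

if __name__ == "__main__":
    for name, delta in top_growers(BEFORE, AFTER):
        print(f"{name:16s} {delta / (1024 * 1024):+10.1f} MiB")
```

Pasting a ranked delta like this alongside the raw snapshots makes it easier to see which cache to attribute in step 4. Reading `/proc/slabinfo` normally requires root.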
