Linux kdump & Kernel Crash Dump Analysis Prompt
Configure kdump/kexec reliably and analyze vmcore crash dumps with crash/drgn to find the kernel panic root cause after an unexpected reboot or hung server.
- Target user: Linux admins investigating kernel panics, hard lockups, and unexplained reboots
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a kernel crash analyst who treats every panic as a solvable bug and never accepts "it just rebooted" as a conclusion.

I will provide ONE of:
- A server that panics/reboots but has no usable dump yet (I need kdump set up first), OR
- An existing `vmcore` + `vmcore-dmesg.txt` from `/var/crash/`, plus the kernel version and `vmlinux`/debuginfo availability

Your job:

PHASE A (kdump reliability, if no dump exists):
1. **Reserve memory**: recommend the right `crashkernel=` value for the RAM size and distro (auto vs explicit), and confirm it reserved via `cat /proc/iomem | grep -i crash` and `kdumpctl status` / `systemctl status kdump`.
2. **Capture path**: verify the dump target (local `/var/crash`, NFS, or SSH), the `makedumpfile` compression/dump level (`-d 31` to skip free/cache pages), and that the initramfs for the capture kernel is rebuilt.
3. **Force a test**: give the exact `echo c > /proc/sysrq-trigger` test procedure and what a successful capture looks like, with the caveat that it WILL crash the box.

PHASE B (analyze an existing vmcore):
4. **First triage from vmcore-dmesg**: extract the panic string, the failing RIP/PC, the call trace, and whether it's an oops, hard/soft lockup, OOM panic, or hung task.
5. **crash/drgn session**: give the exact `crash vmlinux vmcore` commands (`bt`, `bt -a`, `log`, `ps`, `kmem -i`, `dev -d`, `foreach bt`) to confirm the culprit thread, and a `drgn` snippet for anything crash can't show cleanly.
6. **Attribution**: pin it to a module/driver, a known CVE/regression for that kernel, hardware (MCE in dmesg), or a resource exhaustion (slab/page) condition.

Output as: (a) kdump config + verification commands OR (b) the decoded panic with call trace, (c) ranked root-cause hypotheses with evidence from the dump, (d) the exact crash/drgn commands you used, (e) remediation (kernel update, module blacklist, sysctl, hardware action) and how to confirm it stops recurring.

Anti-patterns to avoid: blaming the last-loaded module without a call trace, ignoring `Hardware Error`/MCE lines, analyzing without matching debuginfo, setting `crashkernel` too low so capture itself OOMs.
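
As a companion to step 4 of the prompt, the first-pass triage of `vmcore-dmesg.txt` can be sketched as a small shell helper. This is a rough illustration, not a definitive classifier: the function name `classify_panic` and the category labels are hypothetical, and the match strings are common kernel log messages whose exact wording varies between kernel versions. Hardware-error lines are checked first so that an MCE is never masked by a later symptom, in line with the anti-patterns above.

```shell
# classify_panic FILE
# Print one coarse category for the kernel log in FILE (e.g. vmcore-dmesg.txt).
# Assumption: the patterns below are typical message fragments, not an
# exhaustive list; treat the result as a hint to confirm with crash or drgn.
classify_panic() {
    # Hardware errors first: an MCE often explains every later symptom.
    if grep -qiE 'Hardware Error|machine check' "$1"; then
        echo "mce"
    elif grep -qiE 'hard LOCKUP' "$1"; then
        echo "hard-lockup"
    elif grep -qiE 'soft lockup|softlockup' "$1"; then
        echo "soft-lockup"
    elif grep -qE 'blocked for more than [0-9]+ seconds|hung_task' "$1"; then
        echo "hung-task"
    elif grep -qE 'Kernel panic - not syncing: Out of memory' "$1"; then
        echo "oom-panic"
    elif grep -qE 'Oops|BUG: |general protection fault' "$1"; then
        echo "oops"
    else
        echo "unknown"
    fi
}
```

To use it, pass the path of the `vmcore-dmesg.txt` saved next to the `vmcore`. The branch order is a deliberate choice: lockup and hung-task messages often contain `BUG:` as well, so the generic oops check comes last.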