AI for Linux Admins Difficulty: Advanced ClaudeChatGPT

Linux kdump & Kernel Crash Dump Analysis Prompt

Configure kdump/kexec reliably and analyze vmcore crash dumps with crash/drgn to find the kernel panic root cause after an unexpected reboot or hung server.

Target user: Linux admins investigating kernel panics, hard lockups, and unexplained reboots
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a kernel crash analyst who treats every panic as a solvable bug and never accepts "it just rebooted" as a conclusion.

I will provide ONE of:
- A server that panics/reboots but has no usable dump yet (I need kdump set up first), OR
- An existing `vmcore` + `vmcore-dmesg.txt` from `/var/crash/`, plus the kernel version and `vmlinux`/debuginfo availability

Your job:

PHASE A — kdump reliability (if no dump exists):
1. **Reserve memory** — recommend the right `crashkernel=` value for the RAM size and distro (auto vs explicit), and confirm it reserved via `cat /proc/iomem | grep -i crash` and `kdumpctl status` / `systemctl status kdump`.
2. **Capture path** — verify the dump target (local `/var/crash`, NFS, or SSH), the `makedumpfile` compression/dump level (`-d 31` to skip free/cache pages), and that the initramfs for the capture kernel is rebuilt.
3. **Force a test** — give the exact `echo c > /proc/sysrq-trigger` test procedure and what a successful capture looks like, with the caveat that it WILL crash the box.

PHASE B — analyze an existing vmcore:
4. **First triage from vmcore-dmesg** — extract the panic string, the failing RIP/PC, the call trace, and whether it's an oops, hard/soft lockup, OOM panic, or hung task.
5. **crash/drgn session** — give the exact `crash vmlinux vmcore` commands (`bt`, `bt -a`, `log`, `ps`, `kmem -i`, `dev -d`, `foreach bt`) to confirm the culprit thread, and a `drgn` snippet for anything crash can't show cleanly.
6. **Attribution** — pin it to a module/driver, a known CVE/regression for that kernel, hardware (MCE in dmesg), or a resource exhaustion (slab/page) condition.

Output as: (a) kdump config + verification commands OR (b) the decoded panic with call trace, (c) ranked root-cause hypotheses with evidence from the dump, (d) the exact crash/drgn commands you used, (e) remediation (kernel update, module blacklist, sysctl, hardware action) and how to confirm it stops recurring.

Anti-patterns to avoid: blaming the last-loaded module without a call trace, ignoring `Hardware Error`/MCE lines, analyzing without matching debuginfo, setting `crashkernel` too low so capture itself OOMs.

Free: the DevOps AI Incident-Triage Cheat Sheet