
Linux dmesg & Kernel Ring Buffer Triage Prompt

Decode dmesg / kernel ring buffer output to triage hardware errors, driver warnings, OOM events, segfaults, and link flaps — separating benign noise from real failures.

Target user: Linux admins making sense of noisy dmesg output on production servers
Difficulty: Beginner
Tools: Claude, ChatGPT

The prompt

You are a Linux triage engineer who can read a wall of `dmesg` and instantly separate "ignore this" from "page someone now."

I will provide:
- Output of `dmesg -T` (or `journalctl -k -b`), ideally with timestamps
- Optional: `dmesg --level=err,warn`, the symptom that prompted me to look, and the hardware (bare metal / cloud / VM)

Your job:

1. **Bucket every notable line** into: HARDWARE (MCE, EDAC ECC, disk/ATA errors, PCIe AER), DRIVER/MODULE (taint, firmware load failures), MEMORY (OOM-killer invoked, page allocation failures, slab warnings), NETWORK (NIC link up/down, carrier loss, ring/offload resets), SECURITY/FS (segfaults, EXT4/XFS errors, audit denials), and BENIGN BOOT NOISE.

2. **Rank by severity** — for each non-benign bucket, state whether it indicates imminent failure (uncorrectable ECC, repeated ATA resets → dying disk), degradation (correctable ECC climbing, link flaps), or a transient one-off.

3. **Correlate timestamps** — line up the errors with the reported symptom and with each other (e.g. NIC reset immediately before an app timeout, ATA error right before an fs remount-read-only).

4. **Decode the cryptic ones** — translate messages such as `Machine check`, `blk_update_request: I/O error` (newer kernels log the same event as plain `I/O error, dev ...`), `nvme: I/O ... QID`, `TX timeout`, and `Out of memory: Killed process` into plain admin language: which subsystem emitted it, what it usually means, and what to check next.

5. **Next commands** — for the top issue, give the precise follow-up command (`smartctl -a`, `mcelog` or `ras-mc-ctl --errors` on hosts that run rasdaemon instead, `edac-util`, `ethtool -S`, `dmesg --follow`) to confirm the diagnosis before acting.

Output as: (a) bucketed table (timestamp, line, bucket, severity, meaning), (b) the single most urgent finding with evidence, (c) ranked action list, (d) which lines are safe to ignore and why.

Anti-patterns to avoid: treating every WARNING as critical; ignoring repeated or rate-limited messages (`__ratelimit: N callbacks suppressed`), which hide how often an error really fires; missing that a kernel taint flag points at a proprietary module; assuming a single segfault means failing hardware.
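
Before pasting a very long capture into the prompt above, it can help to pre-sort it locally. The Python sketch below is one possible way to do that, not part of the prompt itself: it tags each line with one of the bucket names from step 1 and counts identical messages, so repeats show up as a count rather than a wall of text. The regular expressions, the function names `bucket_line` and `summarize`, and the sample lines are all illustrative assumptions; real kernel messages vary by driver and kernel version, so expect to tune the patterns.

```python
import re
from collections import Counter

# Hypothetical patterns: a small illustrative subset, NOT an exhaustive list.
# Order matters: the first bucket whose pattern matches wins.
BUCKET_PATTERNS = [
    ("HARDWARE", re.compile(
        r"machine check|mce:|EDAC|I/O error|ata\d+.*(error|reset)|AER:", re.I)),
    ("MEMORY", re.compile(
        r"out of memory|oom-killer|page allocation failure", re.I)),
    ("NETWORK", re.compile(r"link is (up|down)|tx timeout|carrier", re.I)),
    ("SECURITY/FS", re.compile(
        r"segfault|EXT4-fs error|XFS.*error|avc:\s+denied|apparmor=\"DENIED\"", re.I)),
    ("DRIVER/MODULE", re.compile(r"taint|firmware.*fail", re.I)),
]

# Strips a leading bracketed timestamp such as
# "[Mon Jan  1 10:00:01 2024] " or "[  123.456789] ".
TIMESTAMP = re.compile(r"^\[[^\]]*\]\s*")


def bucket_line(line):
    """Return the first matching bucket name, or None if nothing matches."""
    for name, pattern in BUCKET_PATTERNS:
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count identical messages (timestamps stripped) per bucket.

    Lines that match no pattern are dropped here; they are not proven
    benign, only unrecognized by this small pattern set.
    """
    counts = Counter()
    for line in lines:
        message = TIMESTAMP.sub("", line.strip())
        bucket = bucket_line(message)
        if bucket is not None:
            counts[(bucket, message)] += 1
    return counts


# Made-up sample lines in `dmesg -T` style, for illustration only.
sample = [
    "[Mon Jan  1 10:00:01 2024] Out of memory: Killed process 1234 (java)",
    "[Mon Jan  1 10:00:05 2024] eth0: NIC Link is Down",
    "[Mon Jan  1 10:00:09 2024] eth0: NIC Link is Down",
    "[Mon Jan  1 10:00:10 2024] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

for (bucket, message), count in summarize(sample).most_common():
    print(f"{count:>4}x  {bucket:<14} {message}")
```

Treat the result as a reading aid only. Because unmatched lines are dropped from the summary, still paste the full capture (or at least the time window around the symptom) so the model can judge what is genuinely benign.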
