Linux High Load & CPU Runaway Investigation Prompt
Diagnose high load average, CPU saturation, run-queue pressure, IRQ storms, and steal time on Linux servers — distinguish user CPU vs system CPU vs I/O wait vs steal.
- Target user: Linux sysadmins, SREs, and on-call engineers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux performance engineer with deep experience reading `top`, `vmstat`, `pidstat`, `mpstat`, and `perf` output on busy production servers. You understand the difference between "load average" and "CPU utilization" and you can spot the real bottleneck across user, system, iowait, steal, and softirq time.

I will provide:
- The symptom (high load alert, slow app response, OOM-precursor warnings, latency spikes)
- System type (VM/cloud or bare metal, vCPU count, distro)
- `uptime` and `top -bn1 | head -30`
- `vmstat 1 5` and `mpstat -P ALL 1 3`
- `pidstat -u 1 3` (top CPU consumers)
- Optional: `pidstat -d 1 3` (I/O per process), `pidstat -w 1 3` (context switches)
- Application logs that show the symptom in user-visible terms

Your job:
1. **Decompose the load number** into its components: load avg over 1/5/15m, run-queue depth (`vmstat r`), blocked tasks (`vmstat b`), uninterruptible sleeps.
2. **Classify the saturation type**:
   - **High `%us` user CPU** → application code; profile with `perf top` or language profilers
   - **High `%sy` system CPU** → kernel work; syscall storm, network softirq, or filesystem
   - **High `%wa` iowait** → block I/O; this is NOT CPU saturation but inflates load avg
   - **High `%st` steal** → noisy-neighbor on the hypervisor (cloud); not your fault
   - **High `%si` softirq** → network packet processing; check `/proc/interrupts` for IRQ balance
   - **`b` column nonzero in vmstat** → tasks blocked on D-state I/O, not CPU
3. **Identify the top culprit processes**, but explain WHY they're hot. A process at 100% CPU might be spinning, GC-thrashing, or just legitimately doing work.
4. **Differentiate "real" vs "phantom" load**: load avg counts D-state (uninterruptible sleep) tasks; a load of 30 with `%us+%sy < 50%` is almost always I/O blocked, not CPU.
5. **Suggest the minimum next commands** to confirm: `perf top`, `iostat -xz 1`, `dmesg | tail -50`, `cat /proc/interrupts`, `/proc/<pid>/stack`.
6. **Mark DANGEROUS actions** explicitly: killing the wrong process, changing CPU governor live, disabling CPU mitigations, resetting IRQ affinity on a live server.

Common failure classes to surface:
- GC pause / JIT warm-up storms (Java, .NET)
- Python GIL contention masquerading as "single-threaded CPU saturation"
- Kernel softirq from NIC IRQ pinned to CPU 0 → "one CPU at 100% softirq, rest idle"
- Steal time on cloud VMs → hypervisor neighbor noise
- I/O wait inflating load avg without CPU saturation
- Spin-loops in misconfigured connection pools or retry loops
- Context-switch storms from over-threaded apps

---

System type: [VM (provider/instance type) / bare metal]
vCPU count: [N]
Distro + kernel: [e.g., RHEL 9, 5.14...]
Symptom: [DESCRIBE — load value, latency, user complaints]

`uptime`:
```
[PASTE]
```

`top -bn1 | head -30`:
```
[PASTE]
```

`vmstat 1 5`:
```
[PASTE]
```

`mpstat -P ALL 1 3`:
```
[PASTE]
```

`pidstat -u 1 3`:
```
[PASTE]
```

App-side symptom (latency, error rate): [DESCRIBE]
Why this prompt works
“Load is high” is the most ambiguous Linux alert. Load average includes uninterruptible-sleep tasks (I/O blocked), so a database doing a big sync can show load=40 with CPUs idle. Junior engineers see load=40 and recommend “scale up CPUs.” This prompt forces the model to walk through %us / %sy / %wa / %st / %si and the runqueue vs blocked-task counts to find the actual saturation.
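The gap between load average and CPU saturation can be checked on a host before pasting anything into the prompt. The following is a minimal sketch, not part of the original prompt: `load_per_cpu` and `count_states` are hypothetical helper names, and the live usage assumes a Linux system with procps `ps` and `/proc/loadavg`.

```bash
#!/usr/bin/env bash
# Sketch: put the load average next to the CPU count and the task states.
# Helper names are illustrative assumptions, not standard tools.

# load_per_cpu LOAD NCPU -> load divided by CPU count, two decimals.
load_per_cpu() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

# count_states: reads one task state per line on stdin and prints how many
# tasks are runnable (R) and how many are in uninterruptible sleep (D).
count_states() {
  awk '{ s = substr($1, 1, 1); if (s == "R") r++; else if (s == "D") d++ }
       END { printf "R=%d D=%d\n", r, d }'
}

# Live usage (Linux only). -L lists threads, since load average counts tasks;
# the R count includes the ps process itself.
#   read -r load1 _ < /proc/loadavg
#   load_per_cpu "$load1" "$(nproc)"
#   ps -eLo state= | count_states
```

A high load-per-CPU figure together with a large `D` count and a small `R` count points toward blocked I/O rather than a shortage of CPUs, which is the case the paragraph above describes.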
How to use it
- Capture data during the incident if possible. Post-incident `top` won't show the hot path.
- `vmstat 1 5` is the single most information-dense snapshot: it shows runqueue depth, blocked tasks, swap activity, interrupts, context switches, and CPU breakdown in one place.
- `mpstat -P ALL` is crucial for NIC IRQ storms: if one core is at 100% softirq and the rest are idle, you've found it.
- Always include the application-side symptom. "Latency p99 went from 50ms to 2s" tells the model what success looks like; "load is 40" doesn't.
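Because the data has to be captured while the incident is still happening, it can help to have the collection scripted in advance. The sketch below is one possible way to do that, under stated assumptions: `save_cmd` and `capture_all` are hypothetical names, the output directory is an arbitrary choice, and the commands run one after another (so the samples are not simultaneous).

```bash
#!/usr/bin/env bash
# Sketch: save the outputs the prompt asks for into one directory.

# save_cmd OUTDIR NAME CMD [ARGS...]
# Runs CMD and stores stdout+stderr in OUTDIR/NAME.txt. A missing tool is
# noted in the file instead of aborting the rest of the capture.
save_cmd() {
  local outdir=$1 name=$2
  shift 2
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$outdir/$name.txt" 2>&1 || true   # keep going on non-zero exit
  else
    echo "not installed: $1" >"$outdir/$name.txt"
  fi
}

# capture_all OUTDIR: collects the inputs listed in the prompt, in order.
capture_all() {
  local outdir=$1
  mkdir -p "$outdir"
  save_cmd "$outdir" uptime uptime
  save_cmd "$outdir" top top -bn1            # trim with head -30 before pasting
  save_cmd "$outdir" vmstat vmstat 1 5
  save_cmd "$outdir" mpstat mpstat -P ALL 1 3
  save_cmd "$outdir" pidstat-u pidstat -u 1 3
  save_cmd "$outdir" pidstat-d pidstat -d 1 3
  save_cmd "$outdir" pidstat-w pidstat -w 1 3
}

# Usage (the path is an assumption, pick any writable location):
#   capture_all "/tmp/load-incident-$(date +%Y%m%dT%H%M%S)"
```

The full run takes roughly the sum of the sampling windows (about 17 seconds with the intervals above), so start it as soon as the alert fires.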
Useful commands
# Triage trio (run together)
uptime
vmstat 1 5
mpstat -P ALL 1 3
# Per-process CPU, I/O, context switches
pidstat -u 1 3
pidstat -d 1 3
pidstat -w 1 3
top -Hp <pid> # threads of one process
# IRQ / softirq distribution
cat /proc/interrupts | head -30
cat /proc/softirqs | head -20
mpstat -I SCPU 1 3
# Where is a process sleeping?
sudo cat /proc/<pid>/stack
sudo cat /proc/<pid>/status | grep State
# What syscalls is it making?
sudo strace -c -p <pid> # careful, adds overhead
# CPU profile (root, brief)
sudo perf top -F 99
sudo perf record -F 99 -p <pid> -g -- sleep 10 && sudo perf report
# Steal time over time (cloud)
sar -P ALL 1 10
# IRQ affinity (check before touching)
cat /proc/irq/<n>/smp_affinity_list
# CPU governor (don't change live without confirming)
cpupower frequency-info | grep "current policy"
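Reading `/proc/interrupts` by eye gets hard on machines with many cores. As a rough aid, the sketch below sums the per-CPU columns so an imbalance stands out. `irq_totals_per_cpu` is a hypothetical helper name, and the counters are cumulative since boot, so two snapshots taken a few seconds apart are more informative than one.

```bash
#!/usr/bin/env bash
# Sketch: sum interrupt counts per CPU from /proc/interrupts-style input.

# irq_totals_per_cpu: reads /proc/interrupts on stdin, prints "CPUn total".
irq_totals_per_cpu() {
  awk '
    # First line holds the CPU names; remember them in column order.
    NR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    # Rows with a full set of per-CPU counters: label, then one count per CPU.
    NF > ncpu {
      for (i = 1; i <= ncpu; i++)
        if ($(i + 1) ~ /^[0-9]+$/) total[i] += $(i + 1)
    }
    END { for (i = 1; i <= ncpu; i++) printf "%s %.0f\n", name[i], total[i] }
  '
}

# Live usage (Linux only):
#   irq_totals_per_cpu < /proc/interrupts
```

If one CPU carries most of the total and the delta between two snapshots shows the same skew, that matches the "NIC IRQ pinned to one core" pattern; check the affinity of the busiest IRQ before changing anything.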
Common findings this catches
- Load=20, %us+%sy=10%, `b` column = 8 → I/O bound, not CPU. Look at `iostat -xz 1` instead.
- Load=4 on a 4-core box, one core at 100% `%si` → NIC IRQ pinned to CPU 0; rebalance with `irqbalance` or set affinity.
- `%st` consistently > 10% on a cloud VM → noisy-neighbor; contact provider, move instance class, or migrate.
- Top process at 100% CPU is JVM, threads view shows GC threads → memory pressure manifesting as CPU; check `-Xmx` and GC logs.
- Context switches > 50k/sec → over-threaded app or lock contention; reduce thread pool size.
- Python process 100% CPU but slow throughput → GIL contention; investigate with `py-spy` or move CPU-bound work to subprocess.
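Several of the findings above can be pre-screened from `vmstat` output alone. The following sketch averages the samples and prints hints using the thresholds quoted in this document (`%us+%sy` below 50% with blocked tasks, steal above 10%, context switches above 50k/s). `summarize_vmstat` is a hypothetical helper name, and the cutoffs are rules of thumb, not hard limits.

```bash
#!/usr/bin/env bash
# summarize_vmstat: reads `vmstat 1 N` output on stdin, averages every sample
# except the first (the first line is an average since boot), and prints
# hints based on the thresholds quoted in this document.
summarize_vmstat() {
  awk '
    # Column header line: remember each column position by name.
    /^ *r +b / { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data lines start with a number (the r column).
    ("us" in col) && $1 ~ /^[0-9]+$/ {
      if (++seen == 1) next
      n++
      r += $(col["r"]); b += $(col["b"]); cs += $(col["cs"])
      us += $(col["us"]); sy += $(col["sy"]); wa += $(col["wa"]); st += $(col["st"])
    }
    END {
      if (n == 0) { print "no samples after the since-boot line"; exit 1 }
      busy = (us + sy) / n
      printf "r=%.1f b=%.1f us+sy=%.0f%% wa=%.0f%% st=%.0f%% cs=%.0f/s\n", r / n, b / n, busy, wa / n, st / n, cs / n
      if (b / n >= 1 && busy < 50) print "hint: blocked tasks with CPU headroom, check I/O (iostat -xz 1)"
      if (st / n > 10) print "hint: steal above 10%, likely hypervisor contention"
      if (cs / n > 50000) print "hint: context switches above 50k/s"
    }
  '
}

# Live usage:
#   vmstat 1 5 | summarize_vmstat
```

Columns are located by header name rather than by position, so the helper should tolerate `vmstat` builds that append extra CPU columns. The hints only narrow the search; they do not replace reading the per-process data.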
Differential cheatsheet
| Symptom | Most likely | Confirm with |
|---|---|---|
| High load, low %us+%sy | I/O blocked tasks (D state) | vmstat b, iostat -xz 1, /proc/<pid>/stack |
| High %us, one process dominant | App CPU bound | perf top -p <pid> or language profiler |
| High %sy across all cores | Syscall storm, FS, or network kernel work | perf top (system-wide), strace -c |
| High %wa | Storage saturation | iostat -xz 1, iotop |
| High %st | Hypervisor neighbor | Provider metrics; migrate |
| High %si on one core | NIC IRQ pinning | /proc/interrupts, IRQ affinity |
| Many short-lived procs | fork storm, possibly cron, or shell loop | pidstat -p ALL -l 1 3 |
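The "High %si on one core" row can be screened from `mpstat -P ALL` output. The sketch below reads the `Average:` lines and lists cores whose `%soft` is at or above a cutoff. `hot_softirq_cores` is a hypothetical helper name and the default cutoff of 80 is an assumption chosen for illustration; the document itself only describes the extreme case of one core near 100%.

```bash
#!/usr/bin/env bash
# hot_softirq_cores [MIN]: reads `mpstat -P ALL ...` output on stdin and
# prints each core whose average %soft is >= MIN (default 80, an assumption).
hot_softirq_cores() {
  awk -v min="${1:-80}" '
    # Header of the Average block: map column names to positions.
    $1 == "Average:" && $2 == "CPU" { for (i = 3; i <= NF; i++) col[$i] = i; next }
    # Per-core rows only (skip the "all" summary row).
    $1 == "Average:" && $2 ~ /^[0-9]+$/ && ("%soft" in col) {
      v = $(col["%soft"]) + 0
      if (v >= min) printf "CPU%s %%soft=%.1f\n", $2, v
    }
  '
}

# Live usage (LC_ALL=C keeps the decimal separator a dot):
#   LC_ALL=C mpstat -P ALL 1 3 | hot_softirq_cores 80
```

A single core in the output while the others stay near idle is the pattern to confirm against `/proc/interrupts` and the IRQ affinity settings.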
When to escalate
- Anything that looks like a hardware issue (steady CPU throttling on bare metal → check `dmesg` for thermal events): get hands-on to the box.
- A consistent steal time pattern on cloud → talk to provider; tuning won't help.
- Kernel CPU at high `%sy` with no obvious userspace cause: likely a kernel bug or driver issue; pull in kernel folks.