You are a senior Linux performance engineer with deep experience reading `top`, `vmstat`, `pidstat`, `mpstat`, and `perf` output on busy production servers. You understand the difference between "load average" and "CPU utilization" and you can spot the real bottleneck across user, system, iowait, steal, and softirq time.

I will provide:
- The symptom (high load alert, slow app response, OOM-precursor warnings, latency spikes)
- System type (VM/cloud or bare metal, vCPU count, distro)
- `uptime` and `top -bn1 | head -30`
- `vmstat 1 5` and `mpstat -P ALL 1 3`
- `pidstat -u 1 3` (top CPU consumers)
- Optional: `pidstat -d 1 3` (I/O per process), `pidstat -w 1 3` (context switches)
- Application logs that show the symptom in user-visible terms

Your job:
1. **Decompose the load number** into its components: load avg over 1/5/15m, run-queue depth (`vmstat r`), blocked tasks (`vmstat b`), uninterruptible sleeps.
2. **Classify the saturation type**:
   - **High `%us` user CPU** → application code; profile with `perf top` or language profilers
   - **High `%sy` system CPU** → kernel work; syscall storm, network softirq, or filesystem
   - **High `%wa` iowait** → block I/O; this is NOT CPU saturation but inflates load avg
   - **High `%st` steal** → noisy-neighbor on the hypervisor (cloud); not your fault
   - **High `%si` softirq** → network packet processing; check `/proc/interrupts` for IRQ balance
   - **`b` column nonzero in vmstat** → tasks blocked on D-state I/O, not CPU
3. **Identify the top culprit processes**, but explain WHY they're hot. A process at 100% CPU might be spinning, GC-thrashing, or just legitimately doing work.
4. **Differentiate "real" vs "phantom" load**: load avg counts D-state (uninterruptible sleep) tasks; a load of 30 with `%us+%sy < 50%` is almost always I/O blocked, not CPU.
5. **Suggest the minimum next commands** to confirm: `perf top`, `iostat -xz 1`, `dmesg | tail -50`, `cat /proc/interrupts`, `/proc/<pid>/stack`.
6. **Mark DANGEROUS actions** explicitly: killing the wrong process, changing CPU governor live, disabling CPU mitigations, resetting IRQ affinity on a live server.

Common failure classes to surface:
- GC pause / JIT warm-up storms (Java, .NET)
- Python GIL contention masquerading as "single-threaded CPU saturation"
- Kernel softirq from NIC IRQ pinned to CPU 0 → "one CPU at 100% softirq, rest idle"
- Steal time on cloud VMs → hypervisor neighbor noise
- I/O wait inflating load avg without CPU saturation
- Spin-loops in misconfigured connection pools or retry loops
- Context-switch storms from over-threaded apps

---

System type: [VM (provider/instance type) / bare metal]
vCPU count: [N]
Distro + kernel: [e.g., RHEL 9, 5.14...]
Symptom: [DESCRIBE — load value, latency, user complaints]

`uptime`:
```
[PASTE]
```

`top -bn1 | head -30`:
```
[PASTE]
```

`vmstat 1 5`:
```
[PASTE]
```

`mpstat -P ALL 1 3`:
```
[PASTE]
```

`pidstat -u 1 3`:
```
[PASTE]
```

App-side symptom (latency, error rate): [DESCRIBE]

Why this prompt works

“Load is high” is the most ambiguous Linux alert. Load average includes uninterruptible-sleep tasks (I/O blocked), so a database doing a big sync can show load=40 with CPUs idle. Junior engineers see load=40 and recommend “scale up CPUs.” This prompt forces the model to walk through %us / %sy / %wa / %st / %si and the runqueue vs blocked-task counts to find the actual saturation.
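To make the "load counts blocked tasks too" point concrete, here is a minimal sketch that splits the task list into runnable (R) and uninterruptible-sleep (D) tasks, the two states that feed the Linux load average. The function name `count_task_states` is hypothetical; it only assumes input in the form produced by `ps -eo state=`.

```shell
# Hypothetical helper: count tasks by state from `ps -eo state=` output on stdin.
# R = runnable, D = uninterruptible sleep; everything else is lumped into "other".
count_task_states() {
  awk '
    { s = substr($1, 1, 1) }          # first character is the state code
    s == "R" { r++; next }
    s == "D" { d++; next }
    s != ""  { o++ }
    END { printf "runnable=%d uninterruptible=%d other=%d\n", r, d, o }
  '
}
```

Typical use would be `ps -eo state= | count_task_states`. A large `uninterruptible` count alongside a small `runnable` count points at blocked I/O rather than CPU demand. The snapshot is instantaneous, so it is worth repeating a few times.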

How to use it

Capture data during the incident if possible. Post-incident top won’t show the hot path.
vmstat 1 5 is the single most information-dense snapshot — it shows runqueue depth, blocked tasks, swap activity, interrupts, context switches, and CPU breakdown in one place.
mpstat -P ALL is crucial for NIC IRQ storms — if one core is at 100% softirq and the rest are idle, you’ve found it.
Always include the application-side symptom. “Latency p99 went from 50ms to 2s” tells the model what success looks like; “load is 40” doesn’t.
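Since the data is most useful when captured during the incident, a small capture helper can save time. The sketch below is one possible shape, not a standard tool: `capture_triage` and `_triage_run` are hypothetical names, the output file names are arbitrary, and it uses one sample count for all samplers rather than the separate 5 and 3 used above. The sampling commands run in parallel so they cover the same time window.

```shell
# Hypothetical helper: run one command and save its output, or note that it is missing.
_triage_run() {
  _out=$1; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$_out" 2>&1
  else
    echo "$1: not installed" >"$_out"
  fi
  return 0
}

# capture_triage OUTDIR [INTERVAL] [COUNT]
# Saves uptime, top, vmstat, mpstat, and pidstat output under OUTDIR.
capture_triage() {
  _dir=$1; _int=${2:-1}; _cnt=${3:-5}
  mkdir -p "$_dir" || return 1
  date >"$_dir/timestamp.txt"
  _triage_run "$_dir/uptime.txt" uptime
  _triage_run "$_dir/top.txt" top -bn1
  # Samplers run in the background so they observe the same interval.
  _triage_run "$_dir/vmstat.txt"  vmstat "$_int" "$_cnt" &
  _triage_run "$_dir/mpstat.txt"  mpstat -P ALL "$_int" "$_cnt" &
  _triage_run "$_dir/pidstat.txt" pidstat -u "$_int" "$_cnt" &
  wait
}
```

For example, `capture_triage "/tmp/triage-$(date +%s)" 1 5` would leave one text file per tool, ready to paste into the prompt.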

Useful commands

# Triage trio (run together)
uptime
vmstat 1 5
mpstat -P ALL 1 3

# Per-process CPU, I/O, context switches
pidstat -u 1 3
pidstat -d 1 3
pidstat -w 1 3
top -Hp <pid>     # threads of one process

# IRQ / softirq distribution
cat /proc/interrupts | head -30
cat /proc/softirqs | head -20
mpstat -I SCPU 1 3

# Where is a process sleeping?
sudo cat /proc/<pid>/stack
sudo cat /proc/<pid>/status | grep State

# What syscalls is it making?
sudo strace -c -p <pid>   # careful, adds overhead

# CPU profile (root, brief)
sudo perf top -F 99
sudo perf record -F 99 -p <pid> -g -- sleep 10 && sudo perf report

# Steal time over time (cloud)
sar -P ALL 1 10

# IRQ affinity (check before touching)
cat /proc/irq/<n>/smp_affinity_list

# CPU governor (don't change live without confirming)
cpupower frequency-info | grep -A2 "current policy"   # the governor is named on the line after "current policy"
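
The `vmstat` output is easier to compare across incidents when reduced to a few averages. The following is a rough sketch, assuming the usual procps layout with a header row starting `r  b`. The name `vmstat_summary` is hypothetical. It skips the first sample, which reports averages since boot, and locates columns by header name because the set of CPU columns varies between versions. Note that the `si` column in `vmstat` is swap-in, not softirq.

```shell
# Hypothetical helper: average run queue, blocked tasks, and CPU split
# from `vmstat N M` output on stdin, ignoring the since-boot first sample.
vmstat_summary() {
  awk '
    $1 == "r" && $2 == "b" {                 # header row: map names to columns
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $1 ~ /^[0-9]+$/ && ("us" in col) {
      if (++rows == 1) next                  # first sample is the since-boot average
      n++
      r  += $(col["r"]);  b  += $(col["b"])
      us += $(col["us"]); sy += $(col["sy"]); wa += $(col["wa"])
      if ("st" in col) st += $(col["st"])
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "r=%.1f b=%.1f us=%.1f sy=%.1f wa=%.1f st=%.1f\n",
             r/n, b/n, us/n, sy/n, wa/n, st/n
    }
  '
}
```

Usage would look like `vmstat 1 5 | vmstat_summary`. A result with high `b` and `wa` but modest `us` and `sy` matches the I/O-blocked pattern described earlier.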

Common findings this catches

Load=20, %us+%sy=10%, b column = 8 → I/O bound, not CPU. Look at iostat -xz 1 instead.
Load=4 on a 4-core box, one core at 100% %si → NIC IRQ pinned to CPU 0; rebalance with irqbalance or set affinity.
%st consistently > 10% on a cloud VM → noisy-neighbor; contact provider, move instance class, or migrate.
Top process at 100% CPU is JVM, threads view shows GC threads → memory pressure manifesting as CPU; check -Xmx and GC logs.
Context switches > 50k/sec → over-threaded app or lock contention; reduce thread pool size.
Python process 100% CPU but slow throughput → GIL contention; investigate with py-spy or move CPU-bound work to subprocess.
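
The "one core at 100% softirq" finding can be checked mechanically. Below is a minimal sketch that scans the `Average:` rows of `mpstat -P ALL` output and lists cores whose `%soft` is at or above a limit. The function name `hot_softirq_cores` and the default limit of 50 are assumptions for illustration, not values from any tool. It also assumes a locale that prints decimal points, so running `mpstat` under `LC_ALL=C` is the safer choice.

```shell
# Hypothetical helper: list CPUs whose average %soft is >= LIMIT (default 50),
# reading `mpstat -P ALL N M` output on stdin.
hot_softirq_cores() {
  awk -v limit="${1:-50}" '
    /%soft/ {                                # header row: locate CPU and %soft
      for (i = 1; i <= NF; i++) {
        if ($i == "CPU")   c = i
        if ($i == "%soft") s = i
      }
      next
    }
    $1 == "Average:" && c && $c != "all" && $s + 0 >= limit + 0 {
      printf "cpu%s soft=%s\n", $c, $s
    }
  '
}
```

For example, `LC_ALL=C mpstat -P ALL 1 3 | hot_softirq_cores 50` prints nothing on a balanced system and a single line such as `cpu0 soft=...` when one core carries the network interrupt load.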

Differential cheatsheet

Symptom	Most likely	Confirm with
High load, low %us+%sy	I/O blocked tasks (D state)	`vmstat b`, `iostat -xz 1`, `/proc/<pid>/stack`
High %us, one process dominant	App CPU bound	`perf top -p <pid>` or language profiler
High %sy across all cores	Syscall storm, FS, or network kernel work	`perf top` (system-wide), `strace -c`
High %wa	Storage saturation	`iostat -xz 1`, `iotop`
High %st	Hypervisor neighbor	Provider metrics; migrate
High %si on one core	NIC IRQ pinning	`/proc/interrupts`, IRQ affinity
Many short-lived procs	Fork storm (possibly cron or a shell loop)	`pidstat -p ALL -l 1 3`
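
The cheatsheet can be turned into a first-pass lookup. The sketch below takes the five CPU percentages and reports whichever is largest together with the matching row's confirmation step. The name `classify_cpu` is hypothetical, and "largest component wins" is a simplifying assumption: it says nothing useful when the machine is mostly idle, and it works on system-wide figures, so the per-core `%si` case still needs `mpstat -P ALL`.

```shell
# Hypothetical helper: classify_cpu US SY WA ST SI
# Prints the dominant CPU-time component and the cheatsheet's next step for it.
classify_cpu() {
  awk -v vals="$1 $2 $3 $4 $5" 'BEGIN {
    split("us sy wa st si", name, " ")
    split(vals, v, " ")
    hint["us"] = "app CPU bound; confirm with perf top -p <pid> or a language profiler"
    hint["sy"] = "syscall storm, FS, or network kernel work; confirm with perf top, strace -c"
    hint["wa"] = "storage saturation; confirm with iostat -xz 1, iotop"
    hint["st"] = "hypervisor neighbor; check provider metrics"
    hint["si"] = "softirq load; check mpstat -P ALL and /proc/interrupts"
    best = 1
    for (i = 2; i <= 5; i++) if (v[i] + 0 > v[best] + 0) best = i
    printf "%%%s=%s: %s\n", name[best], v[best], hint[name[best]]
  }'
}
```

For instance, `classify_cpu 5 5 50 0 0` reports `%wa` as dominant and suggests `iostat -xz 1`, which lines up with the "High %wa" row above.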

When to escalate

Anything that looks like a hardware issue, such as steady CPU throttling on bare metal (check dmesg for thermal events) → get hands on the box.
A consistent steal time pattern on cloud → talk to provider; tuning won’t help.
High %sy with no obvious userspace cause → likely a kernel bug or driver issue; pull in kernel folks.

Linux High Load & CPU Runaway Investigation Prompt
