AI for Linux Admins

Linux High Load & CPU Runaway Investigation Prompt

Diagnose high load average, CPU saturation, run-queue pressure, IRQ storms, and steal time on Linux servers — distinguish user CPU vs system CPU vs I/O wait vs steal.

Target user: Linux sysadmins, SREs, and on-call engineers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Linux performance engineer with deep experience reading `top`, `vmstat`, `pidstat`, `mpstat`, and `perf` output on busy production servers. You understand the difference between "load average" and "CPU utilization" and you can spot the real bottleneck across user, system, iowait, steal, and softirq time.

I will provide:
- The symptom (high load alert, slow app response, OOM-precursor warnings, latency spikes)
- System type (VM/cloud or bare metal, vCPU count, distro)
- `uptime` and `top -bn1 | head -30`
- `vmstat 1 5` and `mpstat -P ALL 1 3`
- `pidstat -u 1 3` (top CPU consumers)
- Optional: `pidstat -d 1 3` (I/O per process), `pidstat -w 1 3` (context switches)
- Application logs that show the symptom in user-visible terms

Your job:

1. **Decompose the load number** into its components — load avg over 1/5/15m, run-queue depth (`vmstat r`), blocked tasks (`vmstat b`), uninterruptible sleeps.
2. **Classify the saturation type**:
   - **High `%us` user CPU** → application code; profile with `perf top` or language profilers
   - **High `%sy` system CPU** → kernel work; syscall storm, network softirq, or filesystem
   - **High `%wa` iowait** → block I/O; this is NOT CPU saturation but inflates load avg
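   - **High `%st` steal** → the hypervisor is withholding CPU time (cloud): usually a noisy neighbor, sometimes a burstable instance that has run out of CPU credits; not fixable from inside the guest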
   - **High `%si` softirq** → network packet processing; check `/proc/interrupts` for IRQ balance
   - **`b` column nonzero in vmstat** → tasks blocked on D-state I/O, not CPU
3. **Identify the top culprit processes** — but explain WHY they're hot. A process at 100% CPU might be spinning, GC-thrashing, or just legitimately doing work.
4. **Differentiate "real" vs "phantom" load**: load avg counts D-state (uninterruptible sleep) tasks; a load of 30 with `%us+%sy < 50%` is almost always I/O blocked, not CPU.
5. **Suggest the minimum next commands** to confirm: `perf top`, `iostat -xz 1`, `dmesg | tail -50`, `cat /proc/interrupts`, `/proc/<pid>/stack`.
6. **Mark DANGEROUS actions** explicitly: killing the wrong process, changing the CPU governor live, disabling CPU mitigations, resetting IRQ affinity on a live server.

Common failure classes to surface:
- GC pause / JIT warm-up storms (Java, .NET)
- Python GIL contention masquerading as "single-threaded CPU saturation"
- Kernel softirq from NIC IRQ pinned to CPU 0 → "one CPU at 100% softirq, rest idle"
- Steal time on cloud VMs → hypervisor neighbor noise
- I/O wait inflating load avg without CPU saturation
- Spin-loops in misconfigured connection pools or retry loops
- Context-switch storms from over-threaded apps

---

System type: [VM (provider/instance type) / bare metal]
vCPU count: [N]
Distro + kernel: [e.g., RHEL 9, 5.14...]
Symptom: [DESCRIBE — load value, latency, user complaints]
`uptime`:
```
[PASTE]
```
`top -bn1 | head -30`:
```
[PASTE]
```
`vmstat 1 5`:
```
[PASTE]
```
`mpstat -P ALL 1 3`:
```
[PASTE]
```
`pidstat -u 1 3`:
```
[PASTE]
```
App-side symptom (latency, error rate):
[DESCRIBE]

Why this prompt works

“Load is high” is the most ambiguous Linux alert. On Linux, load average counts tasks in uninterruptible sleep (typically blocked on I/O) as well as runnable tasks, so a database doing a big sync can show load=40 with CPUs idle. Junior engineers see load=40 and recommend “scale up CPUs.” This prompt forces the model to walk through %us / %sy / %wa / %st / %si and the run-queue vs blocked-task counts to find the actual saturation.
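
The runnable-versus-blocked split can be checked directly from process states. The following is a minimal sketch, assuming a procps-style `ps`; the function name `count_load_states` is hypothetical and not part of the prompt.

```bash
# count_load_states: hypothetical helper for illustration.
# Reads one task state per line (as printed by `ps -eLo state=`) and counts
# runnable (R) versus uninterruptible-sleep (D) tasks. Both kinds contribute
# to the Linux load average; only R tasks are competing for CPU.
count_load_states() {
  awk '
    { s = substr($1, 1, 1) }   # first character is the state code
    s == "R" { r++ }
    s == "D" { d++ }
    END { printf "runnable=%d uninterruptible=%d\n", r, d }
  '
}
```

On a live host this could be run as `ps -eLo state= | count_load_states` (the `-L` flag lists threads, which is what load average counts). The `ps` process itself shows up as runnable, so expect at least one R task. A large D count next to idle CPUs points at blocked I/O rather than CPU saturation.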

How to use it

  1. Capture data during the incident if possible. Post-incident top won’t show the hot path.
  2. vmstat 1 5 is the single most information-dense snapshot: it shows run-queue depth, blocked tasks, swap activity, interrupts, context switches, and CPU breakdown in one place. Ignore the first row, which reports averages since boot rather than current activity.
  3. mpstat -P ALL is crucial for NIC IRQ storms — if one core is at 100% softirq and the rest are idle, you’ve found it.
  4. Always include the application-side symptom. “Latency p99 went from 50ms to 2s” tells the model what success looks like; “load is 40” doesn’t.
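
When reading a `vmstat 1 5` capture, it can help to average the interval samples and drop the first row, which holds since-boot averages. This is a rough sketch, assuming the standard procps header row (`r b swpd ... us sy id wa st`); the function name `vmstat_summary` is hypothetical. Note that `si` in vmstat means swap-in, not softirq.

```bash
# vmstat_summary: hypothetical helper for illustration.
# Reads `vmstat N M` output on stdin, skips the first sample (averages since
# boot), and prints the mean of r, b, us, sy, wa, st over the remaining rows.
vmstat_summary() {
  awk '
    $1 == "r" && $2 == "b" {                      # header row: map names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    !("r" in col) || $1 !~ /^[0-9]+$/ { next }    # banner or other non-data line
    ++seen == 1 { next }                          # first sample: since-boot averages
    {
      n++
      r  += $(col["r"]);  b  += $(col["b"])
      us += $(col["us"]); sy += $(col["sy"])
      wa += $(col["wa"]); st += $(col["st"])
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d r=%.1f b=%.1f us=%.1f sy=%.1f wa=%.1f st=%.1f\n",
             n, r/n, b/n, us/n, sy/n, wa/n, st/n
    }
  '
}
```

A possible use during an incident is `vmstat 1 5 | vmstat_summary`. Columns are located by header name rather than fixed position, so extra columns in newer vmstat versions should not shift the result.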

Useful commands

# Triage trio (run together)
uptime
vmstat 1 5
mpstat -P ALL 1 3

# Per-process CPU, I/O, context switches
pidstat -u 1 3
pidstat -d 1 3
pidstat -w 1 3
top -Hp <pid>     # threads of one process

# IRQ / softirq distribution
cat /proc/interrupts | head -30
cat /proc/softirqs | head -20
mpstat -I SCPU 1 3

# Where is a process sleeping?
sudo cat /proc/<pid>/stack
sudo cat /proc/<pid>/status | grep State

# What syscalls is it making?
sudo strace -c -p <pid>   # careful: adds overhead; Ctrl-C to detach and print the summary

# CPU profile (root, brief)
sudo perf top -F 99
sudo perf record -F 99 -p <pid> -g -- sleep 10 && sudo perf report

# Steal time over time (cloud)
sar -P ALL 1 10

# IRQ affinity (check before touching)
cat /proc/irq/<n>/smp_affinity_list

# CPU governor (don't change live without confirming)
cpupower frequency-info --policy   # prints min/max frequency and the active governor
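
To spot the "one core buried in softirq" pattern without reading every row of `mpstat -P ALL`, the output can be filtered. The sketch below assumes sysstat's usual header containing `CPU` and `%soft` columns; the function name `hot_softirq_cpus` and the default threshold of 50 are illustrative assumptions.

```bash
# hot_softirq_cpus: hypothetical helper for illustration.
# Reads `mpstat -P ALL ...` output on stdin and prints each CPU whose highest
# observed %soft value is at or above the threshold (first argument, default 50).
hot_softirq_cpus() {
  local threshold="${1:-50}"
  awk -v t="$threshold" '
    {
      c = 0; s = 0
      for (i = 1; i <= NF; i++) {          # detect a header row by its column names
        if ($i == "CPU")   c = i
        if ($i == "%soft") s = i
      }
      if (c && s) { cpu = c; soft = s; next }
    }
    cpu && $cpu ~ /^[0-9]+$/ {             # per-CPU data row; the "all" row is skipped
      id = $cpu + 0
      if (id > top) top = id
      if ($soft + 0 > max[id]) max[id] = $soft + 0
    }
    END {
      for (id = 0; id <= top; id++)
        if ((id in max) && max[id] >= t + 0)
          printf "cpu%d %%soft=%.1f\n", id, max[id]
    }
  '
}
```

For example, `mpstat -P ALL 1 3 | hot_softirq_cpus 50` would list only the cores spending at least half their time in softirq. Column positions are re-read at every header row, so the `Average:` section, which has a different timestamp layout, is handled the same way.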

Common findings this catches

  • Load=20, %us+%sy=10%, b column = 8 → I/O bound, not CPU. Look at iostat -xz 1 instead.
  • Load=4 on a 4-core box, one core at 100% %si → NIC IRQ pinned to CPU 0; rebalance with irqbalance or set affinity.
  • %st consistently > 10% on a cloud VM → noisy neighbor, or a burstable instance that has exhausted its CPU credits; contact provider, move instance class, or migrate.
  • Top process at 100% CPU is JVM, threads view shows GC threads → memory pressure manifesting as CPU; check -Xmx and GC logs.
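  • Context switches well above the host's normal baseline (for example, > 50k/sec on a small VM) → over-threaded app or lock contention; confirm per process with pidstat -w, then reduce thread pool size.
  • Python process at 100% CPU but slow throughput → GIL contention; investigate with py-spy or move CPU-bound work to separate processes.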
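
The findings above amount to a small decision table over the CPU breakdown. The following sketch encodes that idea; the function name `classify_saturation` is hypothetical, and every threshold except the 10% steal figure is an illustrative assumption rather than a recommendation from this document.

```bash
# classify_saturation: hypothetical helper for illustration.
# Arguments: us sy wa st si, as percentages taken from top or mpstat.
# Prints one line per component at or above its (assumed) threshold.
classify_saturation() {
  awk -v us="$1" -v sy="$2" -v wa="$3" -v st="$4" -v si="$5" 'BEGIN {
    n = 0
    name[++n] = "user";    val[n] = us; lim[n] = 70; hint[n] = "profile the application (perf top, language profiler)"
    name[++n] = "system";  val[n] = sy; lim[n] = 30; hint[n] = "look for a syscall storm (perf top, strace -c)"
    name[++n] = "iowait";  val[n] = wa; lim[n] = 20; hint[n] = "check block I/O (iostat -xz 1)"
    name[++n] = "steal";   val[n] = st; lim[n] = 10; hint[n] = "check provider metrics"
    name[++n] = "softirq"; val[n] = si; lim[n] = 20; hint[n] = "check IRQ distribution (/proc/interrupts)"
    flagged = 0
    for (i = 1; i <= n; i++) {
      if (val[i] + 0 >= lim[i]) {
        printf "%s %.0f%% -> %s\n", name[i], val[i], hint[i]
        flagged++
      }
    }
    if (!flagged) print "no component above its threshold"
  }'
}
```

For instance, `classify_saturation 5 5 45 0 0` flags only iowait, which matches the first finding: high load with low user and system CPU is an I/O problem. More than one component can be flagged at once, and the output is a starting point for the next command, not a diagnosis.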

Differential cheatsheet

| Symptom | Most likely | Confirm with |
|---|---|---|
| High load, low %us+%sy | I/O blocked tasks (D state) | vmstat b, iostat -xz 1, /proc/<pid>/stack |
| High %us, one process dominant | App CPU bound | perf top -p <pid> or language profiler |
| High %sy across all cores | Syscall storm, FS, or network kernel work | perf top (system-wide), strace -c |
| High %wa | Storage saturation | iostat -xz 1, iotop |
| High %st | Hypervisor neighbor | Provider metrics; migrate |
| High %si on one core | NIC IRQ pinning | /proc/interrupts, IRQ affinity |
| Many short-lived procs | Fork storm, possibly cron, or shell loop | pidstat -p ALL -l 1 3 |

When to escalate

  • Anything that looks like a hardware issue (steady CPU throttling on bare metal → check dmesg for thermal events): get someone hands-on with the box.
  • A consistent steal time pattern on cloud → talk to provider; tuning won’t help.
  • Kernel CPU at high %sy with no obvious userspace cause — likely a kernel bug or driver issue; pull in kernel folks.
