Diagnosing High Load on Linux: CPU, Memory, and I/O

“Load is high” is the least useful alert in existence, because load average lumps together three completely different problems: the CPU is busy, the disk is slow, or you’re out of memory and swapping. Fix the wrong one and you’ve wasted twenty minutes.

After 25 years of “the box is slow” tickets, here’s the order I check things in, and how AI helps read the numbers without sending you down the wrong path.

What load average actually measures

uptime gives you three numbers — 1, 5, and 15-minute averages:

uptime
# 14:32:01 up 40 days, load average: 8.21, 6.40, 4.10

Load average is the average number of processes that are running or waiting to run, and crucially, on Linux it also counts processes in uninterruptible sleep, which usually means blocked on I/O. So a load of 8 on an 8-core box might mean every core is pegged, or it might mean almost nothing is using CPU and everything is waiting on a slow disk. Same number, opposite causes.

Rule of thumb: divide load by core count (nproc). Above ~1.0 per core sustained means contention. But it tells you that there’s a problem, not which problem.
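
The rule of thumb is a single division, sketched here as a small shell helper. The function name load_per_core is made up for this example, and the caller supplies both numbers, so nothing is assumed about where they come from.

```shell
# load_per_core LOAD CORES
# Prints LOAD divided by CORES to two decimals; fails if CORES is not positive.
load_per_core() {
    LC_ALL=C awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1
        printf "%.2f\n", load / cores
    }'
}

# On a live Linux box, feed it the 1-minute load and the online core count:
#   load_per_core "$(cut -d " " -f 1 /proc/loadavg)" "$(nproc)"
```

A result that stays above roughly 1.00 matches the contention threshold above; it still says nothing about which resource is contended.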

Step 1: Is it CPU or I/O wait?

Use top (press 1 for the per-core view) or, better, vmstat:

vmstat 1 5

Watch these columns:

r: processes running or waiting for CPU. If r stays above the core count, you’re CPU-bound.
wa: % of CPU time spent idle while waiting on I/O. High wa means the disk, not the CPU, is your problem.
si/so: memory swapped in/out per second. Anything sustained here means memory pressure.

This single command splits the problem three ways in five seconds. If wa is 40%, stop looking at CPU.
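
The same three-way split can be sketched as a filter over vmstat output. This is an illustration, not a standard tool: the function name classify_vmstat, the 20% wa cutoff, and the memory-then-I/O-then-CPU priority order are all assumptions. It finds columns by header name and skips the first sample, which vmstat reports as an average since boot.

```shell
# classify_vmstat CORES [WA_LIMIT]  < vmstat-output
# Prints one of: memory, io, cpu, unclear (or no-data with a failing status).
classify_vmstat() {
    LC_ALL=C awk -v cores="$1" -v wa_limit="${2:-20}" '
        $1 == "r" && $2 == "b" {              # header row: map names to positions
            for (i = 1; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        have_header && $1 ~ /^[0-9]+$/ {      # data row
            rows++
            if (rows == 1) next               # first row is the since-boot average
            n++
            r    += $(col["r"])
            wa   += $(col["wa"])
            swap += $(col["si"]) + $(col["so"])
        }
        END {
            if (n == 0) { print "no-data"; exit 1 }
            if (swap / n > 0)            print "memory"   # any sustained swapping
            else if (wa / n >= wa_limit) print "io"       # assumed cutoff
            else if (r / n > cores)      print "cpu"
            else                         print "unclear"
        }'
}

# Usage:
#   vmstat 1 5 | classify_vmstat "$(nproc)"
```

Swapping is checked first because heavy swap traffic also inflates wa; treat the verdict as a pointer to the next step, not a diagnosis.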

Step 2: If it’s CPU

Find the offender:

top -o %CPU
pidstat -u 1 5

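pidstat is better than top for catching short-lived spikes because it reports per-process usage for each interval, plus an average at the end. Once you have a PID, see what it’s doing. strace slows the traced process, so keep the sample short on a production box:

sudo strace -c -p <PID>   # syscall summary; press Ctrl-C to detach and print it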
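
To trim a pidstat sample down to the heaviest processes before pasting it anywhere, something like the following can help. The function name top_cpu_from_pidstat is hypothetical, and the sketch assumes the sysstat layout in which the final rows start with "Average:" and the header names the %CPU, PID, and Command columns.

```shell
# top_cpu_from_pidstat [N]  < pidstat-output
# Prints "%CPU PID Command" for the N busiest processes (default 5),
# using only the Average: rows at the end of the sample.
top_cpu_from_pidstat() {
    LC_ALL=C awk '
        $1 == "Average:" && $2 == "UID" {     # header of the average block
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        $1 == "Average:" && ("%CPU" in col) {
            print $(col["%CPU"]), $(col["PID"]), $(col["Command"])
        }' | sort -rn | head -n "${1:-5}"
}

# Usage:
#   pidstat -u 1 5 | top_cpu_from_pidstat 3
```

Columns are looked up by name rather than position because the set of columns differs between sysstat versions.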

Paste the pidstat output and the process command line into AI:

“This process is using 700% CPU on an 8-core box. Here’s its command line and a pidstat sample. What are the likely causes for this workload, and what read-only commands narrow it down?”

The model is good at pattern-matching “this is a GC storm” or “this is a busy-wait loop” from the shape of the data. I keep these in my Linux prompts.

Step 3: If it’s memory

Start with the honest view:

free -h

Ignore the scary “used” number — Linux uses free RAM for cache deliberately. The column that matters is available. If available is near zero and free -h shows swap filling, you have real pressure.
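
The available figure comes from the MemAvailable line in /proc/meminfo, so it can also be read as a percentage of total RAM. A minimal sketch, with a made-up function name and an optional file argument so it can be run against a saved copy:

```shell
# mem_available_pct [MEMINFO_FILE]
# Prints MemAvailable as a percentage of MemTotal, to one decimal.
mem_available_pct() {
    LC_ALL=C awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total <= 0 || avail == "") exit 1   # field missing or unreadable
            printf "%.1f\n", 100 * avail / total
        }' "${1:-/proc/meminfo}"
}

# Usage:
#   mem_available_pct              # live system
#   mem_available_pct saved.txt    # a captured /proc/meminfo
```

What counts as "near zero" depends on the workload; the percentage is only there to make the comparison across hosts easier.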

Then find who’s eating it:

ps aux --sort=-%mem | head

Check the kernel’s OOM history — this is the one people miss:

journalctl -k | grep -i "out of memory\|oom-kill"
dmesg -T | grep -i oom

If the OOM killer has been firing, that’s your answer: something requested more than you have, and the kernel started killing processes. AI is excellent at reading an OOM kill block and telling you which cgroup or process triggered it.
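
If the log shows many kills, a quick tally of victims by process name shows whether one service keeps dying. This sketch assumes the kernel's usual "Killed process <pid> (<name>)" wording, which can vary between kernel versions; the function name oom_victims is hypothetical.

```shell
# oom_victims  < kernel-log
# Prints "COUNT NAME" per killed process name, most frequent first.
oom_victims() {
    LC_ALL=C awk '
        match($0, /Killed process [0-9]+ \([^)]*\)/) {
            hit = substr($0, RSTART, RLENGTH)          # "Killed process 123 (name)"
            sub(/^Killed process [0-9]+ \(/, "", hit)  # strip everything before name
            sub(/\)$/, "", hit)                        # strip the closing parenthesis
            count[hit]++
        }
        END { for (name in count) print count[name], name }' | sort -rn
}

# Usage:
#   journalctl -k | oom_victims
```

The victim is not always the culprit: the kernel kills whichever process scores worst, which may not be the one whose growth caused the pressure.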

Step 4: If it’s I/O

Confirm with iostat:

iostat -xz 1 5

The columns that matter:

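%util: near 100% means the device was busy for almost the whole interval. On a single spinning disk that is saturation; SSDs, NVMe, and RAID arrays serve requests in parallel, so confirm with the next two columns.
await: average ms per I/O, including time spent queued (recent sysstat versions split it into r_await and w_await). Sustained 20 ms or more on spinning disks, or several ms or more on SSDs under load, means trouble.
aqu-sz: average queue depth (avgqu-sz in older sysstat). A queue that stays deep means requests are backing up.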
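
As a rough filter for the first of those columns, the sketch below prints devices whose %util meets a threshold. The function name saturated_devices and the default cutoff of 90 are assumptions; the column is located by header name because iostat's layout differs between sysstat versions.

```shell
# saturated_devices [LIMIT]  < iostat-output
# Prints "DEVICE %util" for every device row at or above LIMIT (default 90).
saturated_devices() {
    LC_ALL=C awk -v limit="${1:-90}" '
        NF == 0 { ucol = 0; next }            # blank line ends a device table
        $1 ~ /^Device/ {                      # header row: find the %util column
            for (i = 1; i <= NF; i++) if ($i == "%util") ucol = i
            next
        }
        ucol && NF >= ucol && $ucol + 0 >= limit { print $1, $ucol }
    '
}

# Usage:
#   iostat -xz 1 5 | saturated_devices 90
```

A device can appear once per report; the first report covers the time since boot, so repeated hits in the later reports are the ones that matter.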

Find the process doing the I/O:

sudo iotop -o

Or, without iotop:

sudo pidstat -d 1 5

Let AI assemble the picture

The real value of AI here is correlation across all four tools at once. Capture a snapshot:

{ uptime; echo ---; vmstat 1 3; echo ---; free -h; echo ---; iostat -xz 1 3; } > /tmp/perf.txt

Then:

“Here’s a snapshot of uptime, vmstat, free, and iostat from a server with high load. Tell me whether this is CPU-bound, memory-bound, or I/O-bound, point to the specific numbers that prove it, and give me the next read-only command to identify the responsible process.”

This is the same evidence-first, command-second discipline I use during production incident triage: the model reads and reasons, you run the commands. It keeps you from “fixing” CPU when the disk is the bottleneck.

Two things AI gets wrong

It over-trusts a single number. A high load average alone proves nothing. Make it justify its diagnosis with vmstat and iostat columns, not just the headline.
It suggests destructive fixes early. Models love kill -9 and “just restart it.” Confirm the cause first; a restart that hides an OOM loop or a disk-full condition buys you fifteen minutes and a worse 3am.

The shortcut that always works

When in doubt: vmstat 1 first. The r, wa, and si/so columns triage CPU-vs-IO-vs-memory faster than any dashboard. Then drill in with the right tool — pidstat for CPU, free/OOM logs for memory, iostat/iotop for I/O — and let AI correlate the snapshot. Diagnose the real bottleneck, fix it once.

Performance diagnoses from AI are assistive. Verify against your own metrics before acting.