Linux Block I/O Performance Investigation Prompt
Diagnose slow disk I/O, high iowait, queue depth saturation, and storage performance regressions using iostat, blktrace, fio, and per-device metrics.
- Target user: Linux sysadmins, SREs, and DBAs debugging storage performance
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux performance engineer who has tuned storage stacks across NVMe, SATA SSD, spinning disk, SAN, and software RAID. You can read `iostat -xz` output the way other engineers read application logs.

I will provide:
- The symptom (app latency spike, high iowait, queue depth alerts, slow restore, slow startup)
- System type: physical / VM / cloud (instance type + EBS/Persistent Disk/Local SSD class)
- Output of `iostat -xz 1 10`, `vmstat 1 5`, `mpstat -P ALL 1 5`
- The affected filesystem and mount options (`mount | grep <fs>`)
- The underlying device topology (`lsblk -f`, LVM/mdraid layers)
- Workload characteristics (random vs sequential, read vs write, IO size, fsync rate)

Your job:
1. **Decompose iowait honestly**: `%wa` from `iostat`/`top` is "CPU was idle waiting on I/O"; it is not a saturation metric. High `%wa` with low queue depth means a few synchronous waiters, not saturation.
2. **For each device** in `iostat -xz`, evaluate:
   - **`r/s` + `w/s`**: IOPS; compare to device-class spec (NVMe ~100k+, SATA SSD ~30k, 7200rpm spinner ~150)
   - **`rkB/s` + `wkB/s`**: throughput; compare to device or link bandwidth (SATA 600MB/s, NVMe 3GB/s+, cloud-EBS per-volume cap)
   - **`avgqu-sz`** / **`aqu-sz`**: average queue depth; high means the device is the bottleneck
   - **`await`**: average ms per I/O (queue + service); compare to device class
   - **`r_await` / `w_await`**: the read/write split; slow reads point to seek-heavy or contended access, slow writes to dirty-page flush
   - **`%util`**: busy time; misleading on parallel-capable devices (NVMe can sustain 100% util at half capacity)
3. **Common saturation patterns**:
   - **Cloud EBS / Persistent Disk burst credits exhausted** → throughput crashes from "OK" to baseline; check provider metrics
   - **High `await` with low queue depth** → device is slow per-I/O (latency); look at link, controller, contention
   - **High queue depth with low `%util`** → request submission bottleneck (driver, CPU, mq tuning)
   - **Asymmetric reads vs writes** → write cache flush (`barrier`, fsync) bursts; consider `commit=` mount option
   - **Suddenly slow after running fine** → log device space pressure (LVM thin pool 100%, ZFS pool 90%+, dm-crypt under pressure)
4. **For random IOPS-bound workloads** (databases): suggest `noatime,nodiratime`, `data=writeback` (ext4, with risk explanation), correct block scheduler (`mq-deadline` vs `none` vs `bfq`), readahead tuning.
5. **For throughput-bound workloads**: suggest larger I/O sizes, parallelism (`fio --iodepth=`), driver multiqueue (`nr_requests`).
6. **For VMs in cloud**: surface the per-instance and per-volume caps; tuning inside the VM doesn't beat the provider's throttle.
7. **Mark DESTRUCTIVE actions**: changing scheduler on production, resizing under load, mount-option changes that require remount.

---

System: [physical/VM/cloud + instance class]
Storage stack: [NVMe / SATA / EBS / etc. + filesystem + LVM/mdraid layers]
Symptom: [DESCRIBE]

`iostat -xz 1 10`:
```
[PASTE]
```

`vmstat 1 5`:
```
[PASTE]
```

`lsblk -f`, mount options:
```
[PASTE]
```

Workload: [DESCRIBE: IO pattern, sync rate, app]
Why this prompt works
iostat output is a wall of numbers and most “high iowait” debugging stops at “more IOPS!” This prompt forces a column-by-column read so you distinguish slow per-I/O (await) from saturation (queue depth) — entirely different fixes.
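One way to make that distinction concrete is Little's Law: average queue depth is roughly IOPS multiplied by average time per I/O. The sketch below is illustrative only; the function name `expected_qdepth` and the sample figures are assumptions, not part of the prompt, and it assumes `await` is in milliseconds as `iostat` reports it.

```shell
# Little's Law: requests in flight ~= arrival rate (IOPS) * time per request.
# expected_qdepth IOPS AWAIT_MS -> prints the implied average queue depth.
expected_qdepth() {
    awk -v iops="$1" -v await_ms="$2" \
        'BEGIN { printf "%.1f\n", iops * await_ms / 1000 }'
}

# Illustrative numbers, not measurements:
expected_qdepth 100 10     # 1.0  -> high await but a shallow queue: slow per-I/O
expected_qdepth 20000 2    # 40.0 -> deep queue: the device is kept saturated
```

If the `aqu-sz` that `iostat` reports is far from this product for the same interval, the interval probably mixes a burst with idle time, which is one more reason to look at a window of samples.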
How to use it
- Always include the device class. “Slow disk” on an NVMe is different from a 7200rpm spinner.
- Run iostat over a window, not a single snapshot. `iostat -xz 1 10` captures bursts.
- For cloud VMs, include provider metrics alongside `iostat`; the internal view can’t see the throttle.
- Identify the workload pattern (random vs sequential, sync vs buffered). Tuning differs.
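When reading a window of samples, keep in mind that the first `iostat` report covers the time since boot rather than the first interval. The following is a minimal sketch that averages queue depth and `%util` per device while skipping that first report. The function name `iostat_window_avg` is hypothetical, and the sample input is a trimmed, made-up table; real output has many more columns, which is why the columns are looked up by header name.

```shell
# iostat_window_avg: read `iostat -x` text on stdin and print per-device
# averages of queue depth and %util, ignoring the first (since-boot) report.
iostat_window_avg() {
    awk '
        /^Device/ {                       # header row starts a new report
            report++; in_dev = 1
            for (i = 1; i <= NF; i++) col[$i] = i
            q = ("aqu-sz" in col) ? col["aqu-sz"] : col["avgqu-sz"]
            u = col["%util"]
            next
        }
        NF == 0 { in_dev = 0; next }      # blank line ends the device table
        in_dev && report > 1 {            # accumulate interval reports only
            n[$1]++; qsum[$1] += $q; usum[$1] += $u
        }
        END {
            for (d in n)
                printf "%s samples=%d aqu-sz=%.2f util=%.1f\n",
                       d, n[d], qsum[d] / n[d], usum[d] / n[d]
        }
    ' | sort
}

# Trimmed, made-up sample: the first table is the since-boot report.
iostat_window_avg <<'EOF'
Device   r/s    w/s   aqu-sz  %util
sda      1.0    2.0     0.10    1.0

Device   r/s    w/s   aqu-sz  %util
sda    100.0   50.0     4.00   90.0

Device   r/s    w/s   aqu-sz  %util
sda    120.0   40.0     6.00  100.0
EOF
```

In practice this would be fed with `iostat -xz 1 10 | iostat_window_avg`. Because `-z` drops devices that were idle in an interval, the averages cover only the intervals in which a device appeared.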
Useful commands
# Triage
iostat -xz 1 10 # per-device extended stats
vmstat 1 5 # bi/bo, blocked tasks, swap
mpstat -P ALL 1 5 # per-CPU %wa
dstat -tcdmn 1 # combined view (if installed)
# Per-process I/O
sudo iotop -oP
pidstat -d 1 5
# Block trace (deep)
sudo blktrace -d /dev/nvme0n1 -o trace
# In another shell: hit the workload, then Ctrl-C
sudo blkparse -i trace | head -200
# Filesystem latency
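sudo bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; } kretprobe:vfs_read /@start[tid]/ { @ns[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }' # eBPF: vfs_read latency histogram per command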
sudo perf trace -e 'block:*' -a sleep 5
# Queue / scheduler
cat /sys/block/<dev>/queue/scheduler
cat /sys/block/<dev>/queue/nr_requests
cat /sys/block/<dev>/queue/read_ahead_kb
# Benchmark (NEVER on live production data); an async engine is needed for --iodepth to take effect
fio --name=randread --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
--runtime=30 --time_based --direct=1 --filename=/dev/<TEST-DEV>
# Mount options
mount | grep <fs>
sudo tune2fs -l /dev/<dev> | head -20 # ext4
# Cloud-specific
# AWS: CloudWatch VolumeReadBytes/VolumeWriteBytes, BurstBalance
# GCP: Cloud Monitoring disk/* metrics
# Azure: Premium SSD perf tier vs IOPS used
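The triage commands above can be bundled so that every input the prompt asks for is captured in one pass. This is only a sketch: `section` and `collect_io_evidence` are hypothetical helper names, and the commands run one after another, so the samples are close in time but not simultaneous.

```shell
# section LABEL CMD [ARGS...]: run CMD and print its output under a label.
section() {
    label=$1; shift
    printf '=== %s ===\n' "$label"
    "$@" 2>&1 || printf '(failed: %s)\n' "$1"
    printf '\n'
}

# collect_io_evidence FS_PATTERN: gather the prompt inputs sequentially
# (roughly 20 seconds in total with the intervals used here).
collect_io_evidence() {
    section 'iostat -xz 1 10'   iostat -xz 1 10
    section 'vmstat 1 5'        vmstat 1 5
    section 'mpstat -P ALL 1 5' mpstat -P ALL 1 5
    section 'lsblk -f'          lsblk -f
    section "mount | grep $1"   sh -c 'mount | grep -- "$1"' sh "$1"
}
```

A typical call would be `collect_io_evidence /data > io-evidence.txt`, with the resulting file pasted into the prompt's input slots.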
Differential cheatsheet
| Symptom | Likely cause | Confirm |
|---|---|---|
| High `await`, low `aqu-sz` | Per-I/O latency (link, controller, encryption) | Per-link tests; check dm-crypt; LUN paths |
| High `aqu-sz` (>2-4), high `%util` | Device saturation | Compare IOPS/throughput to spec |
| `%wa` high, `aqu-sz` low | Few synchronous waiters (fsync-heavy) | `pidstat -d`; identify app sync pattern |
| Sudden cliff | Burst credits exhausted / dm-thin full / ZFS ARC pressure | Provider metrics; `dmsetup status`; `zpool list` |
| High write `await` only | Cache flush / journal pressure | `commit=`, mount options, log device |
| Stable IOPS at exactly N | Provider cap | Provider docs for volume class |
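The device-level rows of the cheatsheet can be sketched as a small decision function. The thresholds are assumptions chosen for illustration: a queue depth above 4 (the upper end of the ">2-4" range), `%util` of at least 90, and a caller-supplied `SLOW_AWAIT_MS` that depends on the device class. The name `classify_device` is hypothetical.

```shell
# classify_device AWAIT_MS AQU_SZ UTIL_PCT SLOW_AWAIT_MS
# SLOW_AWAIT_MS: per-I/O latency you consider slow for this device class.
classify_device() {
    awk -v await="$1" -v q="$2" -v util="$3" -v slow="$4" 'BEGIN {
        deep = (q > 4)                      # assumed "deep queue" threshold
        if (deep && util >= 90)   print "device saturation"
        else if (deep)            print "submission bottleneck"
        else if (await > slow)    print "per-I/O latency"
        else                      print "no device-level signal"
    }'
}

# Illustrative inputs, not measurements:
classify_device 0.3 12  98 2    # device saturation
classify_device 15  0.8 30 2    # per-I/O latency
classify_device 1   8   40 2    # submission bottleneck
```

The rows that depend on `%wa`, provider metrics, or pool state are left out, since they need data that `iostat` alone does not provide.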
Common findings this catches
- EBS gp3 pinned at its 3,000 IOPS baseline → this is a flat cap, since gp3 does not burst (gp2 is the class that bursts on credits and then drops to baseline). Provision more IOPS, or move to io2.
- NVMe at 100% util but 60% of spec → driver queue depth too low; tune `nr_requests` or use `blk-mq` with multiple queues.
- `await` >10ms on local SSD → likely controller/firmware issue or saturation; benchmark to confirm.
- `data=ordered` (ext4 default) with fsync-heavy workload → high write `await`. Test `data=writeback` carefully.
- dm-thin pool >80% full → silent latency spike from metadata pressure; expand or rebalance.
- LVM cache (`lvmcache`) thrashing when working set > cache size → disable cache or grow.
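For the dm-thin finding, pool fullness can be read from `dmsetup status`. The sketch below assumes the field order documented for the kernel's thin-pool target (transaction id, then used/total metadata blocks, then used/total data blocks); the function name `thin_pool_usage`, the pool name, and the block counts in the sample line are made up for illustration.

```shell
# thin_pool_usage: read `dmsetup status` lines on stdin and print data and
# metadata usage for every thin-pool target.
thin_pool_usage() {
    awk '$4 == "thin-pool" {
        split($6, m, "/")                 # used/total metadata blocks
        split($7, d, "/")                 # used/total data blocks
        sub(/:$/, "", $1)                 # strip the colon after the name
        printf "%s data=%.1f%% meta=%.1f%%\n",
               $1, 100 * d[1] / d[2], 100 * m[1] / m[2]
    }'
}

# Made-up status line for illustration:
echo 'vg0-pool-tpool: 0 20971520 thin-pool 1 280/4096 139264/163840 - rw discard_passdown queue_if_no_space - 1024' \
    | thin_pool_usage
```

On a real host the input would come from `sudo dmsetup status | thin_pool_usage`. Both percentages matter, because a pool can run out of metadata space before it runs out of data space.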
Tuning starter pack
# Per-workload mount options (ext4, IOPS-heavy DB)
mount -o remount,noatime,nodiratime /data
# Scheduler (NVMe: usually `none`; SSD: `mq-deadline`; HDD: `bfq`)
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
# Increase request queue depth (NVMe); with scheduler `none` the kernel rejects values above the hardware queue depth
echo 1024 | sudo tee /sys/block/nvme0n1/queue/nr_requests
# Reduce dirty page pressure (large memory boxes)
sudo sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
sudo sysctl -w vm.dirty_bytes=$((1024*1024*1024))
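Before applying any of these changes, it helps to record the current values so they can be restored. The following is a rough sketch under stated assumptions: `active_scheduler` and `save_queue_settings` are hypothetical helpers, and the optional second argument exists only so the function can be pointed at a directory other than `/sys` for testing.

```shell
# active_scheduler: print the bracketed (active) entry from a scheduler line,
# e.g. "[none] mq-deadline kyber" -> "none".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# save_queue_settings DEV [SYSFS_ROOT]: print the current queue settings as
# the commands that would restore them. SYSFS_ROOT defaults to /sys.
save_queue_settings() {
    q="${2:-/sys}/block/$1/queue"
    printf 'echo %s | sudo tee %s/scheduler\n' \
        "$(active_scheduler < "$q/scheduler")" "$q"
    for f in nr_requests read_ahead_kb; do
        printf 'echo %s | sudo tee %s/%s\n' "$(cat "$q/$f")" "$q" "$f"
    done
}

echo 'mq-deadline kyber [bfq] none' | active_scheduler    # prints: bfq
```

Running `save_queue_settings nvme0n1 > restore-nvme0n1.sh` before tuning would leave a file of restore commands. Note that `sysctl -w` and sysfs writes do not survive a reboot, so the previous values also return on restart.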
When to escalate
- Provider throttle suspected: open a support ticket with metric evidence; tuning inside the VM won’t help.
- Hardware errors in `dmesg` (UNC, CRC, link reset): disk replacement, not tuning.
- Database-specific (PostgreSQL/MySQL) sync patterns dominating: coordinate with DBA on `commit=`, sync vs async replication.
Related prompts
- Linux High Load & CPU Runaway Investigation Prompt: Diagnose high load average, CPU saturation, run-queue pressure, IRQ storms, and steal time on Linux servers; distinguish user CPU vs system CPU vs I/O wait vs steal.
- Linux Disk Full / Inode Exhaustion Diagnosis Prompt: Diagnose why a Linux filesystem is full or out of inodes, including deleted-but-held files, journal bloat, reserved blocks, and hidden mount-shadowed data.
- Linux NUMA Imbalance Investigation Prompt: Diagnose NUMA-related performance issues such as cross-node memory access, allocation imbalance, and scheduler migration, and how to pin workloads to nodes.