Linux Block I/O Scheduler Selection & Tuning Review Prompt
Review the per-device block I/O scheduler (mq-deadline, bfq, kyber, none) and queue tunables against a workload and storage type, and recommend a persistent, verifiable configuration.
- Target user: Linux sysadmins and storage/performance engineers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux performance engineer who selects and tunes block I/O schedulers per device for a given workload. Recommend changes per device (not globally), explain each trade-off, and make every change verifiable and persistent.

I will provide:
- The device class and workload (NVMe SSD on a DB host, SATA SSD, spinning disk, hardware-RAID LUN, or a multi-tenant host mixing latency- and throughput-sensitive apps)
- Output of `cat /sys/block/<dev>/queue/scheduler`, `.../queue/nr_requests`, `.../queue/read_ahead_kb`, `.../queue/rotational`, `.../queue/max_sectors_kb`, and `lsblk -d -o NAME,ROTA,SCHED`
- Symptoms (latency under mixed load, a throughput ceiling, starvation or unfair sharing between processes) and any tuned profile in use

Your job:
1. **Read current state**: report each device's scheduler, rotational flag, and queue depth (`nr_requests`), and say whether the current choice fits the hardware (e.g. `none` or `mq-deadline` for fast NVMe, `bfq` where per-process fairness matters, a reordering scheduler such as `mq-deadline` or `bfq` for spinning disks).
2. **Match scheduler to goal**: recommend per device: `none` for low-latency NVMe where the device does its own queuing and reordering, `mq-deadline` as a safe general default, `bfq` for desktop or mixed workloads that need fairness or cgroup I/O weighting, `kyber` for fast multiqueue devices where read and write latency targets should be set. Explain why.
3. **Tune queue knobs**: advise on `nr_requests`, `read_ahead_kb` (higher for sequential workloads and spinning disks, lower for random I/O on SSDs), and `rq_affinity` where relevant, tying each value to the workload.
4. **Account for the stack**: note that hardware RAID, SAN LUNs, and virtio disks often do best with `none` because the array or hypervisor already reorders requests, and that cgroup v2 `io.weight` only takes effect with `bfq` or the `io.cost` controller (the cost model).
5. **Persist correctly**: give a udev rule (matching on device attributes) and/or a tuned profile rather than a transient `echo` to sysfs, keyed to the device class so it survives reboots and device renaming.
6. **Verify**: confirm with `cat /sys/block/<dev>/queue/scheduler`, then re-measure tail latency and throughput with `fio` or the real workload, before and after the change.

Output: (a) per-device current state, (b) scheduler + queue recommendation with rationale, (c) udev/tuned persistence, (d) verification + before/after measurement plan. Change and measure one device class at a time.
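
When filling in the inputs this prompt asks for, it can help to collect the sysfs values for every device in one pass. The following is a minimal sketch, not part of the prompt itself: the function names `active_scheduler` and `report_block_state` are hypothetical, and the optional sysfs-root argument exists only so the function can be pointed at a fake tree for testing. It relies on the kernel marking the active scheduler with square brackets in `queue/scheduler`.

```shell
#!/bin/sh
# Sketch: print the per-device queue state that the prompt asks for.
# Assumption: function names and the optional root argument are illustrative.

# Extract the active scheduler. The kernel brackets it, e.g.
# "[mq-deadline] kyber bfq none"; a bare "none" is passed through unchanged.
active_scheduler() {
    case "$1" in
        *\[*\]*)
            s=${1#*\[}                 # drop everything up to the first "["
            printf '%s\n' "${s%%\]*}"  # drop everything from the first "]" on
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# Print one line per block device: name, active scheduler, and queue knobs.
# A knob that cannot be read is reported as "?".
report_block_state() {
    root=${1:-/sys}
    for q in "$root"/block/*/queue; do
        [ -d "$q" ] || continue        # glob did not match any device
        dev=$(basename "$(dirname "$q")")
        sched=$(active_scheduler "$(cat "$q/scheduler" 2>/dev/null)")
        printf '%s scheduler=%s' "$dev" "${sched:-unknown}"
        for knob in rotational nr_requests read_ahead_kb max_sectors_kb rq_affinity; do
            printf ' %s=%s' "$knob" "$(cat "$q/$knob" 2>/dev/null || echo '?')"
        done
        printf '\n'
    done
}

report_block_state
```

Pasting this output alongside `lsblk -d -o NAME,ROTA,SCHED` gives the model the current state for step 1 without hand-copying each file.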
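
For comparison with the persistence advice the model returns in part (c), here is a hedged sketch of what attribute-matched udev rules can look like. The scheduler choices, the `read_ahead_kb` value, the function name `emit_io_scheduler_rules`, and the file name `60-ioschedulers.rules` are all illustrative assumptions, not recommendations. Each scheduler must already appear in the device's `queue/scheduler` list (for example, the `bfq` module must be loaded) or the assignment will not take effect.

```shell
#!/bin/sh
# Sketch: emit example udev rules that pin a scheduler per device class.
# Assumption: every value below is a placeholder to adapt per workload.

emit_io_scheduler_rules() {
    cat <<'EOF'
# NVMe namespaces: leave request ordering to the device.
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"
# Non-rotational SCSI/SATA disks: mq-deadline as a general default.
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
# Rotational disks: bfq plus a larger read-ahead for sequential reads.
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq", ATTR{queue/read_ahead_kb}="1024"
EOF
}

# Print the rules to stdout. To install them (as root), something like:
#   emit_io_scheduler_rules > /etc/udev/rules.d/60-ioschedulers.rules
#   udevadm control --reload
#   udevadm trigger --subsystem-match=block --action=change
emit_io_scheduler_rules
```

The rules match on kernel name pattern and the `rotational` attribute, not on a fixed name like `sda`, so they keep applying if device enumeration order changes. After triggering, `cat /sys/block/<dev>/queue/scheduler` should show the expected scheduler in brackets.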