Linux CPU & Memory Hotplug Management Prompt
Safely online, offline, and troubleshoot CPU and memory hotplug on physical and virtual hosts — VM right-sizing, isolating a flaky core, balloon/memory-block operations, and the sysfs interfaces that make it deterministic.
- Target user: Linux admins managing VMs and dynamically-sized hosts
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a systems engineer who resizes running hosts without rebooting and knows the sysfs hotplug interfaces cold.

I will provide:
- Bare metal vs VM (and hypervisor), kernel version
- The goal: online newly-added vCPUs/RAM, offline a suspect core, or debug why hotplug "didn't take"
- Whether memory hotplug uses ACPI, balloon, or virtio-mem
- NUMA topology if relevant

Walk through this deterministically:
1. **CPU hotplug interface**: enumerate logical CPUs via `/sys/devices/system/cpu/cpu*/online`, show how to offline (`echo 0 > .../online`) and online a core, and confirm with `lscpu --extended` and `/proc/cpuinfo`. Explain that CPU0 is often non-offlineable and why.
2. **Why an offline isn't instant**: tasks pinned via affinity, RT/IRQ affinity, and per-CPU kernel threads must migrate first; show how to check IRQ affinity (`/proc/irq/*/smp_affinity`) and move interrupts off a core before offlining it.
3. **Memory hotplug interface**: list memory blocks under `/sys/devices/system/memory/memory*/`, read each block's `state` and `removable` flag (a legacy hint on recent kernels, so `1` is not a guarantee that the block can be offlined), and online new blocks with the right policy (`online_movable` vs `online_kernel`). Explain why the wrong choice, which lets unmovable kernel allocations land in memory you meant to keep movable, blocks later removal.
4. **Memory offline reality**: explain that offlining can fail when a block contains unmovable kernel allocations, how to read the failure, and why `movable_node`/`ZONE_MOVABLE` planning at boot determines whether removal is even possible.
5. **Virtual machines**: distinguish ACPI hotplug, virtio balloon, and virtio-mem; show how the guest sees a hypervisor-driven resize and what to verify on the guest side (udev rules that auto-online new memory/CPUs).
6. **NUMA awareness**: show how added CPUs/memory land on a node, and how to keep a workload's CPU and memory on the same node after a resize.
7. **Verify**: confirm the new resources appear in the scheduler and allocator (`nproc`, `free -h`, `numactl -H`) and that nothing regressed (no stuck per-CPU threads, no IRQ stranded on an offlined core).

For each step give the exact sysfs path/command, the healthy vs failed output, and the recovery if an offline won't complete. End with the result and any residual constraint (e.g. block X is permanently non-removable and why).

Bias toward: sysfs determinism over hypervisor GUIs, migrating IRQs/tasks before offlining, and movable-zone planning for memory removal.
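For orientation, the checks that steps 1 and 2 of the prompt ask for can be sketched as a read-only shell helper. This is a minimal sketch under stated assumptions, not a definitive procedure: `SYSFS_ROOT`, `PROC_ROOT`, `cpu_states`, and `irqs_on_cpu` are hypothetical names introduced here for illustration. It reads `smp_affinity_list` (the range-list sibling of the `smp_affinity` mask named in the prompt) because it is easier to parse, and it assumes a CPU directory with no `online` file is always online, as is common for cpu0 on x86.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they read sysfs/procfs and change nothing.
# The two roots are overridable so the functions can be pointed at a copy of the tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
PROC_ROOT="${PROC_ROOT:-/proc}"

# Print "cpuN <state>" for every logical CPU directory.
# Assumption: a CPU with no "online" file cannot be offlined and is reported
# as "always-online".
cpu_states() (
    for dir in "$SYSFS_ROOT"/devices/system/cpu/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        name=${dir##*/}
        if [ -r "$dir/online" ]; then
            case "$(cat "$dir/online")" in
                1) state=online ;;
                *) state=offline ;;
            esac
        else
            state=always-online
        fi
        printf '%s %s\n' "$name" "$state"
    done
)

# Print the number of every IRQ whose affinity list includes CPU $1.
# smp_affinity_list holds comma-separated CPUs and ranges, e.g. "0,2,4-7".
irqs_on_cpu() (
    cpu=$1
    for f in "$PROC_ROOT"/irq/*/smp_affinity_list; do
        [ -r "$f" ] || continue
        irq=${f%/smp_affinity_list}
        irq=${irq##*/}
        list=$(cat "$f" 2>/dev/null) || continue
        for part in $(printf '%s\n' "$list" | tr ',' ' '); do
            lo=${part%-*}   # "4-7" -> 4, "3" -> 3
            hi=${part#*-}   # "4-7" -> 7, "3" -> 3
            if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
                printf '%s\n' "$irq"
                break
            fi
        done
    done
)
```

With the default roots, `cpu_states` lists the host's CPUs and `irqs_on_cpu 3` prints the IRQs that may still be delivered to CPU 3. Any IRQ it reports is a candidate to re-home before writing `0` to `/sys/devices/system/cpu/cpu3/online` as root.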
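The memory-block inspection in steps 3 and 4 can be sketched the same way. Again this is an illustrative, read-only sketch: `SYSFS_ROOT`, `mem_summary`, and `offline_blocks` are hypothetical names, while `state`, `valid_zones`, and `block_size_bytes` are the sysfs attributes it reads. It assumes `block_size_bytes` holds a hexadecimal value without a `0x` prefix.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they read sysfs and change nothing.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print one line: count of online and offline memory blocks and the block size in MiB.
mem_summary() (
    base="$SYSFS_ROOT/devices/system/memory"
    if [ ! -r "$base/block_size_bytes" ]; then
        echo "memory hotplug sysfs interface not present" >&2
        return 1
    fi
    online=0
    offline=0
    for dir in "$base"/memory[0-9]*; do
        [ -r "$dir/state" ] || continue
        case "$(cat "$dir/state")" in
            online)  online=$((online + 1)) ;;
            offline) offline=$((offline + 1)) ;;
        esac
    done
    # Assumption: block_size_bytes is hex without a 0x prefix, e.g. "8000000".
    block_bytes=$((0x$(cat "$base/block_size_bytes")))
    echo "online=$online offline=$offline block_mib=$((block_bytes / 1048576))"
)

# Print "memoryN <valid_zones>" for each offline block, so the zones it can be
# onlined into are visible before a policy is chosen.
offline_blocks() (
    for dir in "$SYSFS_ROOT"/devices/system/memory/memory[0-9]*; do
        [ -r "$dir/state" ] || continue
        [ "$(cat "$dir/state")" = "offline" ] || continue
        zones=unknown
        [ -r "$dir/valid_zones" ] && zones=$(cat "$dir/valid_zones")
        printf '%s %s\n' "${dir##*/}" "$zones"
    done
)
```

A block reported by `offline_blocks` could then be onlined as root with `echo online_movable > /sys/devices/system/memory/memoryN/state`, where `N` is that block's index. Preferring `online_movable` for memory that may later be removed keeps unmovable kernel allocations out of it.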