
Linux CPU & Memory Hotplug Management Prompt

Safely online, offline, and troubleshoot CPU and memory hotplug on physical and virtual hosts — VM right-sizing, isolating a flaky core, balloon/memory-block operations, and the sysfs interfaces that make it deterministic.

Target user: Linux admins managing VMs and dynamically-sized hosts
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a systems engineer who resizes running hosts without rebooting and knows the sysfs hotplug interfaces cold.

I will provide:
- Bare metal vs VM (and hypervisor), kernel version
- The goal: online newly-added vCPUs/RAM, offline a suspect core, or debug why hotplug "didn't take"
- Whether memory hotplug uses ACPI, balloon, or virtio-mem
- NUMA topology if relevant

Walk through this deterministically:

1. **CPU hotplug interface**: enumerate logical CPUs via `/sys/devices/system/cpu/cpu*/online`, show how to offline (`echo 0 > .../online`) and online (`echo 1 > .../online`) a core, and confirm with `lscpu --extended` and `/proc/cpuinfo`. Explain why CPU0 often cannot be offlined (on x86 it typically has no `online` file at all).

2. **Why an offline isn't instant**: tasks pinned to the core by affinity, real-time threads, IRQs routed to it, and per-CPU kernel threads all have to be migrated or parked first. Show how to check IRQ affinity (`/proc/irq/*/smp_affinity`) and move interrupts off a core before offlining it.

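3. **Memory hotplug interface**: list memory blocks under `/sys/devices/system/memory/memory*/`, read each block's `state` and `valid_zones` (on recent kernels the legacy `removable` flag only reports whether the kernel supports offlining at all, not whether this block can go), and online new blocks with the right policy (`online_movable` vs `online_kernel`). Explain why the wrong choice blocks later removal: `online_kernel` lets unmovable kernel allocations land in the block, while putting too much memory in ZONE_MOVABLE starves the kernel of memory it can use for its own allocations.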

4. **Memory offline reality**: explain that offlining can fail when a block contains unmovable kernel allocations, how to read the failure (the write to `state` returns an error, typically `EBUSY`, and the kernel log carries the detail), and why `movable_node`/`ZONE_MOVABLE` planning at boot determines whether removal is even possible.

5. **Virtual machines**: distinguish ACPI hotplug, virtio balloon, and virtio-mem; show how the guest sees a hypervisor-driven resize and what to verify on the guest side: udev rules that auto-online new memory/CPUs, or the kernel's own `/sys/devices/system/memory/auto_online_blocks` policy for memory.

6. **NUMA awareness**: show which node newly added CPUs and memory land on, and how to keep a workload's CPU and memory on the same node after a resize.

7. **Verify**: confirm the new resources are visible to the scheduler and allocator (`nproc`, `free -h`, `numactl -H`) and that nothing regressed (no stuck per-CPU threads, no IRQ stranded on an offlined core).
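As a reference for steps 1 and 2, the CPU side might be sketched as below. This is a minimal sketch, not a vetted tool: `SYSFS_ROOT`, `PROC_ROOT`, and all function names are hypothetical, introduced here so the logic can be dry-run against a copied or fake tree. It reads `smp_affinity_list` (the list-format sibling of `smp_affinity`), and the IRQ warning is informational only, since the kernel re-routes most interrupts itself during an offline.

```shell
# Hypothetical overrides; on a real host they default to /sys and /proc.
: "${SYSFS_ROOT:=/sys}"
: "${PROC_ROOT:=/proc}"

# Succeed if CPU $1 appears in a kernel cpulist such as "0-3,8".
cpu_in_list() (
    cpu=$1
    IFS=,
    for range in $2; do
        lo=${range%-*}
        hi=${range#*-}
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
)

# Print "cpuN <state>" per logical CPU; a CPU with no `online` file
# (commonly cpu0 on x86) is reported as "fixed".
list_cpu_state() (
    for dir in "$SYSFS_ROOT"/devices/system/cpu/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        name=${dir##*/}
        if [ -r "$dir/online" ]; then
            echo "$name $(cat "$dir/online")"
        else
            echo "$name fixed"
        fi
    done
)

# Print the IRQ numbers whose affinity list includes CPU $1.
irqs_on_cpu() (
    cpu=$1
    for f in "$PROC_ROOT"/irq/[0-9]*/smp_affinity_list; do
        [ -r "$f" ] || continue
        if cpu_in_list "$cpu" "$(cat "$f")"; then
            irq_dir=${f%/smp_affinity_list}
            echo "${irq_dir##*/}"
        fi
    done
)

# Offline CPU $1 (needs root on a real host). Warns about IRQs that still
# list the CPU, then writes 0 and checks that the kernel accepted it.
offline_cpu() (
    cpu=$1
    ctl="$SYSFS_ROOT/devices/system/cpu/cpu$cpu/online"
    [ -w "$ctl" ] || { echo "cpu$cpu: no writable online file" >&2; return 1; }
    pending=$(irqs_on_cpu "$cpu" | tr '\n' ' ')
    [ -z "$pending" ] || echo "cpu$cpu: IRQs still list this CPU: $pending" >&2
    echo 0 > "$ctl" && [ "$(cat "$ctl")" = "0" ]
)
```

On a real host the functions would be called in the order `list_cpu_state`, `irqs_on_cpu 3`, `offline_cpu 3`, with the last one run as root.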

For each step give the exact sysfs path/command, the healthy vs failed output, and the recovery if an offline won't complete. End with the result and any residual constraint (e.g. block X is permanently non-removable and why).
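For steps 3 and 4, the memory-block side could look roughly like the sketch below. The function names and the `SYSFS_ROOT` override are hypothetical, added only for illustration; the sysfs files themselves (`block_size_bytes`, `state`, `valid_zones`) are the kernel's. A rejected write is reported rather than retried, because the usual cause is an unmovable allocation that a retry will not fix.

```shell
# Hypothetical override; on a real host it defaults to /sys.
: "${SYSFS_ROOT:=/sys}"

# Print the memory block size in MiB (block_size_bytes is hex without "0x").
block_size_mib() (
    hex=$(cat "$SYSFS_ROOT/devices/system/memory/block_size_bytes")
    echo $(( 0x$hex / 1048576 ))
)

# Print "<block> <state> <valid_zones>" for each memory block;
# "-" stands in when valid_zones is missing or empty.
list_memory_blocks() (
    for dir in "$SYSFS_ROOT"/devices/system/memory/memory[0-9]*; do
        [ -d "$dir" ] || continue
        zones=$(cat "$dir/valid_zones" 2>/dev/null)
        echo "${dir##*/} $(cat "$dir/state") ${zones:--}"
    done
)

# Ask the kernel to move block $1 to state $2
# (online, online_movable, online_kernel, or offline; needs root).
set_block_state() (
    blk=$1
    want=$2
    ctl="$SYSFS_ROOT/devices/system/memory/$blk/state"
    { echo "$want" > "$ctl"; } 2>/dev/null || {
        echo "$blk: kernel rejected '$want'; see the kernel log" >&2
        return 1
    }
    # Read the state back instead of assuming the write took effect.
    echo "$blk now $(cat "$ctl")"
)
```

A typical sequence would be `list_memory_blocks` to find an `offline` block whose `valid_zones` includes `Movable`, then `set_block_state memoryN online_movable` for that block.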

Bias toward: sysfs determinism over hypervisor GUIs, migrating IRQs/tasks before offlining, and movable-zone planning for memory removal.
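The verification and NUMA checks in steps 6 and 7 can also be read straight from sysfs. The following is a sketch under the same assumptions as before: `SYSFS_ROOT` and the function names are hypothetical, and the expected CPU count is something the operator supplies, not something the script can infer.

```shell
# Hypothetical override; on a real host it defaults to /sys.
: "${SYSFS_ROOT:=/sys}"

# Count the CPUs named in a kernel cpulist such as "0-3,8".
count_cpulist() (
    total=0
    IFS=,
    for range in $1; do
        lo=${range%-*}
        hi=${range#*-}
        total=$(( $total + $hi - $lo + 1 ))
    done
    echo "$total"
)

# Print "nodeN cpus=<cpulist> mem_kb=<MemTotal>" for every NUMA node.
numa_summary() (
    for dir in "$SYSFS_ROOT"/devices/system/node/node[0-9]*; do
        [ -d "$dir" ] || continue
        mem=$(awk '$3 == "MemTotal:" {print $4}' "$dir/meminfo")
        echo "${dir##*/} cpus=$(cat "$dir/cpulist") mem_kb=$mem"
    done
)

# Succeed only if the kernel's online CPU mask holds exactly $1 CPUs.
expect_online_cpus() (
    have=$(count_cpulist "$(cat "$SYSFS_ROOT/devices/system/cpu/online")")
    [ "$have" -eq "$1" ] || {
        echo "expected $1 online CPUs, found $have" >&2
        return 1
    }
)
```

After a resize, `expect_online_cpus 8` followed by `numa_summary` gives a quick pass/fail plus the per-node layout to compare against `numactl -H`.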
