Kubernetes Topology Manager & NUMA CPU Policy Prompt
Tune kubelet Topology Manager, CPU Manager, and Memory Manager policies so latency-sensitive pods get NUMA-aligned CPUs and memory instead of cross-socket performance cliffs.
- Target user
- Engineers running latency-sensitive or HPC workloads
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior performance engineer who has fixed tail-latency regressions caused by pods straddling NUMA nodes. I want to design a kubelet topology configuration for latency-sensitive workloads. I will provide: - The node hardware (sockets, cores per socket, NUMA layout, whether NICs/GPUs are NUMA-attached) - The workload's latency requirements and resource shape - Current kubelet config and observed symptoms Your job: 1. **Explain the three managers**: CPU Manager (`static` policy pins exclusive cores to Guaranteed pods), Memory Manager (`Static` policy reserves NUMA-local memory), and Topology Manager (the coordinator that aligns all hint providers to a single NUMA node). 2. **Choose a Topology Manager policy** and justify it: - `none` (default, no alignment) - `best-effort` (prefer alignment, admit anyway) - `restricted` (reject pod if device + CPU + memory hints can't align) - `single-numa-node` (require everything on ONE NUMA node — strongest, most rejections) 3. **Set the prerequisites**: the pod must be Guaranteed QoS (requests == limits, integer CPU) for static CPU pinning to apply; explain why fractional CPU requests silently fall back to the shared pool. 4. **Account for NUMA-attached devices**: if a GPU or SR-IOV NIC lives on socket 1, a `single-numa-node` pod needing it must get CPUs and memory from socket 1 too — show how the device plugin's topology hints drive this. 5. **Reserve system cores**: `reservedSystemCPUs` so kubelet/system daemons don't land on the pinned cores. 6. **Flag the admission cliff**: `restricted`/`single-numa-node` cause `TopologyAffinityError` pod rejections when a node is fragmented — show how to detect and the capacity trade-off. 7. **Mark DESTRUCTIVE**: kubelet config changes require a kubelet restart and may evict/reject running pods; node drain first. Output format: a per-manager explanation, the kubelet config snippet, a sample Guaranteed pod, and detection commands. Do not apply — this is a node-disruptive change to stage carefully. --- Node hardware: [DESCRIBE NUMA LAYOUT] Workload: [DESCRIBE] Current kubelet config + symptoms: ``` [PASTE] ```
Why this prompt works
NUMA-related performance cliffs are nearly invisible from inside Kubernetes: the pod runs, the metrics look fine, but p99 latency is double what the hardware should deliver because the process is reaching memory or a NIC across a socket boundary. The fix lives in three kubelet managers that have to be configured together — CPU Manager to pin cores, Memory Manager to reserve local memory, and Topology Manager to align them — and getting one of the three wrong (or leaving the pod at Burstable QoS) means none of the pinning takes effect, silently.
This prompt works because it refuses to treat the managers in isolation. It makes the assistant explain how Topology Manager coordinates the hint providers, then forces the Guaranteed-QoS-with-integer-CPU prerequisite to the surface — the single most common reason pinning “doesn’t work.” The device-topology section handles the case that actually matters for GPU and SR-IOV workloads, where the CPUs and memory must follow the device to its socket. And the admission-cliff warning makes the capacity cost of single-numa-node explicit before you discover it as a wave of TopologyAffinityError rejections.
Crucially it marks the whole change as node-disruptive: kubelet config edits need a restart, so the prompt insists on draining and rolling one pool at a time rather than applying anything live. The AI designs and explains; you stage the rollout. Related QoS and throttling workflows are in the Kubernetes & Helm guides and the prompt library.
Related prompts
-
Kubernetes CPU Throttling & CFS Quota Diagnosis Prompt
Diagnose latency spikes and tail-latency regressions caused by Linux CFS quota throttling even when average CPU utilization looks low, and right-size CPU requests/limits without over-provisioning.
-
Kubernetes GPU & Device Plugin Debug Prompt
Diagnose GPU scheduling — NVIDIA device plugin, MIG, scheduling, image/driver mismatch, pod stuck without GPU.
-
Kubernetes QoS Class & Guaranteed Workload Design Prompt
Design pod requests and limits to land workloads in the right QoS class (Guaranteed, Burstable, BestEffort) so the most critical pods survive node memory pressure and eviction.