NUMA-Aware Scheduling With the Kubernetes Topology Manager

A trading-system team once handed me a workload whose p99 latency was almost exactly double its p50, on hardware that should have been faster than that. The CPU graphs were unremarkable, memory had headroom, and the application code hadn’t changed. The problem turned out to live below Kubernetes entirely: the pod’s threads were running on one CPU socket while reaching across the interconnect to memory attached to the other socket. Every cache miss paid a NUMA tax. The scheduler had no idea, because as far as Kubernetes was concerned the pod fit and was running fine.

This is the class of problem the kubelet’s Topology Manager exists to solve, and the problem itself is invisible from the usual dashboards. Getting it right means configuring three kubelet managers that only work as a set, and accepting a real capacity trade-off in exchange for predictable latency.

The three managers that work together

NUMA alignment in Kubernetes isn’t one feature; it’s three cooperating ones:

CPU Manager with the static policy pins exclusive physical cores to Guaranteed-QoS pods so they don’t float across the machine.
Memory Manager with the Static policy reserves memory on a specific NUMA node for those pods.
Topology Manager is the coordinator: it collects hints from CPU Manager, Memory Manager, and device plugins, then decides whether the pod can be placed with everything on a single NUMA node.

Miss any one of these and you don’t get real alignment. The most common mistake is enabling Topology Manager but leaving CPU Manager on its default none policy: with no exclusive CPUs to align, the pod still floats.
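The "only works as a set" rule is easy to lint for. Here is a minimal sketch, assuming the kubelet config has already been loaded into a plain dict; the helper name numa_config_problems is hypothetical and not part of any Kubernetes tooling. It only checks the three policy fields discussed above, using the kubelet defaults for anything missing.

```python
def numa_config_problems(cfg: dict) -> list:
    """Return human-readable problems with a kubelet config dict (hypothetical helper)."""
    problems = []
    # Missing keys fall back to the kubelet defaults: none / None / none.
    if cfg.get("cpuManagerPolicy", "none") != "static":
        problems.append("cpuManagerPolicy is not static: no exclusive CPUs to align")
    if cfg.get("memoryManagerPolicy", "None") != "Static":
        problems.append("memoryManagerPolicy is not Static: memory is not pinned to a NUMA node")
    if cfg.get("topologyManagerPolicy", "none") == "none":
        problems.append("topologyManagerPolicy is none: nothing coordinates the hints")
    return problems


# The common mistake: Topology Manager enabled, the other two left at defaults.
for problem in numa_config_problems({"topologyManagerPolicy": "single-numa-node"}):
    print(problem)
```

An empty list means the three managers are at least configured as a set; it says nothing about whether the pods themselves are eligible for pinning.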

The QoS prerequisite nobody mentions

Here’s the gotcha that wastes the most time: static CPU pinning applies only to Guaranteed-QoS pods with integer CPU requests. The moment a request is fractional, the pod silently falls back to the shared CPU pool — no warning, no event, just no pinning.

# This pod gets exclusive, pinned cores
resources:
  requests:
    cpu: "4"          # integer
    memory: "8Gi"
  limits:
    cpu: "4"          # equal to requests -> Guaranteed
    memory: "8Gi"

# This pod does NOT — the 3500m request drops it to the shared pool
resources:
  requests:
    cpu: "3500m"      # fractional -> no exclusive cores, ever
    memory: "8Gi"
  limits:
    cpu: "3500m"
    memory: "8Gi"

If your “pinned” workload isn’t behaving, the first thing to check is whether it’s actually Guaranteed with integer CPU. In my experience, that is usually the bug.
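That check can be scripted. The following is a rough sketch for a single container’s resources block, with two stated simplifications: memory is compared as a raw string rather than a parsed quantity, and the rule that missing requests default to limits is ignored. The function names are made up for illustration.

```python
from decimal import Decimal


def cpu_millis(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "4", "0.5" or "3500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def gets_exclusive_cores(resources: dict) -> bool:
    """True if this container would be eligible for static CPU pinning (simplified)."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    if not all(key in requests and key in limits for key in ("cpu", "memory")):
        return False  # a missing request or limit rules out Guaranteed in this sketch
    same_cpu = cpu_millis(requests["cpu"]) == cpu_millis(limits["cpu"])
    same_memory = requests["memory"] == limits["memory"]  # string compare: a simplification
    if not (same_cpu and same_memory):
        return False  # Burstable, so no pinning
    # Guaranteed, but only whole CPUs leave the shared pool.
    return cpu_millis(requests["cpu"]) % 1000 == 0


pinned = {"requests": {"cpu": "4", "memory": "8Gi"}, "limits": {"cpu": "4", "memory": "8Gi"}}
shared = {"requests": {"cpu": "3500m", "memory": "8Gi"}, "limits": {"cpu": "3500m", "memory": "8Gi"}}
print(gets_exclusive_cores(pinned), gets_exclusive_cores(shared))
```

Run against the two manifests above, the first is eligible and the second is not, even though both pods are Guaranteed.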

Choosing a Topology Manager policy

The policy controls how strict alignment is, traded against how often pods get rejected:

Policy	Behavior
`none`	No alignment (default).
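`best-effort`	Prefer the narrowest NUMA alignment, but admit the pod even when that isn’t possible.
`restricted`	Reject the pod unless CPU/memory/device hints align on the smallest set of NUMA nodes that could satisfy the request; spanning nodes is allowed only when the request can’t fit on one.
`single-numa-node`	Reject the pod unless everything fits on exactly one NUMA node. Strictest, most rejections.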
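To make the policy differences concrete, here is a toy model of the decision. It is a simplified sketch and not the kubelet’s implementation: each hint provider (CPU Manager, Memory Manager, a device plugin) offers candidate NUMA node sets, each flagged as preferred when no narrower placement could satisfy the request, and the coordinator intersects one hint per provider. The Hint type and both function names are hypothetical, and providers with no opinion are not modelled.

```python
from itertools import product
from typing import NamedTuple


class Hint(NamedTuple):
    nodes: frozenset   # NUMA node IDs the placement would span
    preferred: bool    # True if no narrower placement could satisfy the request


def merge_hints(provider_hints):
    """Pick one hint per provider, intersect them, and keep the best result."""
    best = None
    for combo in product(*provider_hints):
        nodes = frozenset.intersection(*(h.nodes for h in combo))
        if not nodes:
            continue  # these providers cannot agree on any NUMA node
        merged = Hint(nodes, all(h.preferred for h in combo))
        rank = (merged.preferred, -len(merged.nodes))
        if best is None or rank > (best.preferred, -len(best.nodes)):
            best = merged
    return best


def admit(policy, provider_hints):
    """Decide admission for one pod under the given Topology Manager policy."""
    if policy in ("none", "best-effort"):
        return True  # best-effort still computes a hint, but never rejects
    if policy == "single-numa-node":
        # Only preferred single-node hints are considered at all.
        provider_hints = [
            [h for h in hints if h.preferred and len(h.nodes) == 1]
            for hints in provider_hints
        ]
    best = merge_hints(provider_hints)
    return best is not None and best.preferred


n0, n1, both = frozenset({0}), frozenset({1}), frozenset({0, 1})
memory = [Hint(n0, True), Hint(n1, True), Hint(both, False)]
gpu = [Hint(n1, True)]  # device attached to NUMA node 1

aligned = [[Hint(n1, True), Hint(both, False)], memory, gpu]   # node 1 has enough free cores
fragmented = [[Hint(both, False)], memory, gpu]                # cores only available across sockets
oversized = [[Hint(both, True)], [Hint(both, True)]]           # request larger than one socket

for policy in ("best-effort", "restricted", "single-numa-node"):
    print(policy, admit(policy, aligned), admit(policy, fragmented), admit(policy, oversized))
```

The oversized case is where restricted and single-numa-node part ways: spanning both nodes is the narrowest possible placement, so restricted admits it and single-numa-node does not.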

You set the policy in the kubelet config, along with the prerequisite manager policies and the system-core reservation that keeps daemons off your pinned cores:

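# kubelet config (KubeletConfiguration fragment)
cpuManagerPolicy: static
memoryManagerPolicy: Static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0-1"     # CPUs never handed out as exclusive cores
# The reservedMemory total must equal kubeReserved + systemReserved
# + the hard eviction threshold for memory, or the kubelet won't start.
reservedMemory:
  - numaNode: 0
    limits:
      memory: "1Gi"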

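reservedSystemCPUs is not optional in practice. The static CPU Manager policy won’t start without a non-zero CPU reservation, and without named cores set aside, the kubelet and system daemons compete for the cores you meant to dedicate, defeating the whole exercise. The setting keeps pinned pods off the reserved cores; it does not move the daemons onto them, so confining those is still a host-level job.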

Devices have topology too

If the workload needs a GPU or an SR-IOV NIC, NUMA alignment has to include that device. A GPU physically attached to socket 1 means a single-numa-node pod using it must also get its CPUs and memory from socket 1. The device plugin advertises topology hints, and Topology Manager folds them into the same alignment decision. This is the case where alignment matters most and fragmentation bites hardest — there may simply not be enough free cores on the GPU’s socket to satisfy the request.

Prompt: Here is a node with two sockets, 24 cores each, and a GPU on NUMA node 1. Here is a Guaranteed pod requesting 8 CPUs, 16Gi, and one GPU. Walk through what Topology Manager does under single-numa-node, what causes a TopologyAffinityError, and what kubelet config the node needs. Explanation and config only — no commands to apply.

Output (excerpt): The pod requires 8 cores + 16Gi + the GPU all on NUMA node 1. If node 1 has fewer than 8 free pinnable cores, the kubelet rejects the pod at admission and it is marked Failed with reason TopologyAffinityError. Node needs cpuManagerPolicy: static, memoryManagerPolicy: Static, topologyManagerPolicy: single-numa-node, and reservedSystemCPUs outside the GPU socket. Verify with kubectl describe node and the kubelet’s kubelet_topology_manager_admission_errors_total metric.

This is the kind of reasoning an AI assistant handles well: it knows the manager interactions and the QoS rules, and it produces config you review rather than apply. I keep it advisory because the change is genuinely disruptive. Editing kubelet config requires a kubelet restart, which can evict or reject running pods, and switching the CPU Manager policy on a live node also means draining it and removing the old cpu_manager_state file before the kubelet comes back. The model never touches a node; it explains, I drain and roll.

The capacity cliff

The price of strict alignment is admission failures. Under single-numa-node or restricted, a fragmented node (one where free cores are split across sockets) will reject pods it would happily run under none. That’s not a bug, it’s the policy working, but it reduces effective cluster capacity and shows up as TopologyAffinityError events. The default scheduler doesn’t see NUMA fragmentation, so it keeps placing pods on such nodes and the kubelet keeps rejecting them. Watch the admission-error metrics before you roll a strict policy fleet-wide, and stage it one node pool at a time. Pair this with sane QoS and Guaranteed-workload design so the pods you’re pinning are actually eligible for pinning in the first place.

# Watch for alignment rejections after enabling a strict policy
kubectl get events --field-selector reason=TopologyAffinityError -A
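The capacity cliff is easiest to see with numbers. Below is a back-of-the-envelope sketch, assuming you already know the free exclusive cores per NUMA node (the figures here are invented for illustration). The function fits_single_numa is hypothetical; its required_node argument models a device such as a GPU that fixes which socket must be used.

```python
def fits_single_numa(free_cores: dict, request: int, required_node=None):
    """Return a NUMA node that can host the whole request, or None if none can."""
    candidates = [required_node] if required_node is not None else sorted(free_cores)
    for node in candidates:
        if free_cores.get(node, 0) >= request:
            return node
    return None


# Fragmented node: 11 free cores in total, but split 6 / 5 across the two sockets.
fragmented = {0: 6, 1: 5}
print(sum(fragmented.values()) >= 8)       # the node "fits" 8 CPUs by plain arithmetic
print(fits_single_numa(fragmented, 8))     # but no single NUMA node can host them

# Device-constrained node: plenty of room on node 0, but the GPU lives on node 1.
gpu_node = {0: 12, 1: 5}
print(fits_single_numa(gpu_node, 8))                    # fine without the GPU
print(fits_single_numa(gpu_node, 8, required_node=1))   # rejected once the GPU pins the socket
```

The gap between the first two results is the capacity a strict policy gives up: the scheduler counts the node’s total free CPUs, while the kubelet needs them on one socket.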

Wrapping up

NUMA performance cliffs are real, expensive, and nearly invisible from standard metrics, but the fix is well-defined: enable CPU Manager, Memory Manager, and Topology Manager together, make sure your latency-sensitive pods are Guaranteed with integer CPU, reserve system cores, and account for the socket your devices live on. Then accept that strict alignment costs you some capacity and roll it out gradually while watching admission errors. Let an AI assistant reason through the manager interactions and draft the kubelet config; keep the node drains and restarts in human hands. More performance and scheduling deep-dives are in the Kubernetes & Helm guides, with reusable starting points in the prompt library.