Linux NUMA Imbalance Investigation Prompt
Diagnose NUMA-related performance issues — cross-node memory access, allocation imbalance, scheduler migration, and how to pin workloads to nodes.
- Target user: Performance engineers and DBAs on multi-socket Linux servers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior performance engineer who has tuned NUMA-aware workloads on dual-socket and quad-socket servers. You can read `numastat` and `lscpu` to spot cross-node traffic that's killing memory bandwidth.

I will provide:
- The symptom (DB latency p99 spike, throughput plateau, inconsistent benchmark results)
- Hardware: socket count, cores per socket, NUMA topology (`numactl --hardware`, `lscpu | grep -i numa`)
- Workload: single-process (DB, JVM) or many small processes? Threaded? Memory-resident size?
- Output of `numastat` (per-node hit/miss counters), `numastat -m` (memory per node), and `numastat -p <pid>` (per-process)
- `/sys/fs/cgroup/.../cpuset.cpus` and `cpuset.mems` if cgroup-pinned

Your job:

1. **Map the topology**: how many NUMA nodes? Which CPUs belong to which node? How is memory distributed? Are there CPU-less nodes (rare; HBM systems)?
2. **Identify imbalance**:
   - **`numastat -m` columns** = per-node; check `MemFree`, `MemUsed`, `HugePages_Free`. Asymmetric usage hints at allocation drift.
   - **`numastat -p <pid>`** shows the per-process breakdown of memory usage per node. A single-threaded process using memory from a node OTHER than the one it runs on = cross-node access.
   - **`numa_miss` / `numa_foreign`** counters (shown by plain `numastat`, not `numastat -m`): non-zero and growing means allocations could not be satisfied on the preferred node and landed on a remote one.
3. **Common NUMA pathologies**:
   - **Memory not pinned** → kernel allocates from the local node at first touch; if the thread migrates, the memory is now remote. Pin threads or pre-fault on the right node.
   - **First-touch policy + early init** → init thread on node 0 touches all pages; entire memory is on node 0; worker threads on node 1 access remotely.
   - **Interleaving wasted on workloads that fit one node** → use bind or preferred.
   - **NUMA-unaware DB (e.g., PostgreSQL pre-NUMA-aware versions)** → run with `numactl --interleave=all` for predictability.
   - **JVM heap larger than one node** → heap spans nodes; G1/ZGC threads may access remote memory; use `-XX:+UseNUMA` for awareness.
   - **VM in a cloud with NUMA pass-through** → tune like bare metal.
4. **For each suspect process** recommend the pinning strategy:
   - **`numactl --membind=N --cpunodebind=N <cmd>`**: strict binding (single-node)
   - **`numactl --interleave=all <cmd>`**: round-robin allocation (good for a memory-heavy DB that doesn't easily partition)
   - **`numactl --preferred=N`**: prefer node N, fall back to others
   - **cgroup `cpuset`** for per-container binding
   - **`taskset -c` + numactl**: when thread affinity matters
5. **For multi-instance setups** (e.g., 2 DB instances on a 2-socket box):
   - One instance per node, each `numactl`-bound to its node; this usually outperforms a single instance spanning sockets.
6. **Verify the fix** with `numastat -p <pid>` after pinning: `Total` should be concentrated on the bound node(s).
7. **`vm.zone_reclaim_mode` warning**: on some kernels, setting it to 1 causes aggressive same-node reclaim that slows allocation. Default 0 is usually right; verify on your workload.

Mark DESTRUCTIVE: re-pinning a live database (may cause memory migration during a `numa_balancing` sweep), disabling `numa_balancing` cluster-wide.

---

Hardware:
```
[PASTE `numactl --hardware`]
```

Symptom + measurement: [DESCRIBE]

Workload context: [single process / many; threaded; memory size]

`numastat`:
```
[PASTE]
```

`numastat -m`:
```
[PASTE]
```

`numastat -p <pid>`:
```
[PASTE for relevant pid]
```

Current pinning (if any): [DESCRIBE]
Why this prompt works
NUMA effects are invisible in top and vmstat: they show up as slow application performance with idle CPU and unexplained p99 latency. numastat is one of the few places where they’re obvious. This prompt forces a topology-aware walk and proposes specific pinning rather than vague advice.
How to use it
- Always include `numactl --hardware`: it shows nodes, CPUs per node, and inter-node distances.
- Single-process vs many-process workloads need different strategies; call out which.
- Capture `numastat -p <pid>` before AND after any pinning change; that’s the proof.
- Mention if it’s a VM: many cloud VMs are single-NUMA-node (UMA); tuning may be unnecessary.
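The before/after comparison is easier to judge as a per-node share of the process total. The following is a minimal sketch, assuming the usual single-PID `numastat -p` layout (a header row of `Node N` columns and a final `Total` row with one value per node followed by the grand total). `node_share` is a hypothetical helper name, not part of the numactl package.

```shell
# node_share: read `numastat -p <pid>` output on stdin and print each
# node's share of the process total (values are in MB).
# Assumes a "Node 0  Node 1 ... Total" header and a final "Total" row.
node_share() {
    awk '
        # Header row: collect node ids from "Node <id>" pairs.
        $1 == "Node"  { k = 0; for (i = 2; i <= NF; i += 2) id[++k] = $i }
        # Total row: per-node values, then the grand total in the last field.
        $1 == "Total" { n = NF; for (i = 2; i <= NF; i++) v[i] = $i }
        END {
            if (n < 3 || v[n] <= 0) {
                print "no usable Total row" > "/dev/stderr"
                exit 1
            }
            for (i = 2; i < n; i++)
                printf "node%s %.1f%%\n", id[i - 1], 100 * v[i] / v[n]
        }
    '
}

# Usage (1234 is a placeholder pid):
#   numastat -p 1234 | node_share
```

Run it once before and once after the pinning change; after a strict bind, nearly all of the share should sit on the bound node.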
Useful commands
# Topology
numactl --hardware
lscpu | grep -i numa
ls /sys/devices/system/node/
cat /sys/devices/system/node/node*/distance
# Per-node memory
numastat -m
free -h # global view doesn't show per-node
# Per-process memory by node
numastat -p <pid>
cat /proc/<pid>/numa_maps | head # per-mapping detail
# Per-node counters (kernel)
cat /sys/devices/system/node/node0/numastat # numa_hit/miss/foreign
# Pinning
numactl --show # policy and allowed nodes/CPUs of the current shell
numactl --cpunodebind=0 --membind=0 ./app
numactl --interleave=all ./db-server
numactl --preferred=1 ./app
taskset -c 0-15 ./app # CPU subset (cores 0-15)
taskset -cp 0-15 <pid> # re-pin a running process (uses sched_setaffinity)
# Cgroup-level (containers)
echo 0 | sudo tee /sys/fs/cgroup/<slice>/cpuset.mems
echo 0-15 | sudo tee /sys/fs/cgroup/<slice>/cpuset.cpus
# Verify pinning took
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/<pid>/status
numastat -p <pid>
# Migration tracking (kernel)
cat /proc/<pid>/sched | grep numa
sudo head -n 40 /sys/kernel/debug/sched/debug # detailed sched view (kernel 5.13+; older kernels: /proc/sched_debug)
# Automatic NUMA balancing
cat /proc/sys/kernel/numa_balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing # disable (with care)
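The per-node counters under `/sys/devices/system/node/node*/numastat` are cumulative since boot, so a ratio is easier to read than raw numbers. The sketch below is one possible way to summarize them; `numa_miss_pct` is a hypothetical helper, and the optional root argument exists only so the function can be pointed at a copy of the files.

```shell
# numa_miss_pct [SYSFS_ROOT]
# For each node, print numa_miss as a percentage of (numa_hit + numa_miss),
# i.e. the share of pages allocated on that node that were intended for
# another node. Defaults to the live sysfs tree.
numa_miss_pct() {
    root=${1:-/sys/devices/system/node}
    for f in "$root"/node*/numastat; do
        [ -r "$f" ] || continue
        node=$(basename "$(dirname "$f")")
        awk -v node="$node" '
            $1 == "numa_hit"  { hit = $2 }
            $1 == "numa_miss" { miss = $2 }
            END {
                total = hit + miss
                pct = (total > 0) ? 100 * miss / total : 0
                printf "%s numa_miss=%.0f (%.2f%% of allocations)\n", node, miss, pct
            }
        ' "$f"
    done
}

# Usage:
#   numa_miss_pct
```

Because the counters never reset, sample twice a few minutes apart and compare the deltas when judging current behavior.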
Pinning patterns
Single big-memory process (database)
# Option A: bind to one node (if memory fits)
numactl --cpunodebind=0 --membind=0 postgres -D /data
# Option B: interleave across all nodes (working set larger than one node; pages spread evenly)
numactl --interleave=all postgres -D /data
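The choice between Option A and Option B comes down to whether the working set fits in one node. A rough sketch of that check follows; `suggest_policy` is a hypothetical helper, and the 80% headroom figure is an illustrative assumption, not a kernel or numactl rule.

```shell
# suggest_policy NEED_MB NODE_MB
# Prints "bind" when the working set fits in one node with headroom,
# otherwise "interleave". The 80% threshold is an assumption for
# illustration; adjust it to your own safety margin.
suggest_policy() {
    need_mb=$1
    node_mb=$2
    if [ $((need_mb * 100)) -le $((node_mb * 80)) ]; then
        echo bind
    else
        echo interleave
    fi
}

# Usage: read node 0 size from sysfs (kB -> MB); the working-set size
# (48000 MB here) is a placeholder you supply.
#   node_mb=$(awk '/MemTotal/ { print int($4 / 1024) }' /sys/devices/system/node/node0/meminfo)
#   suggest_policy 48000 "$node_mb"
```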
Multiple instances, one per node
# Instance 0 on node 0
numactl --cpunodebind=0 --membind=0 mysqld --port=3306 --datadir=/data/inst0
# Instance 1 on node 1
numactl --cpunodebind=1 --membind=1 mysqld --port=3307 --datadir=/data/inst1
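On boxes with more than two nodes, the same pattern can be generated rather than typed. This is a small sketch that only prints the launch lines; `per_node_cmds` is a hypothetical helper, it assumes node ids are contiguous from 0, and the ports and data directories simply extend the placeholder naming used above.

```shell
# per_node_cmds NODES BASE_PORT
# Print one numactl launch line per NUMA node (does not execute anything).
# Assumes nodes are numbered 0..NODES-1; paths and ports are placeholders.
per_node_cmds() {
    nodes=$1
    base_port=$2
    n=0
    while [ "$n" -lt "$nodes" ]; do
        echo "numactl --cpunodebind=$n --membind=$n mysqld --port=$((base_port + n)) --datadir=/data/inst$n"
        n=$((n + 1))
    done
}

# Usage: count the node directories in sysfs, then review the output
# before running any of it.
#   per_node_cmds "$(ls -d /sys/devices/system/node/node[0-9]* | wc -l)" 3306
```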
systemd unit pinning
[Service]
ExecStart=/usr/bin/myapp
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0
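The unit above assumes CPUs 0-15 all belong to node 0. With hyperthreading the kernel often numbers sibling threads in a second range (for example `0-15,32-47`), so it is worth checking before hard-coding `CPUAffinity`. A minimal sketch, with hypothetical helper names and an optional sysfs root argument used only for testing against a copy:

```shell
# node_cpulist NODE [SYSFS_ROOT]
# Print the CPU list the kernel reports for a NUMA node, e.g. "0-15,32-47".
node_cpulist() {
    root=${2:-/sys/devices/system/node}
    cat "$root/node$1/cpulist"
}

# affinity_matches_node NODE CPULIST [SYSFS_ROOT]
# Succeeds only when CPULIST is exactly the node's CPU list. This is a
# deliberately strict string comparison; a proper subset of the node's
# CPUs is also a valid affinity but will not match here.
affinity_matches_node() {
    [ "$(node_cpulist "$1" "$3")" = "$2" ]
}

# Usage:
#   node_cpulist 0
#   affinity_matches_node 0 "0-15" || echo "CPUAffinity differs from node 0 cpulist"
```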
Container (Docker/Podman)
docker run --cpuset-cpus=0-15 --cpuset-mems=0 myapp:latest
podman run --cpuset-cpus=0-15 --cpuset-mems=0 myapp:latest
# Kubernetes — requires CPUManager static policy + Topology Manager
# kubelet config:
# cpuManagerPolicy: static
# topologyManagerPolicy: single-numa-node
Common findings this catches
- High `numa_foreign` counter on node 0 → cross-node allocations; pin or interleave.
- Single-threaded init touches all pages on node 0 → all memory on node 0; worker threads on node 1 access remotely. Solution: per-thread first-touch or interleave.
- JVM with `-Xmx32g` on a 2-socket box with 16g per node → heap spans nodes. Use `-XX:+UseNUMA` (honored by G1 from JDK 14), or split into two JVMs.
- Container with no `cpuset.mems` but `cpuset.cpus=0-15` → memory free-for-all; pin both.
- `vm.zone_reclaim_mode=1` on a DB host → known bad; revert to 0.
- Cloud VM showing 1 NUMA node → no benefit from pinning; drop the `numactl` wrapper.
Verifying with hardware counters
# perf — measure cache misses suggesting NUMA cross-traffic
sudo perf stat -e \
cache-misses,cache-references,\
node-loads,node-load-misses,\
node-stores,node-store-misses \
-p <pid> sleep 30
A high node-load-misses / node-loads ratio suggests significant remote-node access. The node-* events are not exposed on every CPU or VM; check perf list first.
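The ratio can be pulled straight out of the `perf stat` report. The sketch below assumes the default human-readable output, where each line starts with a count (possibly with thousands separators) followed by the event name; `node_load_miss_pct` is a hypothetical helper. Event names with modifiers such as `node-loads:u` would need the match adjusted.

```shell
# node_load_miss_pct: read `perf stat` output on stdin and print
# node-load-misses as a percentage of node-loads.
node_load_miss_pct() {
    awk '
        # perf prints counts like 1,234,567; strip the separators.
        function num(s) { gsub(/,/, "", s); return s + 0 }
        $2 == "node-loads"       { loads = num($1) }
        $2 == "node-load-misses" { misses = num($1) }
        END {
            if (loads <= 0) {
                print "node-loads not reported" > "/dev/stderr"
                exit 1
            }
            printf "%.1f%%\n", 100 * misses / loads
        }
    '
}

# Usage: perf stat writes its report to stderr, so redirect it into the pipe.
#   sudo perf stat -e node-loads,node-load-misses -p <pid> sleep 30 2>&1 | node_load_miss_pct
```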
When to escalate
- Kernel `numa_balancing` causing periodic latency spikes: coordinate with the platform team; consider disabling.
- Memory hot-add / hot-remove on a NUMA system: engage the hardware team; behavior is firmware-specific.
- VM hypervisor without NUMA pass-through that’s the bottleneck: request NUMA-aware placement from the cloud provider, or move to a NUMA-passthrough class.
Related prompts
- Linux Block I/O Performance Investigation Prompt: Diagnose slow disk I/O, high iowait, queue depth saturation, and storage performance regressions using iostat, blktrace, fio, and per-device metrics.
- Linux Context Switch & Lock Contention Diagnosis Prompt: Diagnose context-switch storms, futex contention, kernel-level lock waits, and CPU scheduling pathologies that masquerade as 'app is slow.'
- Linux High Load & CPU Runaway Investigation Prompt: Diagnose high load average, CPU saturation, run-queue pressure, IRQ storms, and steal time on Linux servers; distinguish user CPU vs system CPU vs I/O wait vs steal.