Linux NUMA Imbalance Investigation Prompt

Diagnose NUMA-related performance issues — cross-node memory access, allocation imbalance, scheduler migration, and how to pin workloads to nodes.

Target user
Performance engineers and DBAs on multi-socket Linux servers
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior performance engineer who has tuned NUMA-aware workloads on dual-socket and quad-socket servers. You can read `numastat` and `lscpu` to spot cross-node traffic that's killing memory bandwidth.

I will provide:
- The symptom (DB latency p99 spike, throughput plateau, inconsistent benchmark results)
- Hardware: socket count, cores per socket, NUMA topology (`numactl --hardware`, `lscpu | grep -i numa`)
- Workload: single-process (DB, JVM) or many small processes? Threaded? Memory-resident size?
- Output of `numastat -m` (memory per node) and `numastat -p <pid>` (per-process)
- `/sys/fs/cgroup/.../cpuset.cpus` and `cpuset.mems` if cgroup-pinned

Your job:

1. **Map the topology**: how many NUMA nodes? Which CPUs belong to which node? How is memory distributed? Are there CPU-less nodes (rare; HBM systems)?
2. **Identify imbalance**:
   - **`numastat -m` columns** = per-node; check `MemFree`, `MemUsed`, `HugePages_Free`. Asymmetric usage hints at allocation drift.
   - **`numastat -p <pid>`** shows per-process breakdown of memory usage per node. A single-threaded process using memory from a node OTHER than where it's running = cross-node access.
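   - **`numa_miss` / `numa_foreign`** counters (in plain `numastat` with no arguments, or `/sys/devices/system/node/node*/numastat`; they are not part of `numastat -m`): these are cumulative since boot, so compare two samples. Steady growth means allocations intended for one node are being satisfied from another, typically because the preferred node ran out of free memory.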
3. **Common NUMA pathologies**:
   - **Memory not pinned** → kernel allocates from local node at first-touch; if thread migrates, the memory is now remote. Pin threads or pre-fault on the right node.
   - **First-touch policy + early init** → init thread on node 0 touches all pages; entire memory is on node 0; worker threads on node 1 access remotely.
   - **Interleaving used on a workload that fits in one node** → pages are spread round-robin, so on a 2-node box roughly half of all accesses go remote; use bind or preferred instead.
   - **NUMA-unaware DB (e.g., PostgreSQL pre-NUMA-aware versions)** → run with `numactl --interleave=all` for predictability.
   - **JVM heap larger than one node** → heap spans nodes; GC and application threads may access remote memory; use `-XX:+UseNUMA` (supported by G1 since JDK 14 and by Parallel GC) for NUMA-aware allocation.
   - **VM in a cloud with NUMA-pass-through** → tune like bare metal.
4. **For each suspect process** recommend the pinning strategy:
   - **`numactl --membind=N --cpunodebind=N <cmd>`** — strict binding (single-node)
   - **`numactl --interleave=all <cmd>`** — round-robin allocation (good for memory-heavy DB that doesn't easily partition)
   - **`numactl --preferred=N`** — prefer node N, fall back to others
   - **cgroup `cpuset`** for per-container binding
   - **`taskset -c` + numactl** — when thread affinity matters
5. **For multi-instance setups** (e.g., 2 DB instances on a 2-socket box):
   - One instance per node, each `numactl`-bound to its node — usually outperforms a single instance spanning sockets.
6. **Verify the fix** with `numastat -p <pid>` after pinning — `Total` should be concentrated on the bound node(s).
7. **`vm.zone_reclaim_mode` warning**: setting it to 1 makes the kernel reclaim pages on the local node before allocating from a remote one, which can stall allocations and evict useful page cache. The default of 0 is usually right; verify on your workload.

Mark DESTRUCTIVE: re-pinning a live database (may cause memory migration during a `numa_balancing` sweep), disabling `numa_balancing` cluster-wide.

---

Hardware:
```
[PASTE `numactl --hardware`]
```
Symptom + measurement: [DESCRIBE]
Workload context: [single process / many; threaded; memory size]
`numastat -m`:
```
[PASTE]
```
`numastat -p <pid>`:
```
[PASTE for relevant pid]
```
Current pinning (if any): [DESCRIBE]

Why this prompt works

NUMA effects are invisible in top and vmstat. They show up as slow application performance with idle CPU and unexplained p99 latency. numastat is one of the few places where they are obvious. This prompt forces a topology-aware walk and proposes specific pinning rather than vague advice.

How to use it

  1. Always include numactl --hardware — it shows nodes, CPUs per node, and inter-node distances.
  2. Single-process vs many-process workloads need different strategies — call out which.
  3. Capture numastat -p <pid> before AND after any pinning change — that’s the proof.
  4. Mention if it’s a VM — many cloud VMs are single-NUMA-node (UMA); tuning may be unnecessary.
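
The before/after comparison in step 3 is easier to read as percentages. The following is a minimal sketch, not part of the original prompt: `node_share` is a hypothetical helper name, and it assumes the single-PID `numastat -p` layout (a header row of `Node N` columns and a final `Total` row).

```shell
#!/bin/sh
# node_share: read `numastat -p <pid>` output on stdin and print each
# node's share of the process total, taken from the "Total" row.
# Assumption: the header row looks like "Node 0  Node 1  Total".
node_share() {
  awk '
    $1 == "Node" { n = 0; for (i = 2; i <= NF; i += 2) id[++n] = $i }
    $1 == "Total" && NF > 2 {
      total = $NF
      for (i = 2; i < NF; i++)
        printf "node%s %.1f%%\n", id[i-1], (total > 0 ? 100 * $i / total : 0)
    }'
}

# Usage on a NUMA host (run once before and once after re-pinning):
#   numastat -p <pid> | node_share
```

After a strict bind, the bound node should be close to 100%; pages that were allocated before the change stay where they are until they are migrated or the process restarts.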

Useful commands

# Topology
numactl --hardware
lscpu | grep -i numa
ls /sys/devices/system/node/
cat /sys/devices/system/node/node*/distance

# Per-node memory
numastat -m
free -h    # global view doesn't show per-node

# Per-process memory by node
numastat -p <pid>
cat /proc/<pid>/numa_maps | head    # per-mapping detail

# Per-node counters (kernel)
cat /sys/devices/system/node/node0/numastat   # numa_hit/miss/foreign

# Pinning
numactl --show                       # current policy and allowed CPUs/nodes for this shell
numactl --cpunodebind=0 --membind=0 ./app
numactl --interleave=all ./db-server
numactl --preferred=1 ./app
taskset -c 0-15 ./app                # CPU subset (cores 0-15)
taskset -cp 0-15 <pid>               # re-pin a running process (sched_setaffinity under the hood)

# Cgroup-level (containers)
echo 0 | sudo tee /sys/fs/cgroup/<slice>/cpuset.mems
echo 0-15 | sudo tee /sys/fs/cgroup/<slice>/cpuset.cpus

# Verify pinning took
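grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/<pid>/status   # affinity and cpuset limits
awk '{print $2}' /proc/<pid>/numa_maps | sort | uniq -c            # effective memory policy per mapping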
numastat -p <pid>

# Migration tracking (kernel)
cat /proc/<pid>/sched | grep numa
sudo head /sys/kernel/debug/sched/debug    # detailed sched view (/proc/sched_debug before kernel 5.13)

# Automatic NUMA balancing
cat /proc/sys/kernel/numa_balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing   # disable (with care)
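
The per-node counters listed above are raw cumulative numbers, so a ratio is easier to compare across nodes. A minimal sketch, assuming the `name value` layout of `/sys/devices/system/node/nodeN/numastat`; `numa_miss_pct` is a hypothetical helper name.

```shell
#!/bin/sh
# numa_miss_pct: read one node's numastat file on stdin and print
# numa_miss as a percentage of (numa_hit + numa_miss), i.e. the share of
# pages allocated on this node that were originally meant for another node.
numa_miss_pct() {
  awk '
    $1 == "numa_hit"  { hit  = $2 }
    $1 == "numa_miss" { miss = $2 }
    END {
      total = hit + miss
      if (total == 0) print "0.00"
      else printf "%.2f\n", 100 * miss / total
    }'
}

# Usage on a NUMA host:
#   for f in /sys/devices/system/node/node*/numastat; do
#     printf '%s: ' "$f"; numa_miss_pct < "$f"
#   done
```

The counters accumulate from boot, so a value sampled once mostly reflects history; sampling twice during the problem window and comparing the deltas is usually more telling.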

Pinning patterns

Single big-memory process (database)

# Option A: bind to one node (if memory fits)
numactl --cpunodebind=0 --membind=0 postgres -D /data

# Option B: interleave across all nodes (memory > one node, fair across)
numactl --interleave=all postgres -D /data
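
Choosing between Option A and Option B comes down to whether the working set fits in one node. A rough sketch of that check, assuming the usual `node N free: X MB` lines in `numactl --hardware` output; `nodes_with_free` is a hypothetical helper name.

```shell
#!/bin/sh
# nodes_with_free NEED_MB: read `numactl --hardware` output on stdin and
# print the IDs of nodes whose free memory is at least NEED_MB.
nodes_with_free() {
  awk -v need="$1" '
    $1 == "node" && $3 == "free:" && $4 + 0 >= need + 0 { print $2 }'
}

# Usage on a NUMA host (does any node have 48 GB free?):
#   numactl --hardware | nodes_with_free 49152
```

No output means no single node currently has that much free memory, which points toward Option B. The check is conservative because "free" does not count page cache that the kernel could reclaim.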

Multiple instances, one per node

# Instance 0 on node 0
numactl --cpunodebind=0 --membind=0 mysqld --port=3306 --datadir=/data/inst0
# Instance 1 on node 1
numactl --cpunodebind=1 --membind=1 mysqld --port=3307 --datadir=/data/inst1
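
On boxes with more than two nodes, the per-node launch lines can be generated from the topology instead of typed by hand. A minimal sketch, assuming the `node N cpus: ...` lines of `numactl --hardware`; `per_node_cmds` is a hypothetical helper that only prints commands and never runs them.

```shell
#!/bin/sh
# per_node_cmds CMD...: read `numactl --hardware` output on stdin and, for
# every node that has at least one CPU, print a numactl line binding CMD
# to that node. CPU-less (memory-only) nodes are skipped.
per_node_cmds() {
  awk -v cmd="$*" '
    $1 == "node" && $3 == "cpus:" && NF > 3 {
      printf "numactl --cpunodebind=%s --membind=%s %s\n", $2, $2, cmd
    }'
}

# Usage on a NUMA host:
#   numactl --hardware | per_node_cmds ./app
```

Each instance still needs its own port and data directory, as in the two mysqld lines above, so treat the output as a starting point to edit.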

systemd unit pinning

[Service]
ExecStart=/usr/bin/myapp
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0
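
A memory policy set this way (or with `numactl --membind`) does not change `Mems_allowed_list` in `/proc/<pid>/status`, which only reflects cpuset limits. One way to confirm the policy took effect is to summarise the policy column of `/proc/<pid>/numa_maps`. A minimal sketch; `policy_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# policy_summary: read a numa_maps file on stdin and count mappings per
# effective memory policy (second column: default, bind:0, interleave:0-1,
# prefer:1, ...), most common first.
policy_summary() {
  awk '{ count[$2]++ } END { for (p in count) print count[p], p }' | sort -rn
}

# Usage on a NUMA host (needs permission to read the target's /proc files):
#   policy_summary < /proc/<pid>/numa_maps
```

With `NUMAPolicy=bind` and `NUMAMask=0`, most mappings would be expected to report `bind:0`.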

Container (Docker/Podman)

docker run --cpuset-cpus=0-15 --cpuset-mems=0 myapp:latest
podman run --cpuset-cpus=0-15 --cpuset-mems=0 myapp:latest

# Kubernetes — requires CPUManager static policy + Topology Manager
# kubelet config:
#   cpuManagerPolicy: static
#   topologyManagerPolicy: single-numa-node
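
From the host, the cpuset a container actually received can be read from its cgroup directory. A minimal sketch, assuming cgroup v2 (`cpuset.cpus.effective` and `cpuset.mems.effective`); `cpuset_effective` is a hypothetical helper, and the directory path depends on the runtime and cgroup driver.

```shell
#!/bin/sh
# cpuset_effective DIR: print the effective CPU and memory-node sets of the
# cgroup at DIR, or "unavailable" when a file cannot be read (for example
# when the cpuset controller is not enabled for that cgroup).
cpuset_effective() {
  dir=$1
  for f in cpuset.cpus.effective cpuset.mems.effective; do
    if [ -r "$dir/$f" ]; then
      printf '%s=%s\n' "$f" "$(cat "$dir/$f")"
    else
      printf '%s=unavailable\n' "$f"
    fi
  done
}

# Usage on the host: the cgroup path is the part after "0::" in
# /proc/<pid>/cgroup, relative to /sys/fs/cgroup.
#   cpuset_effective "/sys/fs/cgroup/<slice>"
```

If the CPUs all sit on one node but `cpuset.mems.effective` lists several nodes, only the CPUs are pinned and memory can still land anywhere.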

Common findings this catches

  • High numa_foreign counter on node 0 → cross-node allocations; pin or interleave.
  • Single-threaded init touches all pages on node 0 → all memory on node 0; worker threads on node 1 access remotely. Solution: per-thread first-touch or interleave.
  • JVM with -Xmx32g on a 2-socket box with 16g per node → heap spans nodes. Use -XX:+UseNUMA for G1, or split into two JVMs.
  • Container with no cpuset.mems but cpuset.cpus=0-15 → memory free-for-all; pin both.
  • vm.zone_reclaim_mode=1 on a DB host → known bad; revert to 0.
  • Cloud VM showing 1 NUMA node → no benefit from pinning; drop the numactl wrapper.

Verifying with hardware counters

# perf: sample cache and node-level events that hint at NUMA cross-traffic
sudo perf stat \
  -e cache-misses,cache-references \
  -e node-loads,node-load-misses \
  -e node-stores,node-store-misses \
  -p <pid> sleep 30

A high node-load-misses / node-loads ratio suggests significant remote-node access. Not every CPU exposes these events; check perf list first.
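
The ratio can be computed from perf's CSV output rather than by eye. A minimal sketch, assuming `perf stat -x,` output where the first field is the count and the third is the event name; `node_miss_ratio` is a hypothetical helper name. What the events count differs between CPU models, so the number is better used for before/after comparison than as an absolute measure.

```shell
#!/bin/sh
# node_miss_ratio: read `perf stat -x,` CSV on stdin and print
# node-load-misses as a percentage of node-loads, or "n/a" when either
# event is missing or was not counted.
node_miss_ratio() {
  awk -F, '
    $3 ~ /^node-loads/       { loads  = $1 }
    $3 ~ /^node-load-misses/ { misses = $1 }
    END {
      if (loads !~ /^[0-9]+$/ || misses !~ /^[0-9]+$/ || loads == 0)
        print "n/a"
      else
        printf "%.2f\n", 100 * misses / loads
    }'
}

# Usage (perf stat writes its report to stderr, hence 2>&1):
#   sudo perf stat -x, -e node-loads -e node-load-misses \
#     -p <pid> sleep 30 2>&1 | node_miss_ratio
```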

When to escalate

  • Kernel numa_balancing causing periodic latency spikes: coordinate with the platform team and consider disabling it.
  • Memory hot-add / hot-remove on a NUMA system — engage hardware team; behavior is firmware-specific.
  • Hypervisor without NUMA pass-through is the bottleneck: request NUMA-aware placement from the cloud provider, or move to an instance class that exposes the NUMA topology.
