You are a senior performance engineer who has tuned NUMA-aware workloads on dual-socket and quad-socket servers. You can read `numastat` and `lscpu` to spot cross-node traffic that's killing memory bandwidth.

I will provide:
- The symptom (DB latency p99 spike, throughput plateau, inconsistent benchmark results)
- Hardware: socket count, cores per socket, NUMA topology (`numactl --hardware`, `lscpu | grep -i numa`)
- Workload: single-process (DB, JVM) or many small processes? Threaded? Memory-resident size?
- Output of `numastat -m` (memory per node) and `numastat -p <pid>` (per-process)
- `/sys/fs/cgroup/.../cpuset.cpus` and `cpuset.mems` if cgroup-pinned

Your job:

1. **Map the topology**: how many NUMA nodes? Which CPUs belong to which node? How is memory distributed? Are there CPU-less nodes (rare; HBM systems)?
2. **Identify imbalance**:
   - **`numastat -m` columns** = per-node; check `MemFree`, `MemUsed`, `HugePages_Free`. Asymmetric usage hints at allocation drift.
   - **`numastat -p <pid>`** shows the per-process breakdown of memory usage per node. A single-threaded process using memory from a node OTHER than the one it runs on = cross-node access.
   - **`numa_miss` / `numa_foreign`** counters (in plain `numastat` or `/sys/devices/system/node/node*/numastat`, not in `numastat -m`): non-zero and growing means allocations intended for one node were satisfied from another, typically because local memory ran out.
3. **Common NUMA pathologies**:
   - **Memory not pinned** → kernel allocates from the local node at first touch; if the thread migrates, the memory is now remote. Pin threads or pre-fault on the right node.
   - **First-touch policy + early init** → init thread on node 0 touches all pages; entire memory is on node 0; worker threads on node 1 access remotely.
   - **Interleaving** wasted on workloads that fit in one node → use bind or preferred.
   - **NUMA-unaware DB (e.g., PostgreSQL pre-NUMA-aware versions)** → run with `numactl --interleave=all` for predictability.
   - **JVM heap larger than one node** → heap spans nodes; G1/ZGC threads may access remote memory; use `-XX:+UseNUMA` (G1 on JDK 14 and later) for awareness.
   - **VM in a cloud with NUMA-pass-through** → tune like bare metal.
4. **For each suspect process** recommend the pinning strategy:
   - **`numactl --membind=N --cpunodebind=N <cmd>`**: strict binding (single-node)
   - **`numactl --interleave=all <cmd>`**: round-robin allocation (good for a memory-heavy DB that doesn't easily partition)
   - **`numactl --preferred=N`**: prefer node N, fall back to others
   - **cgroup `cpuset`** for per-container binding
   - **`taskset -c` + numactl**: when thread affinity matters
5. **For multi-instance setups** (e.g., 2 DB instances on a 2-socket box):
   - One instance per node, each `numactl`-bound to its node; this usually outperforms a single instance spanning sockets.
6. **Verify the fix** with `numastat -p <pid>` after pinning: `Total` should be concentrated on the bound node(s).
7. **`vm.zone_reclaim_mode` warning**: on some kernels, setting it to 1 causes aggressive same-node reclamation that slows allocation. Default 0 is usually right; verify on your workload.

Mark DESTRUCTIVE: re-pinning a live database (may cause memory migration during a `numa_balancing` sweep), disabling `numa_balancing` cluster-wide.

---

Hardware:
```
[PASTE `numactl --hardware`]
```

Symptom + measurement: [DESCRIBE]

Workload context: [single process / many; threaded; memory size]

`numastat -m`:
```
[PASTE]
```

`numastat -p <pid>`:
```
[PASTE for relevant pid]
```

Current pinning (if any): [DESCRIBE]

Why this prompt works

NUMA effects are invisible in top and vmstat: they show up as slow application performance with idle CPU and unexplained p99 latency. numastat is one of the few tools that makes them obvious. This prompt forces a topology-aware walk and proposes specific pinning rather than vague advice.

How to use it

Always include numactl --hardware — it shows nodes, CPUs per node, and inter-node distances.
Single-process vs many-process workloads need different strategies — call out which.
Capture numastat -p <pid> before AND after any pinning change — that’s the proof.
Mention if it’s a VM — many cloud VMs are single-NUMA-node (UMA); tuning may be unnecessary.
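The inputs listed above can be gathered in one pass. The following is a minimal sketch, not a supported tool: `collect_numa_inputs` and `run_section` are hypothetical helper names, and the function simply skips any command that is not installed.

```shell
#!/usr/bin/env bash
# Sketch: gather the inputs the prompt asks for into one report.
# collect_numa_inputs and run_section are hypothetical helper names.
collect_numa_inputs() {
  local pid="${1:-}"

  # Print a header, then run the command if its binary is installed.
  run_section() {
    echo "== $* =="
    if command -v "$1" >/dev/null 2>&1; then
      "$@" 2>&1 || echo "(command failed)"
    else
      echo "(skipped: $1 not installed)"
    fi
    echo
  }

  run_section numactl --hardware
  run_section lscpu
  run_section numastat -m
  if [ -n "$pid" ]; then
    run_section numastat -p "$pid"
    echo "== /proc/$pid/status (allowed CPUs and nodes) =="
    grep -i allowed "/proc/$pid/status" 2>/dev/null || echo "(unreadable)"
  fi
}
```

Calling `collect_numa_inputs <pid> > before.txt` ahead of a pinning change and again into `after.txt` afterwards gives the before/after pair to paste into the prompt.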

Useful commands

# Topology
numactl --hardware
lscpu | grep -i numa
ls /sys/devices/system/node/
cat /sys/devices/system/node/node*/distance

# Per-node memory
numastat -m
free -h    # global view doesn't show per-node

# Per-process memory by node
numastat -p <pid>
cat /proc/<pid>/numa_maps | head    # per-mapping detail

# Per-node counters (kernel)
cat /sys/devices/system/node/node0/numastat   # numa_hit/miss/foreign

# Pinning
numactl --show                       # NUMA policy of the current shell
numactl --cpunodebind=0 --membind=0 ./app
numactl --interleave=all ./db-server
numactl --preferred=1 ./app
taskset -c 0-15 ./app                # CPU subset (cores 0-15)
taskset -cp 0-15 <pid>               # re-pin a running process (uses sched_setaffinity)

# Cgroup-level (containers)
echo 0 | sudo tee /sys/fs/cgroup/<slice>/cpuset.mems
echo 0-15 | sudo tee /sys/fs/cgroup/<slice>/cpuset.cpus

# Verify pinning took
grep -i allowed /proc/<pid>/status   # Cpus_allowed_list / Mems_allowed_list
numastat -p <pid>

# Migration tracking (kernel)
cat /proc/<pid>/sched | grep numa
sudo head /sys/kernel/debug/sched_debug    # detailed sched view; newer kernels expose /sys/kernel/debug/sched/debug instead

# Automatic NUMA balancing
cat /proc/sys/kernel/numa_balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing   # disable (with care)
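The per-node kernel counters are easier to compare once they are reduced to one line per node. The sketch below assumes the usual `name value` layout of `/sys/devices/system/node/node*/numastat`; `node_miss_report` is a hypothetical helper, and its optional argument only exists so it can be pointed at a copied tree.

```shell
#!/usr/bin/env bash
# Sketch: one summary line per node from the kernel numastat counters.
# node_miss_report is a hypothetical helper; the optional first argument
# overrides the sysfs root so the function can be tried on a copied tree.
node_miss_report() {
  local root="${1:-/sys/devices/system/node}"
  local f node
  for f in "$root"/node[0-9]*/numastat; do
    [ -r "$f" ] || continue
    node="$(basename "$(dirname "$f")")"
    awk -v node="$node" '
      BEGIN { hit = 0; miss = 0; foreign = 0 }
      $1 == "numa_hit"     { hit = $2 }
      $1 == "numa_miss"    { miss = $2 }
      $1 == "numa_foreign" { foreign = $2 }
      END {
        # Share of pages allocated on this node that were meant for another.
        total = hit + miss
        pct = (total > 0) ? 100 * miss / total : 0
        printf "%s hit=%s miss=%s foreign=%s miss_pct=%.2f\n",
               node, hit, miss, foreign, pct
      }' "$f"
  done
}
```

The counters are cumulative since boot, so two runs a few minutes apart say more than a single reading.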

Pinning patterns

Single big-memory process (database)

# Option A: bind to one node (if memory fits)
numactl --cpunodebind=0 --membind=0 postgres -D /data

# Option B: interleave across all nodes (working set larger than one node; spreads pages evenly)
numactl --interleave=all postgres -D /data
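Choosing between the two options comes down to whether the working set fits in one node. A rough sketch of that check follows; `suggest_policy` is a hypothetical helper that only prints a suggestion, and it uses each node's `MemFree` as a deliberately conservative measure (reclaimable page cache is ignored).

```shell
#!/usr/bin/env bash
# Sketch: does a working set of NEED_KB fit in the free memory of one node?
# suggest_policy is a hypothetical helper; it prints a suggestion and
# changes nothing. The optional second argument overrides the sysfs root.
suggest_policy() {
  local need_kb="$1"
  local root="${2:-/sys/devices/system/node}"
  local f node free_kb
  for f in "$root"/node[0-9]*/meminfo; do
    [ -r "$f" ] || continue
    node="$(basename "$(dirname "$f")")"
    node="${node#node}"
    # Lines look like: "Node 0 MemFree:  123456 kB"
    free_kb="$(awk '$3 == "MemFree:" { print $4 }' "$f")"
    if [ "${free_kb:-0}" -ge "$need_kb" ]; then
      echo "bind: numactl --cpunodebind=$node --membind=$node"
      return 0
    fi
  done
  echo "interleave: numactl --interleave=all"
}
```

For example, `suggest_policy $((24 * 1024 * 1024))` asks whether 24 GiB fits on any single node.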

Multiple instances, one per node

# Instance 0 on node 0
numactl --cpunodebind=0 --membind=0 mysqld --port=3306 --datadir=/data/inst0
# Instance 1 on node 1
numactl --cpunodebind=1 --membind=1 mysqld --port=3307 --datadir=/data/inst1
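On boxes with more than two nodes the same pattern can be generated rather than typed. This is a dry-run sketch only: `print_instance_cmds` is a hypothetical helper, the port and data directory scheme is assumed from the two-instance example above, and nodes without CPUs are skipped.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) one bound mysqld command per NUMA node.
# print_instance_cmds is a hypothetical helper; ports and data
# directories are assumed to follow the two-instance example.
print_instance_cmds() {
  local root="${1:-/sys/devices/system/node}"
  local base_port=3306
  local d node cpus
  for d in "$root"/node[0-9]*; do
    [ -d "$d" ] || continue
    # Skip CPU-less nodes: nothing could be scheduled there.
    cpus="$(cat "$d/cpulist" 2>/dev/null)" || cpus=""
    [ -n "$cpus" ] || continue
    node="${d##*/node}"
    echo "numactl --cpunodebind=$node --membind=$node" \
         "mysqld --port=$((base_port + node)) --datadir=/data/inst$node"
  done
}
```

Review the printed lines before running any of them; the function itself starts nothing.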

systemd unit pinning

[Service]
ExecStart=/usr/bin/myapp
CPUAffinity=0-15
NUMAPolicy=bind
NUMAMask=0

Container (Docker/Podman)

docker run --cpuset-cpus=0-15 --cpuset-mems=0 myapp:latest
podman run --cpuset-cpus=0-15 --cpuset-mems=0 myapp:latest

# Kubernetes — requires CPUManager static policy + Topology Manager
# kubelet config:
#   cpuManagerPolicy: static
#   topologyManagerPolicy: single-numa-node
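Whether a container's pinning actually took effect can be read back from its cgroup. The sketch below assumes cgroup v2 and its `cpuset.cpus.effective` / `cpuset.mems.effective` files; `check_cpuset` is a hypothetical helper, and the warning is a heuristic that fires whenever more than one memory node is allowed.

```shell
#!/usr/bin/env bash
# Sketch: report the effective cpuset of a cgroup v2 directory and warn
# when memory is allowed on more than one node.
# check_cpuset is a hypothetical helper; pass the cgroup directory.
check_cpuset() {
  local cg="${1:-/sys/fs/cgroup}"
  local cpus mems
  cpus="$(cat "$cg/cpuset.cpus.effective" 2>/dev/null)" || cpus=""
  mems="$(cat "$cg/cpuset.mems.effective" 2>/dev/null)" || mems=""
  echo "cpus=${cpus:-unknown} mems=${mems:-unknown}"
  # A list ("0,1") or range ("0-1") means several nodes are allowed.
  case "$mems" in
    *[,-]*) echo "WARN: memory may be allocated on several nodes ($mems)" ;;
  esac
}
```

Run inside the container, `check_cpuset` with no argument reads the container's own cgroup root; on the host, pass the slice directory.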

Common findings this catches

High numa_foreign counter on node 0 → allocations intended for node 0 were satisfied from other nodes, usually because node 0 ran out of free memory; pin or interleave.
Single-threaded init touches all pages on node 0 → all memory on node 0; worker threads on node 1 access remotely. Solution: per-thread first-touch or interleave.
JVM with -Xmx32g on a 2-socket box with 16g per node → heap spans nodes. Use -XX:+UseNUMA for G1, or split into two JVMs.
Container with no cpuset.mems but cpuset.cpus=0-15 → memory free-for-all; pin both.
vm.zone_reclaim_mode=1 on a DB host → known bad; revert to 0.
Cloud VM showing 1 NUMA node → no benefit from pinning; remove the numactl wrappers, which only add complexity.

Verifying with hardware counters

# perf — measure cache misses suggesting NUMA cross-traffic
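# One -e per group: a backslash-newline inside a comma list would split it into two arguments
sudo perf stat \
  -e cache-misses,cache-references \
  -e node-loads,node-load-misses \
  -e node-stores,node-store-misses \
  -p <pid> -- sleep 30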

High node-load-misses / node-loads ratio = significant remote-node access.
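The ratio can be computed directly from perf's machine-readable output. A small sketch, assuming the `perf stat -x,` CSV layout of count, unit, event name, and further fields; `node_load_miss_pct` is a hypothetical helper, and CPUs that do not expose these events (or that prefix them with a PMU name) will yield `n/a`.

```shell
#!/usr/bin/env bash
# Sketch: turn `perf stat -x,` CSV lines (read from stdin) into a
# node-load miss percentage. node_load_miss_pct is a hypothetical helper.
# Assumed field layout: count,unit,event,... (event may carry a :u/:k suffix).
node_load_miss_pct() {
  awk -F, '
    { sub(/:.*/, "", $3) }                 # drop modifiers such as :u
    $3 == "node-loads"       { loads = $1 }
    $3 == "node-load-misses" { misses = $1 }
    END {
      # Missing or unsupported counters evaluate to 0 here.
      if (loads + 0 == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * misses / loads
    }'
}
```

perf writes its statistics to stderr, so a typical invocation would be `sudo perf stat -x, -e node-loads,node-load-misses -p <pid> -- sleep 30 2>&1 | node_load_miss_pct`.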

When to escalate

Kernel numa_balancing causing periodic latency spikes: coordinate with the platform team; consider disabling it.
Memory hot-add / hot-remove on a NUMA system — engage hardware team; behavior is firmware-specific.
Hypervisor without NUMA pass-through is the bottleneck: request NUMA-aware placement from the cloud provider, or move to an instance class with NUMA pass-through.
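Before escalating a numa_balancing suspicion, it helps to show that the balancer is actually active. The sketch below reads the automatic-balancing counters from `/proc/vmstat`; `numa_balancing_activity` is a hypothetical helper, and the counters are only present on kernels built with automatic NUMA balancing.

```shell
#!/usr/bin/env bash
# Sketch: show automatic NUMA balancing activity from a vmstat-style file.
# numa_balancing_activity is a hypothetical helper; the argument defaults
# to /proc/vmstat and exists so a saved snapshot can be inspected too.
numa_balancing_activity() {
  local vmstat="${1:-/proc/vmstat}"
  awk '
    BEGIN { faults = 0; local_faults = 0; migrated = 0 }
    $1 == "numa_hint_faults"       { faults = $2 }
    $1 == "numa_hint_faults_local" { local_faults = $2 }
    $1 == "numa_pages_migrated"    { migrated = $2 }
    END {
      printf "hint_faults=%s local=%s migrated=%s\n",
             faults, local_faults, migrated
    }' "$vmstat"
}
```

The values are cumulative, so take one reading before and one during a latency spike; a jump in migrated pages that lines up with the spike is the evidence to bring to the platform team.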

Linux NUMA Imbalance Investigation Prompt

Related prompts

Linux Block I/O Performance Investigation Prompt

Linux Context Switch & Lock Contention Diagnosis Prompt

Linux High Load & CPU Runaway Investigation Prompt
