Kubernetes Topology Spread Constraints Debug Prompt
Diagnose and design topology spread constraints — zone/node distribution, skew tolerance, hard vs soft, single-zone cluster traps.
- Target user: Kubernetes engineers ensuring HA workload placement
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has designed multi-zone HA workload placement with `topologySpreadConstraints`. You know that misconfigured spread (`DoNotSchedule` + tiny cluster + low `maxSkew`) is a self-induced FailedScheduling.

I will provide:
- The workload (Deployment/StatefulSet name)
- Cluster topology: zones and node count per zone (`kubectl get nodes --show-labels | grep topology.kubernetes.io/zone`)
- Current pod placement (`kubectl get pods -l <selector> -o wide`)
- The `topologySpreadConstraints` block from the pod spec
- The symptom: pods stuck Pending, uneven distribution, scaling causes Pending

Your job:

1. **Decode the constraint**:
   - **`topologyKey`**: the node label defining the bucket (e.g., `topology.kubernetes.io/zone` for AZ, `kubernetes.io/hostname` for node)
   - **`maxSkew`**: max difference in matching-pod count between buckets
   - **`whenUnsatisfiable`**: `DoNotSchedule` (hard; rejects the pod if violated) or `ScheduleAnyway` (soft; prefers but allows)
   - **`labelSelector`**: pods to count when computing skew
   - **`minDomains`** (1.27+): minimum number of eligible buckets expected; only valid with `DoNotSchedule`. If fewer buckets exist, the global minimum is treated as 0. Useful for new clusters where zones haven't been used yet
2. **Compute the current skew**:
   - Group pods matching `labelSelector` by topology label value; eligible buckets with no matching pods count as 0
   - Skew = max(group counts) - min(group counts)
   - Adding a new pod: which bucket does the scheduler prefer? The smallest one that satisfies all other filters
3. **For "pods Pending after scale-up"**:
   - The spread is already at `maxSkew` and the only buckets with free capacity are the fullest ones, so placing the new pod would increase skew further → blocked
   - Single-zone cluster + `topologyKey: topology.kubernetes.io/zone` + `maxSkew: 1`, `DoNotSchedule`: with a single bucket the skew is always 0, so the constraint passes but spreads nothing. The trap is `minDomains` above the zone count: the global minimum is treated as 0, so each zone accepts at most `maxSkew` pods and the rest stay Pending
   - Nodes missing the `topologyKey` label are bypassed: pods on them are not counted and new pods cannot land there
   - Fix: `ScheduleAnyway`, raise `maxSkew`, lower or remove `minDomains`, or add more zones
4. **For "skew higher than expected"**:
   - `nodeAffinity` excluding nodes from a bucket → fewer schedulable nodes there
   - `whenUnsatisfiable: ScheduleAnyway` lets the scheduler exceed `maxSkew` under pressure
   - Other constraints conflict: `podAntiAffinity` or other filters remove candidate nodes (a node must pass every filter; none takes precedence)
5. **For combining multiple constraints**:
   - Multiple `topologySpreadConstraints` entries all apply
   - Common: spread across zones AND nodes (`zone` constraint + `hostname` constraint)
   - All must be satisfied (effectively AND)
6. **For interaction with HPA**:
   - HPA scales replicas; new pods must fit the spread
   - Going from 3 → 4 replicas with 3 zones: where does pod 4 go? Any zone; the distribution becomes 2-1-1 (skew 1).
   - Going from 4 → 5: one of the zones holding 1 pod; the distribution becomes 2-2-1 (skew 1). OK if `maxSkew` ≥ 1.
7. **For init / migration patterns**:
   - **`minDomains`** ensures enough zones exist (avoids 1-zone init that locks future spread)
   - For day-1 single-zone clusters that plan to be multi-zone: start with `ScheduleAnyway`, switch to `DoNotSchedule` after expanding

Mark DESTRUCTIVE: changing `maxSkew` live (existing pods don't move; new pods may face unexpected constraints), `whenUnsatisfiable: DoNotSchedule` without verifying buckets exist (locks future scheduling).

---

Workload: [Deployment/StatefulSet + namespace]
Current pod count + intended replicas: [DESCRIBE]
Cluster zone topology: [PASTE `kubectl get nodes -L topology.kubernetes.io/zone`]
Pods now (with zone): [PASTE `kubectl get pods -l <selector> -o wide`]
Spread constraints from pod spec:
```yaml
[PASTE topologySpreadConstraints]
```
Symptom: [DESCRIBE]
Why this prompt works
Topology spread is powerful, but a misconfigured hard constraint leaves pods Pending indefinitely. The cluster might look big, but if the constraint requires zones (or per-zone capacity) you don’t have, the scheduler refuses to place the pod. This prompt enforces a topology-aware diagnosis.
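To make that refusal concrete, the hard-constraint check can be approximated as: a pod may land in a domain only if (matching pods already there + 1) minus the global minimum stays within `maxSkew`. The sketch below is a simplified model of that rule, not the scheduler's code. The function name `can_place` and its argument order are hypothetical, and it ignores `minDomains`, taints, resources, and affinity.

```shell
# can_place MAX_SKEW TARGET_COUNT COUNT...
#   TARGET_COUNT: matching pods already in the candidate domain
#   COUNT...:     matching pods in every eligible domain (including the candidate)
# Exit status 0 means the placement keeps skew within MAX_SKEW.
can_place() (
  max_skew=$1; target_count=$2; shift 2
  # Global minimum across all eligible domains.
  min=$1
  for c in "$@"; do
    if [ "$c" -lt "$min" ]; then min=$c; fi
  done
  # Skew after placement = (pods in candidate domain + 1) - global minimum.
  [ $(( target_count + 1 - min )) -le "$max_skew" ]
)

# Three zones holding 2, 1, and 1 pods, maxSkew 1:
can_place 1 1 2 1 1 && echo "zone with 1 pod: schedulable"
can_place 1 2 2 1 1 || echo "zone with 2 pods: blocked"
```

With `DoNotSchedule`, the "blocked" case is what surfaces as FailedScheduling when the less-loaded zones have no room left for the pod.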
How to use it
- Confirm the cluster’s topology BEFORE designing constraints. Count actual zones / nodes.
- Start with `ScheduleAnyway` in production; flip to `DoNotSchedule` only after verifying placement works.
- For multi-zone clusters, use zone constraints; for single-zone, use node (hostname) constraints.
- Combine carefully with anti-affinity: both filter nodes, and a pod must pass both.
Useful commands
```bash
# Cluster topology
kubectl get nodes --show-labels | grep -oE 'topology.kubernetes.io/zone=[a-z0-9-]+'
kubectl get nodes -o json | jq '.items[] | {name:.metadata.name, zone:.metadata.labels["topology.kubernetes.io/zone"]}'

# Count nodes per zone (--no-headers keeps the ZONE column header out of the counts)
kubectl get nodes --no-headers -L topology.kubernetes.io/zone | awk '{print $NF}' | sort | uniq -c

# Pod placement (the zone label lives on the node, not on the pod)
kubectl get pods -l <selector> -o wide --sort-by=.spec.nodeName
kubectl get pods -l <selector> -o json | \
  jq '.items[] | {name:.metadata.name, node:(.spec.nodeName // "unscheduled")}'

# Per-zone pod count (skips Pending pods, which have no node yet)
kubectl get pods -l <selector> -o json | \
  jq -r '.items[] | .spec.nodeName // empty' | \
  while read -r n; do kubectl get node "$n" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}'; echo; done | \
  sort | uniq -c

# Scheduler decisions
kubectl get events --field-selector reason=FailedScheduling | head

# Test changes safely
kubectl patch deploy <name> --type='strategic' -p '...' # in staging first
```
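The per-zone pod count can be reduced to a single skew figure. The following is a minimal sketch that reads one zone name per matching pod on stdin, in the shape the per-zone lookup produces before `sort | uniq -c`. The function name `skew_from_zones` and the sample zone names are hypothetical. Zones that hold no matching pods never appear in this input, so treat the result as a lower bound unless every zone has at least one pod.

```shell
# skew_from_zones: stdin = one zone name per matching pod
# Prints "<zone>: <count>" per zone, then "skew: <max - min>".
skew_from_zones() {
  sort | uniq -c | awk '
    NR == 1 { min = $1; max = $1 }
    {
      if ($1 < min) min = $1
      if ($1 > max) max = $1
      print $2 ": " $1
    }
    END { if (NR > 0) print "skew: " (max - min) }
  '
}

# Hypothetical sample: four pods across three zones.
printf '%s\n' us-east-1a us-east-1a us-east-1b us-east-1c | skew_from_zones
```

For the sample input this reports counts of 2, 1, and 1, and a skew of 1.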
Patterns
Zone spread + soft
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels: { app: web }
```
Strict zone spread (multi-zone cluster required)
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    minDomains: 3 # expect at least 3 eligible zones
    labelSelector:
      matchLabels: { app: web }
```
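The effect of `minDomains` on this pattern can be modeled in a few lines. Per the Kubernetes documentation, when fewer eligible domains exist than `minDomains`, the global minimum is treated as 0 before skew is computed. The sketch below is a simplified model under that rule only. The function name `can_place_min_domains` and its argument order are hypothetical, and it ignores every other scheduler filter.

```shell
# can_place_min_domains MAX_SKEW MIN_DOMAINS TARGET_COUNT COUNT...
#   TARGET_COUNT: matching pods already in the candidate zone
#   COUNT...:     matching pods in every eligible zone (including the candidate)
can_place_min_domains() (
  max_skew=$1; min_domains=$2; target_count=$3; shift 3
  min=$1
  for c in "$@"; do
    if [ "$c" -lt "$min" ]; then min=$c; fi
  done
  # Fewer eligible zones than minDomains: the global minimum is treated as 0.
  if [ "$#" -lt "$min_domains" ]; then min=0; fi
  [ $(( target_count + 1 - min )) -le "$max_skew" ]
)

# Two zones with one pod each, minDomains 3, maxSkew 1: 1 + 1 - 0 = 2 > 1.
can_place_min_domains 1 3 1 1 1 || echo "2 zones, minDomains 3: blocked"
# Three zones with one pod each: 1 + 1 - 1 = 1.
can_place_min_domains 1 3 1 1 1 1 && echo "3 zones, minDomains 3: schedulable"
```

Under this model, a cluster with fewer zones than `minDomains` still accepts up to `maxSkew` pods per zone and then stops, which is why the symptom is usually "a few pods Running, the rest Pending" and not "nothing schedules".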
Zone + node spread (HA per zone AND per node)
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: db }
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels: { app: db }
```
(Combine with pod anti-affinity for “never two pods on the same node” if needed.)
Single-zone cluster (only node spread)
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels: { app: web }
```
Cluster-wide default spread (1.24+)
```yaml
# kube-scheduler config (cluster admin only)
profiles:
  - pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
            - maxSkew: 3
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
```
Common findings this catches
- `whenUnsatisfiable: DoNotSchedule` + `maxSkew: 1` + `minDomains` > 1 on a single-zone cluster → only 1 pod schedules; rest Pending. (Without `minDomains`, a single zone always has skew 0, so the constraint passes but spreads nothing.)
- `labelSelector` doesn’t match the pods you intend → spread is computed over the wrong group.
- Multiple constraints all `DoNotSchedule` → frequent FailedScheduling; loosen one to `ScheduleAnyway`.
- Skew increases after HPA scale → expected; if the only feasible placement would exceed `maxSkew`, the new pods stay Pending.
- `minDomains: N` in a cluster with fewer than N zones → the global minimum is treated as 0, so each zone accepts at most `maxSkew` pods; the rest stay Pending.
- Existing imbalance after node add: topology spread doesn’t rebalance existing pods; do `kubectl rollout restart deploy <name>`.
- `nodeAffinity` excludes some zones → spread can’t use them; pods congregate in remaining zones.
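The scale-up findings above can be explored with a small simulation. This is a sketch under stated assumptions, not a reproduction of the scheduler: it places replicas one at a time into the least-loaded zone that still has room and keeps skew within `maxSkew` (hard `DoNotSchedule` semantics, no `minDomains`). The function name `simulate_spread` and the per-zone capacity numbers are hypothetical.

```shell
# simulate_spread MAX_SKEW REPLICAS CAPACITY...
#   CAPACITY...: how many of these pods each zone can physically hold
simulate_spread() (
  max_skew=$1; replicas=$2; shift 2
  printf '%s\n' "$*" | awk -v max_skew="$max_skew" -v replicas="$replicas" '
    {
      n = NF
      for (i = 1; i <= n; i++) { cap[i] = $i; cnt[i] = 0 }
      placed = 0
      for (p = 1; p <= replicas; p++) {
        # Global minimum over all zones, including full ones.
        min = cnt[1]
        for (i = 2; i <= n; i++) if (cnt[i] < min) min = cnt[i]
        best = 0
        for (i = 1; i <= n; i++) {
          if (cnt[i] >= cap[i]) continue              # zone has no room
          if (cnt[i] + 1 - min > max_skew) continue   # would exceed maxSkew
          if (best == 0 || cnt[i] < cnt[best]) best = i
        }
        if (best == 0) break                          # this and later pods stay Pending
        cnt[best]++; placed++
      }
      line = ""
      for (i = 1; i <= n; i++) line = line (i > 1 ? "-" : "") cnt[i]
      print "placed=" placed " pending=" (replicas - placed) " distribution=" line
    }'
)

# 7 replicas, maxSkew 1, three zones that can hold 3, 3, and 1 pods.
simulate_spread 1 7 3 3 1
```

In this run the small zone fills at 1 pod and pins the global minimum, so the larger zones stop at 2 pods each even though they have spare room: 5 pods placed, 2 Pending.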
When to escalate
- Cluster topology issues (no zone labels on nodes) — engage cluster admin; cloud node provisioning should set these.
- Frequent FailedScheduling in HA-critical workloads — review entire scheduling decision; topology spread may not be the only constraint.
- Scheduling profile customization (cluster-wide defaults) — coordinate with cluster admin; affects every workload.
Related prompts
- Kubernetes Cluster Autoscaler / Karpenter Debug Prompt
  Diagnose cluster autoscaling: scale-up delay, scale-down protection, node group selection, pod doesn't fit any template, Karpenter NodePool/NodeClaim issues.
- Kubernetes Deployment Rollout Debug Prompt
  Diagnose stuck Deployment rollouts: `ProgressDeadlineExceeded`, replica set churn, maxSurge/maxUnavailable misconfig, image pull pacing, and stuck-mid-rollout recovery.
- Kubernetes `FailedScheduling` Debug Prompt
  Diagnose `FailedScheduling` events: taints/tolerations mismatch, node affinity, topology spread skew, resource fit failures, and PV zone constraints.