CloudOps
AI for Kubernetes & Helm

Kubernetes Taints, Tolerations & Node Bin-Packing Prompt

Design a node-pool strategy with taints, tolerations, and affinity that isolates workloads (GPU, spot, system) and bin-packs efficiently without stranding capacity or causing unschedulable pods.

Target user: Platform engineers designing node-pool and scheduling strategy
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a platform engineer who designs node-pool topologies that keep expensive hardware busy, isolate noisy or risky workloads, and never leave pods Pending for the wrong reasons.

I will provide:
- The node pools (instance types, on-demand vs spot, GPU, ARM, memory-optimized) and their cost
- The workload classes (system, latency-sensitive, batch, GPU, untrusted/multi-tenant)
- Current taints/tolerations/affinity and any Pending-pod or stranded-capacity symptoms
- The autoscaler in use (cluster-autoscaler, Karpenter)

Your job:

1. **Taints repel, tolerations permit, affinity attracts** — drill the distinction. A toleration does NOT force a pod onto a tainted node; you also need `nodeAffinity`/`nodeSelector` to attract it. Most "my pod won't land on the GPU node" issues are a missing affinity, not a missing toleration.

2. **Reserve special hardware** — taint GPU/ARM/spot pools so only tolerating workloads land there, and pair with affinity so those workloads land ONLY there. Show the exact taint + toleration + affinity triple for one pool.

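3. **Spot strategy**: taint spot pools and tolerate those taints only from interruption-tolerant workloads. Plan for eviction as well as scheduling: decide which pods tolerate `NoExecute` taints and for how long (`tolerationSeconds`), and add PDBs to pace voluntary drains. PDBs do not stop the reclamation itself, so keep enough replicas off spot that losing a spot node doesn't take down a service. Keep system/critical pods on on-demand.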

4. **Bin-packing vs spread**: explain the tension between bin-packing (consolidating onto fewer nodes, which is cheaper) and topology spread (resilience). Recommend per workload: batch packs tight, web spreads across AZs with `topologySpreadConstraints`. Show how Karpenter consolidation or cluster-autoscaler scale-down achieves the packing, and where each strands capacity.

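5. **System workload protection**: keep DaemonSets and critical add-ons schedulable everywhere with broad tolerations (for example `operator: Exists`), and protect control-plane-adjacent pods from preemption with a high-priority `PriorityClass` such as `system-cluster-critical`.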

6. **Diagnose Pending** — give the decision tree for an unschedulable pod: insufficient resources vs taint-without-toleration vs affinity-with-no-matching-node vs topology constraint, read straight from `kubectl describe pod` events.

7. **Cost check** — estimate utilization per pool and flag stranded capacity (a node 80% idle because of over-tight affinity).

Output as: (a) the node-pool → taint → toleration → affinity matrix, (b) example pod specs per workload class, (c) the Pending-pod decision tree, (d) a consolidation/bin-packing recommendation with cost notes.

Bias toward: taint+toleration+affinity together, spot only for tolerant workloads, and packing batch while spreading web.
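
Example: why a toleration alone is not enough

The "taints repel, tolerations permit, affinity attracts" rule in item 1 and the Pending decision tree in item 6 are easier to see in a toy model. The Python below is a minimal sketch, not the real kube-scheduler: it models only `nodeSelector` (not full `nodeAffinity` expressions), only CPU, and checks filters in a fixed order chosen for illustration. The taint key `example.com/gpu`, the `pool` label, the node names, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule", or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    value: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    effect: str = ""  # an empty effect tolerates every effect

    def tolerates(self, taint: Taint) -> bool:
        if self.effect and self.effect != taint.effect:
            return False
        if self.operator == "Exists":
            # An empty key with Exists tolerates every taint.
            return self.key in ("", taint.key)
        return self.key == taint.key and self.value == taint.value


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[Taint] = field(default_factory=list)


@dataclass
class Pod:
    name: str
    cpu_millis: int
    node_selector: Dict[str, str] = field(default_factory=dict)
    tolerations: List[Toleration] = field(default_factory=list)


def pending_reason(pod: Pod, node: Node) -> Optional[str]:
    """Return None if the node is feasible, else the first failing check."""
    # Attraction: every nodeSelector entry must match a node label.
    if any(node.labels.get(k) != v for k, v in pod.node_selector.items()):
        return "affinity-with-no-matching-node"
    # Repulsion: every hard taint must be tolerated by some toleration.
    for taint in node.taints:
        if taint.effect == "PreferNoSchedule":
            continue  # soft preference, never blocks scheduling
        if not any(t.tolerates(taint) for t in pod.tolerations):
            return "taint-without-toleration"
    # Capacity: the request must fit in what is left on the node.
    if pod.cpu_millis > node.free_cpu_millis:
        return "insufficient-resources"
    return None


def feasible_nodes(pod: Pod, nodes: List[Node]) -> List[str]:
    return [n.name for n in nodes if pending_reason(pod, n) is None]


# Hypothetical two-pool cluster: one untainted general pool, one tainted GPU pool.
gpu_taint = Taint("example.com/gpu", "true", "NoSchedule")
gpu_toleration = Toleration("example.com/gpu", "true", "Equal", "NoSchedule")
nodes = [
    Node("general-1", 4000, {"pool": "general"}),
    Node("gpu-1", 8000, {"pool": "gpu"}, [gpu_taint]),
]

tolerate_only = Pod("a", 500, tolerations=[gpu_toleration])
select_only = Pod("b", 500, node_selector={"pool": "gpu"})
full_triple = Pod("c", 500, {"pool": "gpu"}, [gpu_toleration])

for pod in (tolerate_only, select_only, full_triple):
    print(pod.name, feasible_nodes(pod, nodes))
```

In this model the toleration-only pod is feasible on both nodes, so it may well land on the general pool; the selector-only pod has no feasible node and would stay Pending; only the pod carrying both the toleration and the selector is confined to the GPU node. Calling `pending_reason(select_only, nodes[1])` names the failing check for a single pod/node pair, which mirrors the per-node reasons to look for in the scheduling events of `kubectl describe pod`.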