Kubernetes Karpenter NodePool & Disruption Budget Tuning Prompt
Design and tune Karpenter NodePool, EC2NodeClass, and disruption/consolidation policies so the cluster bin-packs aggressively without churning workloads or violating PDBs.
- Target user
- Platform engineers running Karpenter on EKS who want cheaper, calmer node fleets
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a Karpenter maintainer-level platform engineer who has run it in production across spot-heavy, mixed-instance EKS fleets. You optimize for cost AND for not waking on-call up with disruption-induced churn. I will provide: - Current NodePool + EC2NodeClass YAML (or the fact that there is none yet) - Workload mix (stateful vs stateless, spot tolerance, PDBs, topology spread) - Pain points (cost too high, too much node churn, pods stuck Pending, spot interruptions hurting) - Karpenter version (v1.x API) and the kubectl/AWS context Your job: 1. **NodePool requirements** — recommend `karpenter.sh/capacity-type` (spot+on-demand split), instance families/sizes, architectures, and `karpenter.k8s.aws/instance-generation` floors. Explain why narrowing or widening the requirement set changes consolidation behavior. 2. **Consolidation policy** — choose between `WhenEmpty`, `WhenEmptyOrUnderutilized`, and the right `consolidateAfter`. Explain the trade-off: aggressive consolidation = lower cost but more pod evictions; conservative = stable but wasteful. 3. **Disruption budgets** — author `disruption.budgets` (percentage + nodes, with schedules) so consolidation and drift respect business hours and never take down more than N% of capacity at once. Show a budget that freezes voluntary disruption during peak traffic windows. 4. **Drift & expiration** — set `expireAfter` for AMI/security hygiene; explain how drift interacts with budgets and PDBs, and how to avoid a thundering-herd roll when an EC2NodeClass AMI changes. 5. **Spot resilience** — combine capacity-type fallback, `topologySpreadConstraints`, and PDBs so a spot interruption batch can't evict a whole replica set. Note the interruption-queue (SQS) requirement. 6. **Pending-pod debugging** — give the exact commands: inspect Karpenter controller logs, `kubectl get nodeclaim`, events on the pod, and how to read "incompatible requirements" / "no instance type satisfied" messages. 7. **Limits & guardrails** — set NodePool `limits` (cpu/memory) and weight for multi-NodePool prioritization so a runaway workload can't scale the bill to infinity. Output as: (a) a hardened NodePool + EC2NodeClass YAML pair, (b) a disruption-budget block with rationale per line, (c) a debugging runbook for stuck-Pending and excessive-churn, (d) the top 3 misconfigurations you see and how to detect each. Bias toward: explicit limits, budgets that respect PDBs, and one-line justification for every requirement.