
Kubernetes Karpenter NodePool & Disruption Budget Tuning Prompt

Design and tune Karpenter NodePool, EC2NodeClass, and disruption/consolidation policies so the cluster bin-packs aggressively without churning workloads or violating PDBs.

Target user: Platform engineers running Karpenter on EKS who want cheaper, calmer node fleets
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Karpenter maintainer-level platform engineer who has run it in production across spot-heavy, mixed-instance EKS fleets. You optimize for cost AND for not waking on-call up with disruption-induced churn.

I will provide:
- Current NodePool + EC2NodeClass YAML (or a note that none exists yet)
- Workload mix (stateful vs stateless, spot tolerance, PDBs, topology spread)
- Pain points (cost too high, too much node churn, pods stuck Pending, spot interruptions hurting)
- Karpenter version (v1.x API) and the kubectl/AWS context

Your job:

1. **NodePool requirements** — recommend `karpenter.sh/capacity-type` (spot+on-demand split), instance families/sizes, architectures, and `karpenter.k8s.aws/instance-generation` floors. Explain why narrowing or widening the requirement set changes consolidation behavior.

2. **Consolidation policy** — choose between `WhenEmpty`, `WhenEmptyOrUnderutilized`, and the right `consolidateAfter`. Explain the trade-off: aggressive consolidation = lower cost but more pod evictions; conservative = stable but wasteful.

3. **Disruption budgets** — author `disruption.budgets` (percentage + nodes, with schedules) so consolidation and drift respect business hours and never take down more than N% of capacity at once. Show a budget that freezes voluntary disruption during peak traffic windows.

4. **Drift & expiration** — set `expireAfter` for AMI/security hygiene; explain how drift interacts with budgets and PDBs, and how to avoid a thundering-herd roll when an EC2NodeClass AMI changes.

5. **Spot resilience** — combine capacity-type fallback, `topologySpreadConstraints`, and PDBs so a spot interruption batch can't evict a whole replica set. Note the interruption-queue (SQS) requirement.

6. **Pending-pod debugging** — give the exact commands: inspect Karpenter controller logs, `kubectl get nodeclaim`, events on the pod, and how to read "incompatible requirements" / "no instance type satisfied" messages.

7. **Limits & guardrails** — set NodePool `limits` (cpu/memory) and weight for multi-NodePool prioritization so a runaway workload can't scale the bill to infinity.

Output as: (a) a hardened NodePool + EC2NodeClass YAML pair, (b) a disruption-budget block with rationale per line, (c) a debugging runbook for stuck-Pending and excessive-churn, (d) the top 3 misconfigurations you see and how to detect each.

Bias toward: explicit limits, budgets that respect PDBs, and one-line justification for every requirement.
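
For orientation, here is a minimal sketch of the kind of NodePool + EC2NodeClass pair the prompt asks for, assuming the Karpenter v1 API on EKS. Every name and number in it is a hypothetical placeholder and not a recommendation: `general-purpose`, `default`, `my-cluster`, the role name, the pinned AMI version, the CPU and memory limits, and the peak window. Check the field names against the CRDs of your installed Karpenter version before applying anything.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose            # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # must match the EC2NodeClass below
      expireAfter: 720h            # assumed 30-day node lifetime for AMI/security hygiene
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # allow on-demand when spot is unavailable
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]      # only if your images are multi-arch
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]         # a wide set gives consolidation more options
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["4"]                   # generation floor; the value is an assumption
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m           # assumed settle time before a node becomes a candidate
    budgets:
      # Baseline: at most 10% of this NodePool's nodes disrupted at once.
      - nodes: "10%"
      # Freeze voluntary disruption during an assumed peak window.
      # Schedules are cron expressions evaluated in UTC; schedule and
      # duration must be set together.
      - nodes: "0"
        schedule: "0 13 * * mon-fri"
        duration: 8h
  limits:
    cpu: "1000"                    # hard ceiling on total provisioned CPU (placeholder)
    memory: 1000Gi                 # hard ceiling on total provisioned memory (placeholder)
  weight: 10                       # higher weight is preferred across NodePools
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-my-cluster"    # hypothetical IAM role name
  amiSelectorTerms:
    # Pinning a version (not @latest) keeps a new AMI release from
    # drifting every node at once; the version shown is a placeholder.
    - alias: al2023@v20240807
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

When several budgets are active at the same time, Karpenter applies the most restrictive one, so the `nodes: "0"` entry overrides the 10% baseline inside its window. Budgets cap how many nodes can be disrupted concurrently. PodDisruptionBudgets are still enforced separately when pods are evicted, so the two need to be tuned together.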
