Kubernetes Descheduler Strategy & Rebalancing Prompt
Design and tune a Kubernetes Descheduler configuration that fixes node imbalance, evicts pods that violate affinity or topology rules, and reclaims stranded capacity, all without fighting your node autoscaler or HPA.
- Target user: Cluster operators dealing with lopsided node utilization after scale events
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an SRE who runs the Kubernetes Descheduler in production and has learned exactly when its evictions help and when they cause an eviction storm.

Context I will give you:
- Node count, instance types, and current per-node utilization spread
- Whether I run Cluster Autoscaler, Karpenter, HPA, and/or VPA
- The symptom: hot nodes vs. idle nodes, post-upgrade pile-up, anti-affinity drift, or a goal of consolidating onto fewer nodes
- PodDisruptionBudgets and any stateful workloads

Walk me through a safe rollout:
1. **Decide if you even need it.** The descheduler only evicts pods; the scheduler decides where the replacements land. If my scheduler config or autoscaler is the root cause, fix that first. State when descheduling is the wrong tool.
2. **Pick strategies deliberately.** For each enabled plugin (`RemoveDuplicates`, `LowNodeUtilization`, `HighNodeUtilization`, `RemovePodsViolatingTopologySpreadConstraint`, `RemovePodsViolatingInterPodAntiAffinity`, `RemovePodsViolatingNodeTaints`, `RemovePodsHavingTooManyRestarts`), explain the trigger, the risk, and a sane threshold. Call out that `LowNodeUtilization` (spread) and `HighNodeUtilization` (consolidate) express mutually exclusive intents and should not run in the same profile.
3. **Guardrails.** Cover `maxNoOfPodsToEvictPerNode`, `maxNoOfPodsToEvictPerNamespace`, namespace include/exclude, `evictSystemCriticalPods: false`, respecting PDBs, and a `nodeFit: true` pre-check so the descheduler avoids evicting a pod that has nowhere viable to go.
4. **Autoscaler interaction.** Explain how `HighNodeUtilization` pairs with Karpenter/CA consolidation, and how to avoid a thrash loop in which the descheduler evicts, the autoscaler scales up, and consolidation then removes the new nodes.
5. **Run mode.** Compare CronJob vs. Deployment (continuous, with `deschedulingInterval`); recommend one and justify it.
6. **Observe.** List which metrics and events to watch (for example `descheduler_pods_evicted`), and include a `--dry-run` validation pass before enabling real evictions.
Output: (a) a complete `DeschedulerPolicy` (v1alpha2 API) for my scenario, (b) the CronJob or Deployment manifest, (c) a thresholds table with rationale, (d) a pre-flight checklist, (e) the top 3 ways this goes wrong and the symptom each produces.
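
For orientation, output (a) might resemble the minimal `DeschedulerPolicy` sketch below. It assumes a spread-oriented scenario (`LowNodeUtilization` plus topology-spread repair). The profile name, threshold percentages, eviction caps, and excluded namespace are illustrative placeholders, not recommendations. Check the field names against the descheduler release you actually deploy.

```yaml
# Illustrative sketch only: every numeric value and the profile name are assumptions.
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"

# Global guardrails: cap the evictions in a single descheduling cycle.
maxNoOfPodsToEvictPerNode: 5
maxNoOfPodsToEvictPerNamespace: 10

profiles:
  - name: rebalance            # hypothetical profile name
    pluginConfig:
      # DefaultEvictor decides which pods are eligible for eviction at all.
      - name: "DefaultEvictor"
        args:
          evictSystemCriticalPods: false
          evictLocalStoragePods: false
          nodeFit: true        # skip pods that would not fit on any other node
      # Spread intent: do not combine with HighNodeUtilization in the same profile.
      - name: "LowNodeUtilization"
        args:
          thresholds:          # below ALL of these, a node counts as underutilized
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:    # above ANY of these, a node counts as overutilized
            cpu: 50
            memory: 50
            pods: 50
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:
            - DoNotSchedule    # only act on hard constraints
          namespaces:
            exclude:
              - "kube-system"
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
          - "RemovePodsViolatingTopologySpreadConstraint"
```

Nodes that fall between `thresholds` and `targetThresholds` are treated as appropriately utilized and left alone, so the gap between the two sets of values controls how aggressively the descheduler rebalances. Running the descheduler with `--dry-run` against a policy like this is a low-risk way to see which pods it would evict before allowing real evictions.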