
Kubernetes Descheduler Strategy & Rebalancing Prompt

Design and tune a Kubernetes Descheduler configuration to fix node imbalance, evict pods violating affinity/topology rules, and reclaim stranded capacity — without fighting your autoscaler or HPA.

Target user: Cluster operators dealing with lopsided node utilization after scale events
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE who runs the Kubernetes Descheduler in production and has learned exactly when its evictions help and when they cause an eviction storm.

Context I will give you:
- Node count, instance types, and current per-node utilization spread
- Whether you run Cluster Autoscaler, Karpenter, HPA, and/or VPA
- The symptom or goal: hot nodes next to idle ones, pods piled onto a few nodes after an upgrade, anti-affinity drift, or consolidation onto fewer nodes
- PodDisruptionBudgets and any stateful workloads

Walk me through a safe rollout:

1. **Decide if you even need it**: the descheduler only evicts pods; the scheduler decides where they land next. If your scheduler config or autoscaler is the root cause, fix that first. State when descheduling is the wrong tool.

2. **Pick strategies deliberately**: for each enabled plugin (`RemoveDuplicates`, `LowNodeUtilization`, `HighNodeUtilization`, `RemovePodsViolatingTopologySpreadConstraint`, `RemovePodsViolatingInterPodAntiAffinity`, `RemovePodsViolatingNodeTaints`, `RemovePodsHavingTooManyRestarts`) explain the trigger, the risk, and a sane threshold. Call out that `LowNodeUtilization` (spread) and `HighNodeUtilization` (consolidate) are mutually exclusive intents.

3. **Guardrails**: `maxNoOfPodsToEvictPerNode`, `maxNoOfPodsToEvictPerNamespace`, namespace include/exclude, `evictSystemCriticalPods: false`, respecting PDBs, and `nodeFit: true` as a pre-check so it never evicts a pod that has nowhere viable to go.

4. **Autoscaler interaction**: how `HighNodeUtilization` pairs with Karpenter/CA consolidation, and how to avoid a thrash loop where the descheduler evicts pods, the autoscaler scales up to place them, and consolidation then removes that capacity again.

5. **Run mode**: CronJob vs Deployment (continuous, driven by `deschedulingInterval`); recommend one and justify it.

6. **Observe**: which metrics/events to watch (for example `descheduler_pods_evicted`), and a `--dry-run` validation pass before enabling real evictions.

Output: (a) a complete `DeschedulerPolicy` (v1alpha2 API) for my scenario, (b) the CronJob or Deployment manifest, (c) a thresholds table with rationale, (d) a pre-flight checklist, (e) the top 3 ways this goes wrong and the symptom each produces.
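
For orientation, output (a) might resemble the sketch below: a minimal `descheduler/v1alpha2` policy for a spread-oriented scenario. The profile name `rebalance` and every number in it are placeholder assumptions rather than recommendations. The field layout follows the upstream v1alpha2 examples and should be checked against the descheduler release you actually run, since evictor settings have changed between versions.

```yaml
# Illustrative sketch only: names and thresholds are assumptions, not tuned values.
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"

# Global eviction caps per descheduling cycle (placeholder numbers).
maxNoOfPodsToEvictPerNode: 5
maxNoOfPodsToEvictPerNamespace: 10

profiles:
  - name: rebalance            # hypothetical profile name
    pluginConfig:
      - name: "DefaultEvictor"
        args:
          evictSystemCriticalPods: false   # leave system-critical pods alone
          nodeFit: true                    # skip pods with no other node that fits
      - name: "LowNodeUtilization"
        args:
          # Nodes below ALL of these percentages count as underutilized.
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          # Nodes above ANY of these percentages are candidates to evict from.
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:
            - DoNotSchedule    # only act on hard spread constraints
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
          - "RemovePodsViolatingTopologySpreadConstraint"
```

This sketch enables `LowNodeUtilization` only; a consolidation scenario would swap in `HighNodeUtilization` instead of running both. Running the descheduler once with its `--dry-run` flag against a policy like this logs what would be evicted without evicting anything, which is a reasonable first check before real evictions are enabled.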