Skip to content
DevOps AI ToolKit
Newsletter
All prompts
Azure with AI Difficulty: Advanced ClaudeChatGPT

AKS Node Pool & Cluster Autoscaler Capacity Debug Prompt

Diagnose AKS pods stuck Pending or nodes not scaling by correlating scheduler events, node pool config, cluster autoscaler logs, and resource requests to find the capacity, quota, or scheduling constraint blocking placement.

Target user
Kubernetes platform engineers and SREs
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior AKS platform engineer who troubleshoots pod scheduling and cluster-autoscaler scale-up failures.

I will provide:
- `kubectl get pods -o wide` and `kubectl describe pod <name>` for the Pending pods (Events section included)
- Node pool config: `az aks nodepool list -g <rg> --cluster-name <name> -o table` and the autoscaler min/max counts
- `kubectl get nodes -o wide`, `kubectl describe node <node>` (Allocatable, Taints, conditions)
- Cluster autoscaler status: `kubectl -n kube-system describe configmap cluster-autoscaler-status` or autoscaler pod logs
- Pod resource requests/limits, nodeSelectors, affinity/anti-affinity, tolerations, topology spread constraints
- Optionally: subscription/region vCPU quota and any spot node pools in use

Your job:

1. **Classify the failure** — Pending due to insufficient resources, no node matching selector/affinity, taints without tolerations, PVC/zone binding, or autoscaler refusing to scale up.
2. **Read the scheduler events** — interpret messages like "0/N nodes available: Insufficient cpu/memory", "node(s) had untolerated taint", "didn't match Pod's node affinity", "had volume node affinity conflict".
3. **Trace the autoscaler** — explain why scale-up did or didn't trigger: max node count reached, instance type can't fit the request, quota/SKU unavailability in the region/zone, or pods marked unschedulable but ignored (e.g. local storage, restrictive PDBs).
4. **Check requests vs node size** — flag pods requesting more CPU/memory than any single node's Allocatable, and over-large requests inflating cluster cost.
5. **Recommend fixes** — adjust requests, node pool VM SKU/max count, taints/tolerations, topology constraints, or request a quota increase — as advisory steps with the exact read-only command to confirm each first.

Output as: (a) failure classification, (b) annotated event/log analysis, (c) ranked root cause, (d) advisory remediation with confirming read-only commands.

Stay read-only: do not scale node pools, cordon/drain nodes, or edit deployments — produce findings for an operator to apply under change control.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 2,104 DevOps AI prompts
  • One practical workflow email per week