Troubleshooting AKS With AI: From CrashLoopBackOff to Root Cause

A deployment that worked in three other clusters wouldn’t schedule on a new AKS cluster. The pods sat in Pending. kubectl describe showed 0/3 nodes are available: 3 Insufficient cpu — except the nodes were nearly idle. The real story was two layers down: a LimitRange in the namespace was injecting a CPU request far larger than anyone wrote, and the node pool VM SKU was too small to satisfy it. Three pieces of evidence in three different places, and the error message pointed at none of them.

That’s AKS troubleshooting in a sentence. The failure surfaces in Kubernetes, but the cause can live in the Azure node pool, the CNI, an admission controller, or the managed control plane you can’t even SSH into. The evidence is spread across kubectl, az aks, and node-level logs. AI is genuinely good at the synthesis step — taking events from describe, logs from the container, and node conditions, and telling you which thread to pull. It does not run your cluster. You run the commands; it connects the dots.

Start where Kubernetes tells you, then widen

For a crashing pod, the first three commands are always the same, and the events at the bottom of describe are worth more than the logs:

kubectl describe pod "$POD" -n "$NS"            # events, probes, scheduling
kubectl logs "$POD" -n "$NS" --previous          # the crashed container, not the restart
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -30

When the events plus logs run long, paste both into AI together — the combination is what produces a diagnosis:

Prompt: “Here are kubectl describe pod events and the --previous container logs for a pod in CrashLoopBackOff. Correlate them: what is the kill reason (OOMKilled, liveness probe, exit code), and what does the application log show in the seconds before it died? Give me the single most likely root cause and the one command that would confirm it.”

The thing AI does well here is correlation across sources. An OOMKilled in the events plus a slow memory climb in the logs is a leak, not an under-sized limit, and the fix is different. A human gets there too, but slower, especially when the events are forty lines of probe noise.
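Before you paste anything, it helps to decode the exit code yourself so you can sanity-check whatever kill reason the model names. The sketch below is one way to do that; explain_exit_code is a hypothetical helper name, not a kubectl feature, and it relies only on the shell convention that an exit code above 128 means "killed by signal (code minus 128)".

```shell
#!/usr/bin/env bash
# Sketch: map a container exit code to a likely kill reason.
# explain_exit_code is a hypothetical helper, not part of kubectl.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: clean exit; the process finished and the container was restarted"
  elif [ "$code" -gt 128 ]; then
    # Codes above 128 mean the process was terminated by signal (code - 128).
    local sig=$(( code - 128 ))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL): OOM kill, or a forced kill after the grace period ran out" ;;
      15) echo "signal 15 (SIGTERM): graceful stop request, e.g. after a failed liveness probe or an eviction" ;;
      *)  echo "signal $sig: terminated by a signal" ;;
    esac
  else
    echo "exit $code: the application exited on its own; read the --previous logs"
  fi
}

# The reason and exit code come from the pod status, for example:
#   kubectl get pod "$POD" -n "$NS" -o \
#     jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
explain_exit_code 137
explain_exit_code 1
```

An exit code of 137 alone doesn't distinguish an OOM kill from a forced kill, which is why the reason field and the events still matter.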

Node pool problems wear Kubernetes costumes

Pending pods, mysterious evictions, and Insufficient cpu on idle nodes often trace back to the Azure side. Check the node pool shape and the node conditions before you touch your manifests:

az aks nodepool list --cluster-name "$AKS" --resource-group "$RG" \
  --query "[].{name:name, sku:vmSize, count:count, max:maxCount, mode:mode, prov:provisioningState}" -o table

kubectl describe nodes | grep -A5 "Conditions:"
kubectl get nodes -o custom-columns='NODE:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu,ALLOCATABLE_MEM:.status.allocatable.memory'

The gap between a node’s capacity and its allocatable trips people up: the kubelet and OS reservations plus the eviction threshold are subtracted up front, and system pods such as kube-proxy then claim part of what remains, so a 2-vCPU node doesn’t give you 2 vCPU. Feed the numbers to AI:

Prompt: “A node is a Standard_D2s_v5 (2 vCPU, 8 GiB). Allocatable CPU shows 1900m and allocatable memory 5.8Gi. My deployment requests 1 CPU and 2Gi per pod, replicas 3, with a LimitRange in the namespace. Explain why only one pod schedules per node, account for the reservation gap, and tell me whether to change the request, the LimitRange, or the node SKU.”

This is where AI shines: the math is deterministic and it’ll catch the LimitRange interaction that the bare scheduler error hides.
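The scheduling math from the prompt is simple enough to check by hand, which is a good way to verify the model's answer. Here is a minimal sketch using the numbers above; pods_per_node is a hypothetical helper, values are in millicores and Mi (5.8Gi is roughly 5939Mi), and it ignores the requests that system pods already hold on the node, so real headroom will be lower.

```shell
#!/usr/bin/env bash
# Sketch: how many pods fit on one node, by CPU and by memory.
# Usage: pods_per_node ALLOC_CPU_M ALLOC_MEM_MI REQ_CPU_M REQ_MEM_MI
# Assumption: nothing else is scheduled on the node (system pods are ignored).
pods_per_node() {
  local by_cpu=$(( $1 / $3 ))   # whole pods that fit by CPU request
  local by_mem=$(( $2 / $4 ))   # whole pods that fit by memory request
  # The tighter of the two resources decides.
  if [ "$by_cpu" -lt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
}

# The effective request is what the pod spec holds after LimitRange defaults apply:
#   kubectl get limitrange -n "$NS" -o yaml
#   kubectl get pod "$POD" -n "$NS" -o jsonpath='{.spec.containers[*].resources.requests}'

pods_per_node 1900 5939 1000 2048   # CPU allows 1, memory allows 2, so 1
pods_per_node 1900 5939 500 2048    # CPU allows 3, memory allows 2, so 2
```

With a 1 CPU request, two pods would need 2000m against 1900m allocatable, so the second pod stays Pending. Halving the CPU request moves the bottleneck to memory.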

CNI and networking errors need decoding, not guessing

Azure CNI failures are some of the most cryptic AKS errors: FailedCreatePodSandBox, IP exhaustion on the subnet, or Plugin returned error. With traditional Azure CNI (not Overlay), every pod consumes a real VNet IP, so a subnet that’s too small silently caps your pod count long before you hit a node limit.

# How big is the AKS subnet, and how many IP configurations are already attached to it?
az network vnet subnet show --vnet-name "$VNET" --name "$SUBNET" --resource-group "$RG" \
  --query "{prefix:addressPrefix, ipConfigs:length(ipConfigurations)}" -o json

kubectl get events -A --field-selector reason=FailedCreatePodSandBox

Paste a raw sandbox-creation error to AI and ask it to translate:

Prompt: “This AKS pod failed with FailedCreatePodSandBox ... plugin type=azure-vnet failed (add): Failed to allocate address. Given Azure CNI assigns a VNet IP per pod, explain the cause, how to confirm subnet IP exhaustion, and the two real fix options (bigger subnet vs. switching to CNI Overlay), with the trade-offs.”

The CNI error vocabulary is small and well-documented, which is exactly why AI translates it reliably — these aren’t novel failures, they’re the same dozen errors the platform emits. You confirm the IP count; the model names the failure.
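Confirming the IP count is arithmetic on the subnet prefix and the number of addresses already in use. A rough sketch, assuming an IPv4 subnet: Azure reserves five addresses in every subnet, and with traditional Azure CNI each node holds one IP for itself plus one per pod slot. Both function names are hypothetical, and the max-pods value of 30 is only an example; check your own node pool.

```shell
#!/usr/bin/env bash
# Sketch: free IPs in an IPv4 subnet, and how many more nodes would fit.
# Usage: subnet_free_ips CIDR USED_IPS
subnet_free_ips() {
  local len="${1##*/}"                  # prefix length, e.g. 24
  local total=$(( 1 << (32 - len) ))    # addresses in the prefix
  echo $(( total - 5 - $2 ))            # Azure reserves 5 per subnet
}

# Usage: nodes_that_fit FREE_IPS MAX_PODS
# Assumption: traditional Azure CNI, so each node needs 1 + MAX_PODS addresses.
nodes_that_fit() {
  echo $(( $1 / ($2 + 1) ))
}

# MAX_PODS can be read from the node pool, for example:
#   az aks nodepool show --cluster-name "$AKS" --resource-group "$RG" --name "$POOL" --query maxPods
free=$(subnet_free_ips "10.240.0.0/24" 120)   # 256 - 5 - 120 = 131
echo "free IPs: $free"
echo "additional nodes that fit: $(nodes_that_fit "$free" 30)"   # 131 / 31 = 4
```

If the second number is smaller than the headroom your autoscaler or upgrade surge needs, the subnet is the bottleneck, whatever the scheduler error says.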

Use AKS’s own diagnostics and Container Insights

The az CLI ships a few AKS health checks and, if you’ve enabled Container Insights, you also get a KQL-queryable log store. Start with the ACR connectivity check and a summary of cluster state:

az aks check-acr --name "$AKS" --resource-group "$RG" --acr "$ACR.azurecr.io"
az aks show --name "$AKS" --resource-group "$RG" \
  --query "{power:powerState.code, k8s:kubernetesVersion, net:networkProfile.networkPlugin}" -o json

If Container Insights is on, this KQL finds the noisiest restart offenders across the cluster — far faster than eyeballing kubectl get pods:

KubePodInventory
| where TimeGenerated > ago(1h)
| where PodRestartCount > 0
| summarize MaxRestarts = max(PodRestartCount) by Name, Namespace, ContainerName
| top 15 by MaxRestarts desc

When you don’t know the schema, that’s a fine thing to ask AI to draft — “write KQL against KubePodInventory to show pods restarting more than five times in the last hour, grouped by namespace.” Then sanity-check the column names against your workspace, because AI occasionally guesses a field that doesn’t exist in your table version.
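For reference, here is one shape that request could take, wrapped in a shell function so the threshold and window are easy to change. Treat it as a sketch: restart_kql and WORKSPACE_ID are hypothetical names, the columns are the ones used in the query above, and the max-minus-min step assumes PodRestartCount is a running total, so the difference approximates restarts inside the window.

```shell
#!/usr/bin/env bash
# Sketch: build a KQL query for pods restarting more than N times in a window.
# Usage: restart_kql THRESHOLD WINDOW   (defaults: 5 and 1h)
restart_kql() {
  local threshold="${1:-5}" window="${2:-1h}"
  cat <<EOF
KubePodInventory
| where TimeGenerated > ago(${window})
| summarize Restarts = max(PodRestartCount) - min(PodRestartCount) by Name, Namespace
| where Restarts > ${threshold}
| summarize RestartingPods = dcount(Name), WorstRestarts = max(Restarts) by Namespace
| order by RestartingPods desc
EOF
}

restart_kql 5 1h

# To run it from the CLI (needs the log-analytics extension; WORKSPACE_ID is
# your Log Analytics workspace GUID, a variable this sketch assumes you set):
#   az monitor log-analytics query --workspace "$WORKSPACE_ID" \
#     --analytics-query "$(restart_kql 5 1h)" -o table
```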

Keep your hands on the wheel

The discipline that makes this safe: AI reads and hypothesizes, you verify and act. Don’t let it talk you into kubectl delete pod as a “fix” before you understand why the pod died — you’ll just lose the evidence and meet the same failure in an hour. The loop that works is tight: gather events plus logs plus node state, let AI correlate them into one root-cause hypothesis, run the single confirming command it suggests, and only then change one thing. AKS is a managed product, which means the failures cluster around a known set of platform behaviors — IP-per-pod CNI, allocatable reservations, admission-controller surprises — and that finite, well-documented surface is precisely what makes an LLM a useful reader of the mess.
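The gather step of that loop can be a single read-only function, so every triage starts from the same evidence. A minimal sketch, reusing the commands from earlier in this post; collect_triage and the KUBECTL override are hypothetical conveniences, and the 200-line tail is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: collect events, previous logs, and node state into one paste-able bundle.
# Usage: collect_triage POD NAMESPACE > triage.txt
# Read-only: it only runs describe, logs, and get.
collect_triage() {
  local pod="$1" ns="$2" k="${KUBECTL:-kubectl}"
  echo "## describe pod"
  "$k" describe pod "$pod" -n "$ns"
  echo "## previous container logs (last 200 lines)"
  "$k" logs "$pod" -n "$ns" --previous --tail=200
  echo "## namespace events"
  "$k" get events -n "$ns" --sort-by='.lastTimestamp'
  echo "## node allocatable"
  "$k" get nodes -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'
}

# Example: collect_triage "$POD" "$NS" > triage.txt
```

The section headers make it easier for the model, and for you, to tell which source a given line came from.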

I keep my AKS triage prompts in the prompts library, and there’s more Azure operations material in the Azure category. The cluster is yours to run; let the model do the reading across the three places the answer is hiding.
