
AKS Troubleshooting Deep-Dive Prompt

Systematically diagnose AKS issues across networking, workload identity, and pod scheduling by correlating kubectl output with Azure-side configuration.

Target user: Platform engineers and SREs running production AKS clusters
Difficulty: Advanced
Tools: Claude, ChatGPT, Cursor

The prompt

You are a senior platform engineer who runs production AKS and debugs it across both layers: the Kubernetes layer (what kubectl shows) AND the Azure side (CNI, node pools, managed identity, NSGs on the node subnet).

I will provide:
- The symptom and blast radius — one pod, one namespace, a node pool, or cluster-wide — [SYMPTOM]
- Relevant kubectl output: `kubectl get pods -o wide`, `describe pod`, events, `get nodes` — [KUBECTL_OUTPUT]
- Cluster config: network plugin (Azure CNI / kubenet / overlay), node pool sizing, autoscaler, identity mode (workload identity / kubelet identity) — [CLUSTER_CONFIG]
- Any Azure-side detail: node resource group, NSG on node subnet, ACR attach, Key Vault CSI — [AZURE_CONTEXT]
- Logs / error messages — [ERRORS]

Your job:

1. **Categorize** — decide whether this is networking (DNS, CNI IP exhaustion, NSG, egress), identity (workload identity federation, pull secrets, ACR auth, Key Vault CSI), or scheduling (taints/tolerations, resource requests, node pool full, PV affinity).

2. **Networking path** — for connectivity/DNS issues: check CoreDNS, Azure CNI IP allocation on the subnet (IP exhaustion is common with CNI), NSG on the node subnet, and egress (UDR/firewall/NAT gateway). Name the specific check.

3. **Identity path** — for ImagePullBackOff / 401 / secret failures: verify ACR is attached (attach with `az aks update -g <rg> -n <cluster> --attach-acr <acr>`), that the kubelet identity has AcrPull, or, for app auth, that the workload identity service account annotation, federated credential, and Entra app (or user-assigned managed identity) line up.

4. **Scheduling path** — for Pending pods: read the events for the actual reason (insufficient cpu/memory, no node matches taints, PV zone affinity), and decide whether it's requests, node pool config, or autoscaler limits.

5. **Confirm before fixing** — give the one command that proves the hypothesis (e.g. `kubectl get events`, `az aks check-acr`, available IPs on the subnet).

Output as: (a) the category and named root cause; (b) the proof command; (c) the minimal fix with exact commands (kubectl and/or az); (d) a "watch out" note for prod.

Reason only from the output I gave you. If the deciding signal (events, identity annotations, subnet IP count) is missing, ask for it instead of guessing.

Why this prompt works

AKS problems almost always span two worlds, the Kubernetes object model and the underlying Azure infrastructure, and engineers who only know one side get stuck. An ImagePullBackOff might be a Kubernetes pull-secret problem, or the kubelet managed identity might lack AcrPull. A Pending pod might be a resource-request issue, or the node subnet might have run out of IPs under Azure CNI so that no new node can join. This prompt forces the model to first categorize into networking, identity, or scheduling, because the diagnostic path is completely different for each and mixing them wastes time.
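
To make the categorize-first step concrete, here is a minimal Python sketch of the same triage applied to pasted event text. The `SIGNALS` table and the `categorize` function are hypothetical names introduced for illustration, and the patterns are a small, non-exhaustive sample of messages that commonly show up in events and logs. It is a sketch of the idea, not something the prompt or AKS ships.

```python
import re

# Illustrative signal table: patterns seen in `kubectl describe pod` events or
# logs, mapped to the prompt's three categories. The patterns are assumptions
# chosen for this sketch, not an exhaustive or official list.
SIGNALS = [
    ("scheduling", r"FailedScheduling|Insufficient (cpu|memory)|untolerated taint"
                   r"|volume node affinity conflict"),
    ("identity", r"ImagePullBackOff|ErrImagePull|401 Unauthorized"
                 r"|unauthorized: authentication required|AADSTS\d+"),
    ("networking", r"no such host|i/o timeout|connection refused"
                   r"|FailedCreatePodSandBox"),
]


def categorize(text: str) -> list[str]:
    """Return every category whose pattern appears in the pasted output.

    More than one match means the output is ambiguous and the deciding
    signal still has to be requested; an empty list means nothing matched.
    """
    return [name for name, pattern in SIGNALS
            if re.search(pattern, text, re.IGNORECASE)]


if __name__ == "__main__":
    print(categorize("0/3 nodes are available: 3 Insufficient cpu."))
    print(categorize("ErrImagePull: dial tcp 20.1.2.3:443: i/o timeout"))
```

An `ErrImagePull` that also carries `i/o timeout` lands in both identity and networking, which is the kind of ambiguity the prompt resolves by asking for the deciding signal rather than guessing.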

The real value is in the Azure-aware detail. Generic Kubernetes advice misses Azure-specific failure modes: CNI IP exhaustion on the node subnet; workload identity, which only works when the service account annotation, the federated credential, and the Entra app (or user-assigned managed identity) all line up; and node-subnet NSGs that block egress. By naming these explicitly, the prompt produces checks that match how AKS fails in production rather than a vanilla upstream Kubernetes checklist.
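
Two of these failure modes reduce to checks that can be written down. The sketch below assumes the traditional Azure CNI model, where each node takes one address for itself plus one per potential pod from the node subnet (Azure CNI Overlay and dynamic IP allocation behave differently). It also assumes the federated credential is passed as a dict shaped like the output of `az identity federated-credential show`. The function names and arguments are hypothetical.

```python
import ipaddress

AZURE_RESERVED_IPS = 5  # Azure reserves five addresses in every subnet
TOKEN_AUDIENCE = "api://AzureADTokenExchange"


def max_nodes(subnet_cidr: str, max_pods: int) -> int:
    """Nodes that fit when each node needs 1 IP plus max_pods pod IPs."""
    total = ipaddress.ip_network(subnet_cidr).num_addresses
    return (total - AZURE_RESERVED_IPS) // (max_pods + 1)


def workload_identity_mismatches(namespace: str, service_account: str,
                                 sa_client_id_annotation: str,
                                 identity_client_id: str,
                                 cluster_oidc_issuer: str,
                                 fed_cred: dict) -> list[str]:
    """List every place where the three pieces fail to line up."""
    problems = []
    # 1. azure.workload.identity/client-id on the service account
    if sa_client_id_annotation != identity_client_id:
        problems.append("service account client-id annotation != identity client ID")
    # 2. The federated credential must name this exact service account
    expected_subject = f"system:serviceaccount:{namespace}:{service_account}"
    if fed_cred.get("subject") != expected_subject:
        problems.append(f"federated credential subject != {expected_subject}")
    # 3. The issuer must be this cluster's OIDC issuer URL, compared exactly
    if fed_cred.get("issuer") != cluster_oidc_issuer:
        problems.append("federated credential issuer != cluster OIDC issuer URL")
    if TOKEN_AUDIENCE not in (fed_cred.get("audiences") or []):
        problems.append(f"federated credential audiences lack {TOKEN_AUDIENCE}")
    return problems


if __name__ == "__main__":
    print(max_nodes("10.240.0.0/24", 30))
```

A /24 node subnet with 30 pods per node fits only 8 nodes under these assumptions, before leaving headroom for the surge nodes an upgrade adds.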

Crucially, it insists on a proof command before any fix, asks for the minimal fix, and attaches a "watch out" note for production. Draining or deleting nodes and changing cluster-wide identity are exactly the operations that turn a one-pod incident into a cluster-wide outage, so these guardrails push toward scoped, reversible changes, ideally verified in a single namespace first.
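
One way to picture the proof-before-fix rule is as a gate that only accepts a read-only command as the proof. The following Python sketch is an illustration under stated assumptions: `is_read_only`, `gate`, and the verb allowlist are hypothetical, the allowlist is deliberately conservative, and anything it does not recognize is treated as mutating.

```python
import shlex

# Verbs treated as read-only in this sketch. The allowlist is an assumption;
# anything not listed here is treated as a mutating command.
READ_ONLY_VERBS = {
    "kubectl": {"get", "describe", "logs", "top", "events"},
    "az": {"show", "list"},
}


def is_read_only(command: str) -> bool:
    """Heuristic: True only for allowlisted kubectl/az read verbs."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in READ_ONLY_VERBS:
        return False
    # Collect the positional words that come before the first flag.
    # A command that puts flags before the verb is rejected (conservative).
    words = []
    for token in tokens[1:]:
        if token.startswith("-"):
            break
        words.append(token)
    if not words:
        return False
    # kubectl puts the verb first (`kubectl get ...`);
    # az puts it last in the command group (`az aks show ...`).
    verb = words[0] if tokens[0] == "kubectl" else words[-1]
    return verb in READ_ONLY_VERBS[tokens[0]]


def gate(proof: str, fix: str) -> list[str]:
    """Order the steps so the proof runs first; reject a mutating 'proof'."""
    if not is_read_only(proof):
        raise ValueError(f"proof command is not read-only: {proof}")
    return [proof, fix]


if __name__ == "__main__":
    print(is_read_only("kubectl get events -n payments --sort-by=.lastTimestamp"))
    print(is_read_only("kubectl drain aks-nodepool1-123 --ignore-daemonsets"))
```

The gate says nothing about whether the fix itself is safe; it only enforces the ordering, so the "watch out" note still has to cover the blast radius of the fix.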
