AI for Kubernetes & Helm

Kubernetes Pod Troubleshooting Prompt

Diagnose any misbehaving pod — pending, evicted, networking-broken, storage-stuck, or just plain slow — with a structured AI walkthrough.

Target user
Kubernetes administrators, SREs, and platform engineers
Difficulty
Intermediate
Tools
Claude, ChatGPT, Cursor

The prompt

You are a senior Kubernetes platform engineer with deep experience running production clusters on EKS, GKE, AKS, and bare-metal k8s.

I will share kubectl output for a pod that's not behaving the way I want. Your job:

1. Identify the pod's current lifecycle state and the **first** thing that's wrong (don't chase symptoms — find the cause).
2. Map the failure to one of these buckets:
   - **Scheduling** (Pending, Unschedulable, taint/toleration mismatch, resource quota)
   - **Image pull** (ImagePullBackOff, ErrImagePull, registry auth)
   - **Startup** (CrashLoopBackOff, OOMKilled, init container failure)
   - **Probe-driven** (liveness/readiness probe killing the pod)
   - **Networking** (pod can't reach a Service, DNS, or external host)
   - **Storage** (PVC unbound, mount failure, ReadWriteOnce conflict)
   - **Eviction** (node pressure, preempted by higher-priority pod)
3. Quote the **specific output line(s)** that support your diagnosis. Don't paraphrase.
4. Suggest the next 2–3 diagnostic commands. Label anything destructive (delete, drain, scale to zero, patch) as **DANGEROUS** with the blast radius.
5. Before suggesting a fix, confirm the root cause with me. Ask follow-up questions if needed.

Pod manifest (or relevant fragment):
```yaml
[PASTE]
```

`kubectl describe pod <name> -n <ns>`:
```
[PASTE]
```

Logs (current + previous container if relevant):
```
[PASTE]
```

Cluster context:
- Kubernetes version: [e.g. 1.32]
- Node type / size: [e.g. EKS t3.xlarge, bare-metal 16-core]
- Namespace ResourceQuota: [if any]
- Recent changes: [deployment, image tag, node pool resize, etc.]

Why this prompt works

Kubernetes failures look identical on the surface (the pod won’t run, or it’s running but doing the wrong thing) but have radically different root causes. This prompt forces a state-machine view of the pod lifecycle and requires the model to quote the actual output lines that support its diagnosis rather than paraphrase them.
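
One way to picture the state-machine view is a lookup from the reason string Kubernetes reports to one of the prompt's buckets. The sketch below is illustrative only: `classify_reason` is a hypothetical helper, and the reason list covers common cases rather than everything the scheduler or kubelet can emit. Networking faults usually carry no dedicated reason, so they fall through to `Unknown`.

```shell
# Hypothetical helper: map a status or event reason to a bucket from the prompt.
# The reasons listed are common ones, not an exhaustive set.
classify_reason() {
  case "$1" in
    Unschedulable|FailedScheduling)                echo "Scheduling" ;;
    ImagePullBackOff|ErrImagePull|InvalidImageName) echo "Image pull" ;;
    CrashLoopBackOff|OOMKilled|Init:*)             echo "Startup" ;;
    Unhealthy)                                     echo "Probe-driven" ;;
    FailedMount|FailedAttachVolume)                echo "Storage" ;;
    Evicted|Preempted)                             echo "Eviction" ;;
    *)                                             echo "Unknown" ;;
  esac
}
```

For example, `classify_reason ErrImagePull` prints `Image pull`. A real diagnosis still needs the surrounding events, because one reason can have several causes.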

How to use it

  1. Always include kubectl describe pod output — it’s where the events list lives, and the events list is where root cause usually hides.
  2. Include the manifest, not screenshots. The model needs to compare requested resources to observed behavior.
  3. For OOMKilled / eviction diagnoses, also paste kubectl top pod output and, if available, the node’s pressure conditions (MemoryPressure, DiskPressure, PIDPressure) from kubectl describe node.
  4. Keep the conversation alive: paste new output as you gather it. Long-context models retain the diagnostic flow.
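
For the node-level pressure check in step 3, a small filter can reduce a node's condition list to the pressure conditions that are currently true. This is a sketch under assumptions: `active_pressure` is a hypothetical helper that expects one `Type=Status` pair per line on stdin.

```shell
# Hypothetical helper: read "Type=Status" lines on stdin and print only the
# pressure conditions whose status is True. Prints nothing if none are active.
active_pressure() {
  grep -E '^(Memory|Disk|PID)Pressure=True$' || true
}
```

One way to feed it is `kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | active_pressure`, where `<node>` is the node the pod was scheduled on.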

What to paste

kubectl get pod <name> -n <ns> -o yaml | head -100
kubectl describe pod <name> -n <ns>
kubectl logs <name> -n <ns> --tail=200
kubectl logs <name> -n <ns> --previous --tail=200 2>/dev/null || true
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -30
kubectl top pod <name> -n <ns> 2>/dev/null || true
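
If you run these often, the commands above can be wrapped in a function that labels each section, so the pasted output is easier for the model to navigate. This is only a sketch: `collect_pod_diag` and the `##` headings are made-up names, and the kubectl invocations simply mirror the list above.

```shell
# Sketch: gather the outputs listed above under labeled headings.
# Usage: collect_pod_diag <pod-name> <namespace>
collect_pod_diag() {
  pod=$1
  ns=$2
  echo "## manifest"
  kubectl get pod "$pod" -n "$ns" -o yaml | head -100
  echo "## describe"
  kubectl describe pod "$pod" -n "$ns"
  echo "## logs (current)"
  kubectl logs "$pod" -n "$ns" --tail=200
  echo "## logs (previous)"
  # No previous container exists until the first restart, so ignore errors.
  kubectl logs "$pod" -n "$ns" --previous --tail=200 2>/dev/null || true
  echo "## events"
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -30
  echo "## top"
  # kubectl top needs metrics-server; skip quietly if it is unavailable.
  kubectl top pod "$pod" -n "$ns" 2>/dev/null || true
}
```

A call such as `collect_pod_diag web-0 prod > pod-diag.txt` (placeholder pod and namespace) writes everything to one file that can be pasted into the prompt.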

Common patterns this catches

  • Pod Pending forever → almost always a scheduling failure. Check Events: for “0/N nodes are available.”
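  • CrashLoopBackOff → check the --previous logs first; they come from the container instance that crashed, while the current logs cover only the newest restart attempt.
  • Pod running but no traffic → often a failing readiness probe: the pod shows READY 0/1 and is dropped from the Service’s endpoints without being restarted. Check kubectl describe for “Readiness probe failed” events.
  • Container exits 137 → the process received SIGKILL (128 + 9), most often because it was OOMKilled. Confirm that Last State shows Reason: OOMKilled in kubectl describe, then either raise the memory limit or fix the leak.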
  • Error: ImagePullBackOff → image name typo, missing registry secret, or rate limit.
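
Exit codes above 128 conventionally encode the signal that ended the process (the code minus 128), which is how 137 maps to signal 9, SIGKILL. The following is a minimal sketch of that arithmetic; `explain_exit_code` is a hypothetical helper, not a kubectl feature.

```shell
# Hypothetical helper: interpret a container exit code.
# Codes above 128 mean "terminated by signal (code - 128)",
# e.g. 137 is signal 9 (SIGKILL) and 143 is signal 15 (SIGTERM).
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: completed normally"
  elif [ "$code" -gt 128 ]; then
    echo "exit $code: terminated by signal $((code - 128))"
  else
    echo "exit $code: application-defined error"
  fi
}
```

The exit code of the last terminated container can be read with `kubectl get pod <name> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`, assuming the first container in the pod is the one of interest.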
