Debugging CrashLoopBackOff and Pending Pods Faster With AI
CrashLoopBackOff and Pending are the two failure states every Kubernetes operator hits weekly. Here's a systematic way to debug both, with AI handling the tedious log reading.
- #kubernetes
- #troubleshooting
- #ai
- #pods
- #debugging
- #sre
In twenty-five years of running production systems, two pod states have eaten more of my afternoons than any others: CrashLoopBackOff and Pending. They look scary, they’re vague, and the actual cause is almost always one of a small handful of things. The trick is having a checklist — and letting AI read the wall of logs while you think.
This is the systematic approach I use, plus where an AI assistant genuinely saves time.
First: read the state, don’t guess
Before you touch anything, get the facts. Ninety percent of the diagnosis is in three commands:
kubectl get pod my-pod -o wide
kubectl describe pod my-pod
kubectl logs my-pod --previous
describe gives you the events and the container state. logs --previous gives you the output of the container that just died, which is the one you actually care about in a crash loop. People forget --previous constantly and end up reading the logs of the freshly restarted container, which are usually empty or cut short.
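If you run these three often, a small wrapper keeps the output together in one stream. This is a minimal sketch, assuming a POSIX shell with kubectl on the PATH; the function name collect_pod_facts is my own and not part of kubectl.

```shell
# Hypothetical helper: print the three read-only views of a pod in one stream.
# Usage: collect_pod_facts <pod> [namespace]
collect_pod_facts() (
  pod="$1"
  ns="${2:-default}"
  echo "=== get ==="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "=== describe ==="
  kubectl describe pod "$pod" -n "$ns"
  echo "=== logs (previous container) ==="
  # --previous errors out if the container has never restarted,
  # so fall back to the current container's logs in that case.
  kubectl logs "$pod" -n "$ns" --previous 2>/dev/null || kubectl logs "$pod" -n "$ns"
)
```

Something like collect_pod_facts my-pod > facts.txt then gives you one file to read or hand to an assistant.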
CrashLoopBackOff: the container starts and dies
CrashLoopBackOff means the kubelet started your container, it exited, and Kubernetes is backing off before retrying. The pod isn’t broken — your process is exiting. Look at the exit code in describe:
- Exit 0: your process ran and finished. Usually a missing long-running command, or an entrypoint that isn’t actually a server.
- Exit 1 / 2: application error. Read the logs.
- Exit 137: killed with SIGKILL (128 + 9), most often OOMKilled because the container hit its memory limit. Confirm that Reason under Last State in describe says OOMKilled.
- Exit 139: segfault (SIGSEGV, 128 + 11).
- Exit 143: SIGTERM (128 + 15), often a failing liveness probe killing the container.
That last one is the sneaky one. A liveness probe that’s too aggressive will kill a healthy-but-slow-starting app forever. Check your initialDelaySeconds and failureThreshold.
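The exit-code list above boils down to a lookup plus one rule: codes above 128 are 128 plus the signal number. Here is a minimal sketch of that lookup, assuming a POSIX shell; explain_exit_code is a hypothetical name, and the wording of each message is mine.

```shell
# Hypothetical helper: map a container exit code to the reading in the list above.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly: process finished, likely not a long-running server" ;;
    137) echo "SIGKILL (128+9): check Last State reason for OOMKilled" ;;
    139) echo "SIGSEGV (128+11): segfault" ;;
    143) echo "SIGTERM (128+15): check liveness probe and termination events" ;;
    *)
      # Non-numeric input makes the test fail quietly and lands in the else branch.
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application error: read the --previous logs"
      fi ;;
  esac
}

# The code itself can be read without scanning describe output:
#   kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

The index [0] assumes a single-container pod; with sidecars you would pick the container by name instead.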
Where AI earns its keep
Paste the describe output and the --previous logs into your assistant and ask:
“This pod is in CrashLoopBackOff. Here’s the describe output and the previous container logs. What’s the exit code, what’s the most likely cause, and what’s the single read-only command to confirm it?”
The model is fast at correlating a stack trace with an exit code and a probe config — the boring cross-referencing you’d do by hand. Keep a library of these Kubernetes prompts so you’re not authoring them mid-incident.
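One way to keep that prompt out of your head mid-incident is to store it as a shell function that wraps whatever diagnostics you pipe in. This is only a sketch: crashloop_prompt is a hypothetical name, and how the resulting text reaches your assistant is up to you.

```shell
# Hypothetical helper: prefix diagnostic text from stdin with the question above.
crashloop_prompt() {
  cat <<'EOF'
This pod is in CrashLoopBackOff. Here's the describe output and the previous
container logs. What's the exit code, what's the most likely cause, and what's
the single read-only command to confirm it?
EOF
  echo
  echo "--- diagnostics ---"
  cat   # pass stdin through unchanged
}

# Example use (needs a cluster, so it is left as a comment):
#   { kubectl describe pod my-pod; kubectl logs my-pod --previous; } | crashloop_prompt > prompt.txt
```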
Pending: the pod can’t be scheduled
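Pending usually means the scheduler can’t place the pod on any node; a pod that has been scheduled but is still pulling its images also reports the Pending phase. Either way, the reason is in the events: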
kubectl describe pod my-pod | grep -A10 Events
Common causes, in rough order of frequency:
1. Insufficient resources
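0/5 nodes are available: insufficient cpu. Your requests are bigger than the unreserved capacity of any node, meaning its allocatable resources minus what the pods already running there have requested. Check what you asked for: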
kubectl get pod my-pod -o jsonpath='{.spec.containers[*].resources}'
kubectl top nodes
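Either lower the request or add capacity. Keep in mind that kubectl top nodes reports live usage and needs metrics-server, while the scheduler decides on requests; kubectl describe node lists those under Allocated resources. A surprising number of “the cluster is broken” tickets are a request typo: 2 (two full cores) where 200m (a fifth of a core) was meant.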
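To make the size of that typo concrete, it helps to put both values in the same unit. A minimal sketch, assuming a POSIX shell with awk; cpu_to_millicores is a hypothetical helper and only handles plain numbers and the m suffix.

```shell
# Hypothetical helper: convert a Kubernetes CPU quantity to millicores.
# Handles "200m" (already millicores) and plain or decimal cores ("2", "0.5").
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
  esac
}

cpu_to_millicores 2      # 2000
cpu_to_millicores 200m   # 200
```

So a request of 2 asks for ten times what 200m does, which is often more than a small node has left.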
2. Node selectors and affinity
node(s) didn't match node selector. Your pod demands a label no node has. Check nodeSelector and affinity against your actual node labels with kubectl get nodes --show-labels.
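Rather than eyeballing the full label dump, you can ask directly how many nodes carry the label the pod wants. A minimal sketch; nodes_with_label is a hypothetical name and disktype=ssd is an example label, not one from any real cluster.

```shell
# Hypothetical helper: count the nodes that carry a given label.
# Usage: nodes_with_label disktype=ssd
nodes_with_label() {
  kubectl get nodes -l "$1" -o name | wc -l | tr -d ' '
}

# The selector the pod actually asks for:
#   kubectl get pod my-pod -o jsonpath='{.spec.nodeSelector}'
```

A result of 0 confirms the hypothesis without changing anything.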
3. Taints with no toleration
Control-plane nodes and GPU nodes are commonly tainted. If every available node is tainted and your pod tolerates none of them, it stays Pending forever.
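To check this branch, put the taints on each node next to the tolerations on the pod. The sketch below assumes a POSIX shell; show_taints_and_tolerations is a hypothetical wrapper around two standard kubectl queries.

```shell
# Hypothetical helper: list each node's taint keys, then the pod's tolerations.
# Usage: show_taints_and_tolerations <pod>
show_taints_and_tolerations() {
  kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
  echo "--- tolerations on $1 ---"
  kubectl get pod "$1" -o jsonpath='{.spec.tolerations}'
  echo
}
```

If every schedulable node shows a taint key that is missing from the pod's tolerations, you have your cause.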
4. PVC not bound
pod has unbound immediate PersistentVolumeClaims. The pod is waiting on storage that doesn’t exist. Check kubectl get pvc — if it’s also Pending, your StorageClass or provisioner is the real problem.
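The four causes above can be told apart from the scheduler message alone. Here is a rough sketch of that classification, assuming a POSIX shell; classify_pending is a hypothetical name, and the patterns are loose keyword matches rather than an exhaustive list of scheduler messages.

```shell
# Hypothetical helper: map a FailedScheduling message to one of the four causes above.
# The first matching keyword wins, so a message listing several reasons
# is reported under whichever appears earliest in this case statement.
classify_pending() (
  msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$msg" in
    *insufficient*)                     echo "resources" ;;
    *unbound*persistentvolumeclaim*)    echo "pvc" ;;
    *taint*)                            echo "taint" ;;
    *"node selector"*|*affinity*)       echo "selector-or-affinity" ;;
    *)                                  echo "unknown" ;;
  esac
)

# Example use (needs a cluster, so it is left as a comment):
#   classify_pending "$(kubectl describe pod my-pod | grep FailedScheduling | tail -n 1)"
```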
A debugging loop that actually converges
Here’s the loop I run, every time, regardless of which state I’m in:
1. Get the events. describe first, always.
2. Classify the failure. Crash (process dies) or schedule (can’t place)?
3. Read the right logs. --previous for crashes.
4. Form one hypothesis. Not five. The most likely one from the events.
5. Confirm read-only. Run the safest command that proves or disproves it.
6. Fix, then watch. kubectl get pod -w until it’s Running and Ready.
The discipline is stopping at step 4 with one hypothesis instead of changing three things at once. When you change three things and it works, you’ve learned nothing and you’ll be back next week.
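The classify step of the loop can be sketched as a function of two fields from the pod status: the phase and the container's waiting reason. This is an illustration under my own assumptions; classify_failure is a hypothetical name and the hint text is mine.

```shell
# Hypothetical helper: pick the branch from the pod phase ($1) and the
# container's waiting reason ($2, empty if the container is not waiting).
# The reason is checked first because a crash-looping pod usually still
# reports phase Running, and an image-pull failure reports phase Pending.
classify_failure() {
  case "$2" in
    CrashLoopBackOff)              echo "crash: read logs --previous"; return ;;
    ImagePullBackOff|ErrImagePull) echo "image: check name, tag, imagePullSecrets"; return ;;
  esac
  case "$1" in
    Pending) echo "schedule: read FailedScheduling events" ;;
    Running) echo "running: check readiness and endpoints" ;;
    *)       echo "other: start with describe" ;;
  esac
}

# Example use (needs a cluster, so it is left as a comment):
#   phase="$(kubectl get pod my-pod -o jsonpath='{.status.phase}')"
#   reason="$(kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')"
#   classify_failure "$phase" "$reason"
```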
Don’t skip the probes and the image
Two final culprits worth their own mention because they masquerade as other problems:
- ImagePullBackOff looks like a crash but isn’t: it’s a bad image name, a private registry without imagePullSecrets, or a deleted tag. describe says so plainly.
- Readiness probe failures keep a pod out of the Service endpoints even when it’s “Running.” If traffic isn’t reaching a running pod, check kubectl get endpoints and the readiness probe before you blame the network.
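The readiness case is easy to confirm read-only: compare the pod's IP with the addresses the Service currently lists. A minimal sketch, assuming a POSIX shell; pod_in_endpoints is a hypothetical name, and my-svc below stands in for whatever Service fronts the pod.

```shell
# Hypothetical helper: report whether a pod's IP is among a Service's ready endpoints.
# Usage: pod_in_endpoints <pod> <service>
pod_in_endpoints() (
  pod_ip="$(kubectl get pod "$1" -o jsonpath='{.status.podIP}')"
  if [ -z "$pod_ip" ]; then
    echo "pod has no IP yet"
    exit 0
  fi
  # Only ready addresses appear under .subsets[*].addresses.
  ep_ips="$(kubectl get endpoints "$2" -o jsonpath='{.subsets[*].addresses[*].ip}')"
  case " $ep_ips " in
    *" $pod_ip "*) echo "in endpoints" ;;
    *)             echo "not in endpoints: check the readiness probe" ;;
  esac
)

# Example use (needs a cluster, so it is left as a comment):
#   pod_in_endpoints my-pod my-svc
```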
Reviewing the fix before it ships
Once you’ve found the cause, the fix is usually a manifest change — a corrected resource request, a relaxed probe, an added toleration. Before that goes to main, run it through a review pass. I push manifest changes through our Code Review tool to catch the second-order problems: a memory limit raised without raising the request, a probe loosened so far it never restarts a wedged pod.
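Before a human or a tool reviews the change, it is cheap to let the API server look at it first. The sketch below is one possible pre-review check, not a description of any particular review tool; preview_manifest is a hypothetical name and fix.yaml is a placeholder file.

```shell
# Hypothetical pre-review check for a changed manifest.
# Usage: preview_manifest fix.yaml
preview_manifest() {
  # Server-side dry run: the API server validates the object without persisting it.
  kubectl apply --dry-run=server -f "$1" || return 1
  # kubectl diff exits 0 when nothing would change, 1 when there are
  # differences, and greater than 1 when kubectl or diff itself failed.
  rc=0
  kubectl diff -f "$1" || rc=$?
  [ "$rc" -le 1 ]
}
```

The diff output is also a compact thing to attach to the review, since it shows exactly which fields the fix touches.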
CrashLoopBackOff and Pending stop being scary once you treat them as a two-branch decision tree. Get the events, classify, read the right logs, change one thing. Let AI read the firehose — you keep the hypothesis.
AI-generated diagnoses are assistive. Always confirm against your own cluster before applying changes.