Debugging CrashLoopBackOff and Pending Pods Faster With AI

In twenty-five years of running production systems, two pod states have eaten more of my afternoons than any others: CrashLoopBackOff and Pending. They look scary, they’re vague, and the actual cause is almost always one of a small handful of things. The trick is having a checklist — and letting AI read the wall of logs while you think.

This is the systematic approach I use, plus where an AI assistant genuinely saves time.

First: read the state, don’t guess

Before you touch anything, get the facts. Ninety percent of the diagnosis is in three commands:

kubectl get pod my-pod -o wide
kubectl describe pod my-pod
kubectl logs my-pod --previous

describe gives you the events and the container state. logs --previous gives you the output of the container instance that just died, which is the one you actually care about in a crash loop. People forget --previous constantly and end up reading the near-empty logs of the freshly restarted container instead.

CrashLoopBackOff: the container starts and dies

CrashLoopBackOff means the kubelet started your container, it exited, and Kubernetes is backing off before retrying. The pod isn’t broken — your process is exiting. Look at the exit code in describe:

Exit 0: your process ran and finished. Usually a missing long-running command, or an entrypoint that isn’t actually a server.
Exit 1 / 2: application error. Read the logs.
Exit 137: SIGKILL (128 + 9). Most often OOMKilled because the container hit its memory limit, but a container that ignores SIGTERM past its grace period ends up here too. Check the Reason under Last State in describe.
Exit 139: segfault (SIGSEGV, 128 + 11).
Exit 143: SIGTERM (128 + 15), often a failing liveness probe killing the container.
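The exit codes above fit in a small lookup, which is handy when you want the same reading of a code every time. This is a minimal sketch for a POSIX shell; explain_exit_code is a hypothetical helper name of mine, not a kubectl feature, and the jsonpath in the trailing comment assumes you care about the first container in the pod.

```shell
# explain_exit_code is a hypothetical helper, not part of kubectl.
# By convention, codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly: check the command/entrypoint" ;;
    1|2) echo "application error: read the previous logs" ;;
    137) echo "SIGKILL (128+9): check whether the reason is OOMKilled" ;;
    139) echo "SIGSEGV (128+11): segfault" ;;
    143) echo "SIGTERM (128+15): check the liveness probe" ;;
    *)
      # Non-numeric input fails the test and falls through to the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $code - 128 ))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}

# Against a live cluster (my-pod is a placeholder):
#   kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

Feed it the number from describe or from the jsonpath query, for example explain_exit_code 137.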

That last one is the sneaky one. A liveness probe that’s too aggressive will kill a healthy-but-slow-starting app forever. Check your initialDelaySeconds, periodSeconds, and failureThreshold, or give the app a startupProbe so liveness checks only begin once it has started.
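To see whether a probe is too aggressive, it helps to work out roughly how long the container gets before its first restart. The sketch below is back-of-envelope arithmetic under a simplified model: the first probe fires after initialDelaySeconds, one probe follows every periodSeconds, and the restart comes after failureThreshold consecutive failures. It ignores probe timeouts and kubelet jitter, and liveness_budget_seconds is a hypothetical helper name.

```shell
# liveness_budget_seconds is a hypothetical helper for rough arithmetic only.
# Args: initialDelaySeconds periodSeconds failureThreshold
liveness_budget_seconds() {
  # The Nth consecutive failure lands (N - 1) periods after the first probe.
  echo $(( $1 + $2 * ($3 - 1) ))
}

# Kubernetes defaults: initialDelaySeconds=0, periodSeconds=10, failureThreshold=3.
liveness_budget_seconds 0 10 3
# A more forgiving probe for a slow starter (values are illustrative).
liveness_budget_seconds 30 10 6
```

If the app needs longer to start than the number this prints, the probe will keep restarting it no matter how healthy the code is.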

Where AI earns its keep

Paste the describe output and the --previous logs into your assistant and ask:

“This pod is in CrashLoopBackOff. Here’s the describe output and the previous container logs. What’s the exit code, what’s the most likely cause, and what’s the single read-only command to confirm it?”

The model is fast at correlating a stack trace with an exit code and a probe config — the boring cross-referencing you’d do by hand. Keep a library of these Kubernetes prompts so you’re not authoring them mid-incident.
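Gathering the material for that prompt is easy to script so you’re not copying from a scrolling terminal. This is a minimal sketch: collect_crash_context is a hypothetical function name, the 200-line tail is an arbitrary choice, and it only runs read-only kubectl commands.

```shell
# collect_crash_context is a hypothetical helper; it only reads from the cluster.
# Usage: collect_crash_context <pod> [namespace] > crash-context.txt
collect_crash_context() {
  pod="$1"
  ns="${2:-default}"
  echo "### kubectl describe pod $pod"
  kubectl describe pod "$pod" -n "$ns"
  echo
  echo "### kubectl logs $pod --previous (last 200 lines)"
  # --previous fails when the container has never restarted; note that and carry on.
  kubectl logs "$pod" -n "$ns" --previous --tail=200 2>&1 || echo "(no previous container logs)"
}
```

Redirect the output to a file and paste that file under the prompt.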

Pending: the pod can’t be scheduled

Pending usually means the scheduler can’t place the pod on any node. The reason is almost always in the events:

kubectl describe pod my-pod | grep -A10 Events

Common causes, in rough order of frequency:

1. Insufficient resources

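0/5 nodes are available: 5 Insufficient cpu. Your requests are bigger than what any node has left to hand out. The scheduler compares requests against each node’s allocatable capacity minus what other pods have already requested, not against live usage, so look at reservations rather than kubectl top. Check what you asked for and what is already reserved:

kubectl get pod my-pod -o jsonpath='{.spec.containers[*].resources}'
kubectl describe nodes | grep -A8 'Allocated resources'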

Either lower the request or add capacity. A surprising number of “the cluster is broken” tickets are a request typo — 2 CPU instead of 200m.
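The size of that typo is easier to see once both values are in the same unit. The sketch below converts a CPU quantity to millicores; cpu_to_millicores is a hypothetical helper and deliberately handles only whole cores ("2") and millicores ("200m"), not decimals such as "0.5".

```shell
# cpu_to_millicores is a hypothetical helper.
# Supports whole cores ("2") and millicores ("200m") only.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;          # already millicores: strip the suffix
    *)  echo $(( $1 * 1000 )) ;;  # whole cores: 1 core = 1000m
  esac
}

cpu_to_millicores 200m   # the intended request
cpu_to_millicores 2      # the typo: ten times larger
```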

2. Node selectors and affinity

node(s) didn't match node selector. Your pod demands a label no node has. Check nodeSelector and affinity against your actual node labels with kubectl get nodes --show-labels.
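A quick read-only way to confirm this is to count the nodes that actually carry the label the pod asks for. This is a sketch: nodes_matching is a hypothetical wrapper, and disktype=ssd is a placeholder label, not something from your cluster.

```shell
# nodes_matching is a hypothetical wrapper around a label-selector query.
# Prints how many nodes carry the given label (0 means the pod can never schedule).
nodes_matching() {
  kubectl get nodes -l "$1" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Against a live cluster (names are placeholders):
#   kubectl get pod my-pod -o jsonpath='{.spec.nodeSelector}'
#   nodes_matching disktype=ssd
```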

3. Taints with no toleration

Control-plane nodes and GPU nodes are commonly tainted. If every available node is tainted and your pod tolerates none of them, it stays Pending forever.
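To check this, put the node taints and the pod’s tolerations side by side. The function names below are hypothetical wrappers of mine; the kubectl flags and field paths they use are standard.

```shell
# Both helpers are hypothetical wrappers around read-only kubectl queries.

# One row per node, with the keys of any taints it carries.
list_node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# The tolerations declared on a pod (my-pod is a placeholder when you call it).
show_pod_tolerations() {
  kubectl get pod "$1" -o jsonpath='{.spec.tolerations}'
}
```

If every node lists a taint key that doesn’t appear in the pod’s tolerations, you’ve found the cause.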

4. PVC not bound

pod has unbound immediate PersistentVolumeClaims. The pod is waiting on a claim that hasn’t been bound to a volume yet. Check kubectl get pvc: if the claim is also Pending, your StorageClass or provisioner is the real problem.
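The claim’s phase tells you which direction to look next. A small sketch follows; explain_pvc_phase is a hypothetical helper, and my-claim in the comment is a placeholder.

```shell
# explain_pvc_phase is a hypothetical helper mapping a PVC phase to a next step.
explain_pvc_phase() {
  case "$1" in
    Bound)   echo "claim is bound: storage is not what blocks the pod" ;;
    Pending) echo "claim is unbound: check the StorageClass and provisioner events" ;;
    Lost)    echo "claim lost its volume: the backing PersistentVolume is gone" ;;
    *)       echo "unknown phase: run kubectl describe pvc" ;;
  esac
}

# Against a live cluster:
#   explain_pvc_phase "$(kubectl get pvc my-claim -o jsonpath='{.status.phase}')"
#   kubectl describe pvc my-claim
#   kubectl get storageclass
```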

A debugging loop that actually converges

Here’s the loop I run, every time, regardless of which state I’m in:

1. Get the events. describe first, always.
2. Classify the failure. Crash (process dies) or schedule (can’t place)?
3. Read the right logs. --previous for crashes.
4. Form one hypothesis. Not five. The most likely one from the events.
5. Confirm read-only. Run the safest command that proves or disproves it.
6. Fix, then watch. kubectl get pod -w until it’s Running and Ready.

The discipline is stopping at step 4 with one hypothesis instead of changing three things at once. When you change three things and it works, you’ve learned nothing and you’ll be back next week.
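The classify step is mechanical enough to write down. The sketch below takes the pod phase and the container’s waiting reason and names the branch to follow; classify_pod is a hypothetical helper, and the branch wording is mine.

```shell
# classify_pod is a hypothetical helper for the "classify the failure" step.
# $1 = pod phase (.status.phase), $2 = container waiting reason (may be empty)
classify_pod() {
  phase="$1"
  reason="$2"
  # The waiting reason is more specific than the phase, so check it first.
  case "$reason" in
    CrashLoopBackOff)
      echo "crash: read logs --previous and the exit code"; return ;;
    ImagePullBackOff|ErrImagePull)
      echo "image: check image name, tag and imagePullSecrets"; return ;;
  esac
  case "$phase" in
    Pending) echo "schedule: read the scheduling events in describe" ;;
    Running) echo "running: check readiness and endpoints" ;;
    *)       echo "unclassified: start from describe" ;;
  esac
}

# Against a live cluster (my-pod is a placeholder, first container assumed):
#   phase="$(kubectl get pod my-pod -o jsonpath='{.status.phase}')"
#   reason="$(kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')"
#   classify_pod "$phase" "$reason"
```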

Don’t skip the probes and the image

Two final culprits worth their own mention because they masquerade as other problems:

ImagePullBackOff looks like a crash but isn’t — it’s a bad image name, a private registry without imagePullSecrets, or a deleted tag. describe says so plainly.
Readiness probe failures keep a pod out of the Service endpoints even when it’s “Running.” If traffic isn’t reaching a running pod, check kubectl get endpoints and the readiness probe before you blame the network.
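For the readiness case, two read-only queries settle it: is the pod’s Ready condition True, and does the Service list any endpoints? A sketch with hypothetical helper names (pod_is_ready, service_endpoints) and placeholder resource names:

```shell
# Both helpers are hypothetical wrappers around read-only kubectl queries.

# Succeeds only when the pod's Ready condition is True.
pod_is_ready() {
  status="$(kubectl get pod "$1" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}

# Lists the addresses behind a Service; an empty ENDPOINTS column means
# no pod is currently passing readiness.
service_endpoints() {
  kubectl get endpoints "$1"
}

# Against a live cluster (names are placeholders):
#   pod_is_ready my-pod || echo "Running but not Ready: check the readiness probe"
#   service_endpoints my-svc
```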

Reviewing the fix before it ships

Once you’ve found the cause, the fix is usually a manifest change — a corrected resource request, a relaxed probe, an added toleration. Before that goes to main, run it through a review pass. I push manifest changes through our Code Review tool to catch the second-order problems: a memory limit raised without raising the request, a probe loosened so far it never restarts a wedged pod.
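Alongside a human or tool review, the cluster itself can preview a manifest change without applying it. This sketch is an addition of mine, not part of any review tool: preview_manifest is a hypothetical helper, and the file name in the comment is a placeholder.

```shell
# preview_manifest is a hypothetical helper; neither command changes the cluster.
preview_manifest() {
  # Server-side dry run: the API server validates the manifest without persisting it.
  kubectl apply --dry-run=server -f "$1" || return 1
  # Show what would change relative to the live objects.
  kubectl diff -f "$1"
  rc=$?
  # kubectl diff exits 0 for "no differences" and 1 for "differences found";
  # anything above 1 is a real error.
  [ "$rc" -le 1 ]
}

# Usage:
#   preview_manifest deployment.yaml
```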

CrashLoopBackOff and Pending stop being scary once you treat them as a two-branch decision tree. Get the events, classify, read the right logs, change one thing. Let AI read the firehose — you keep the hypothesis.

AI-generated diagnoses are assistive. Always confirm against your own cluster before applying changes.