AI for Kubernetes & Helm · By James Joyner IV · 16 min read

How to Use AI to Troubleshoot Kubernetes Clusters Faster

A copy-paste workflow to troubleshoot Kubernetes clusters faster with AI: capture commands, prompts, and example answers for CrashLoopBackOff, OOMKilled, and more.

  • #kubernetes
  • #ai
  • #troubleshooting
  • #k8s
  • #sre


To use AI to troubleshoot a Kubernetes cluster faster, run a deterministic capture step first — collect the pod’s status, describe, recent events, and logs with kubectl — then paste that focused, secret-redacted evidence into an AI assistant and ask it for ranked hypotheses plus the single next command to run. The AI does the pattern-matching and reading you’d normally do by hand; you verify each hypothesis with a real command before changing anything in the cluster. The speed comes from the loop, not from magic: capture, analyze, verify, act. The model never touches your cluster, never holds your kubeconfig, and never gets the last word.

I have spent enough nights paging through kubectl describe output to know where the time actually goes. It is not the fixing. It is the reading — scrolling events, correlating timestamps, remembering which exit code means what. That is exactly the work an AI assistant is good at, and it is the part of an incident I now delegate first. This guide is the practical version: the exact capture commands, the prompts I paste, and what good answers look like for the failure modes you actually hit. If you want the conceptual background on how these tools work under the hood, the deeper dive in AI-assisted Kubernetes troubleshooting explained covers the architecture; this article is the runbook.

What cluster data should you give the AI?

An AI assistant is only as good as the evidence you hand it. Give it too little and it guesses; give it a raw firehose and it drowns. The sweet spot is a small, deterministic bundle that captures the failing object’s state, its recent history, and its logs.

For a single failing pod, this is my standard capture:

NS=production
POD=checkout-api-7d9f8c6b5-x2k4p

kubectl -n "$NS" get pod "$POD" -o wide
kubectl -n "$NS" describe pod "$POD"
kubectl -n "$NS" get events --sort-by=.lastTimestamp | tail -25
kubectl -n "$NS" logs "$POD" --all-containers --tail=100
kubectl -n "$NS" logs "$POD" --all-containers --previous --tail=100

The --previous logs matter more than anything else for a crashing pod, because the current container may be a fresh restart with nothing useful in it — the evidence is in the container that already died. The events line catches scheduling, image-pull, and probe failures that never make it into application logs at all.

Hand the AI those five outputs and it has roughly the same picture you’d build manually in ten minutes of scrolling. The difference is it reads all of it at once and cross-references the exit code against the events against the log tail without losing its place.
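The five captures above can be wrapped in one function so every incident produces the same labelled bundle. This is a minimal sketch, not part of any tool: the names capture_pod and section are made up for illustration, and it assumes kubectl is already pointed at the right cluster.

```shell
# section LABEL: print a divider so each capture is easy to spot in the paste.
section() {
  printf '\n--- %s ---\n' "$1"
}

# capture_pod NAMESPACE POD: run the five read-only captures in a fixed order.
# Hypothetical helper name; every kubectl call is a get/describe/logs read.
capture_pod() {
  ns=$1
  pod=$2

  section "kubectl get pod -o wide"
  kubectl -n "$ns" get pod "$pod" -o wide

  section "kubectl describe pod"
  kubectl -n "$ns" describe pod "$pod"

  section "kubectl get events (last 25)"
  kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -n 25

  section "kubectl logs --tail=100"
  kubectl -n "$ns" logs "$pod" --all-containers --tail=100

  section "kubectl logs --previous --tail=100"
  # --previous errors out when no container has restarted yet; record the
  # error text and carry on instead of aborting the bundle.
  kubectl -n "$ns" logs "$pod" --all-containers --previous --tail=100 2>&1 || true
}
```

Calling capture_pod "$NS" "$POD" > evidence.txt (evidence.txt is just an example file name) gives you one file to review and redact before anything goes into a chat window.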

How do you build the troubleshooting loop?

The workflow that actually saves time is a four-step loop, and the discipline is keeping the model on the analysis side of it:

  1. Capture — run deterministic kubectl read commands. No AI here.
  2. Analyze — paste the output, ask for ranked hypotheses and the single most useful next command.
  3. Verify — you run that command and paste the result back.
  4. Act — once a hypothesis is confirmed, you make the change. The model proposes; you decide.

The reason this beats just asking “why is my pod broken” is that each loop narrows the search. The model isn’t trying to solve the whole incident in one shot; it’s playing twenty questions with the cluster, and you’re the one holding the cluster. Two or three rounds usually land on the root cause. If you want to anchor each round in a consistent prompt, browse the reusable prompt library for troubleshooting starters, or grab the curated Kubernetes Prompt Pack, which bundles capture-and-diagnose prompts for exactly these scenarios.

How do you debug CrashLoopBackOff with AI?

CrashLoopBackOff is not a cause — it’s Kubernetes telling you a container keeps exiting and it’s backing off the restarts. The cause is in the exit code and the previous logs.

Capture:

kubectl -n production describe pod checkout-api-7d9f8c6b5-x2k4p \
  | grep -A5 "Last State"
kubectl -n production logs checkout-api-7d9f8c6b5-x2k4p --previous --tail=80

Prompt I paste:

This pod is in CrashLoopBackOff. Below is the "Last State" block from
kubectl describe and the --previous container logs. Give me the 3 most
likely root causes ranked by probability, and the single next kubectl
command to confirm the top one. Read-only commands only.

[paste describe Last State block]
[paste --previous logs]

Example AI answer:

The Last State shows Terminated with Reason: Error and Exit Code: 1, and the final log line is dial tcp 10.96.0.42:5432: connect: connection refused. Ranked causes: (1) the app cannot reach Postgres at startup and exits — most likely; (2) a failing migration on boot; (3) a missing/bad DATABASE_URL. To confirm (1), check whether the database Service has endpoints: kubectl -n production get endpoints postgres. If that returns <none>, the database, not your app, is the problem.

That is the whole point — exit code 1 plus a connection-refused log line is a thirty-second read for the model, and it immediately redirects you from “my app is broken” to “the dependency it needs isn’t there.” You run the suggested endpoints check, and the loop continues.
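If grepping describe output feels brittle, the same Last State facts can be read straight from the pod status. A small sketch, assuming the pod has restarted at least once; last_state is a hypothetical helper name, while the jsonpath fields are the standard containerStatuses fields of the Pod API.

```shell
# last_state NAMESPACE POD: print one tab-separated line per container with
# its name, last termination reason, and last exit code.
# Containers that have never terminated print empty reason/exit-code fields.
last_state() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Running last_state production checkout-api-7d9f8c6b5-x2k4p yields a compact line per container that pastes cleanly alongside the --previous logs.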

How do you find an OOMKilled container?

OOMKilled is the cleanest signal in Kubernetes because the kernel left a fingerprint. The trick is that the application logs almost never mention it — the process was killed mid-stride, so it had no chance to log a graceful death.

Capture:

kubectl -n production describe pod report-worker-66c4d9-abcde \
  | grep -A3 "Last State"
kubectl -n production get pod report-worker-66c4d9-abcde \
  -o jsonpath='{.spec.containers[*].resources}'

The describe output shows Reason: OOMKilled and an exit code of 137 (128 + signal 9, SIGKILL). Paste that block together with the resources block and ask:

This container was OOMKilled (exit 137). Here is its resources block.
Is the memory limit too low, or does this look like a leak? What
should I check next and what would you set the limit to?

[paste Last State + resources]

Example AI answer:

Exit 137 with Reason: OOMKilled confirms the container was killed for exceeding its memory limit. Your limits.memory is 256Mi and requests.memory is 128Mi. If the pod crashes shortly after startup under normal load, the limit is simply too low; bump it and observe. If it runs fine for hours then OOMs, suspect a leak. Run kubectl -n production top pod report-worker-66c4d9-abcde --containers over a few minutes to see whether memory climbs steadily. As a starting point, set the limit to 512Mi and raise requests to 256Mi, then watch the trend.

The model can’t tell a leak from an undersized limit from a single snapshot — and a good answer says so and tells you which observation distinguishes them. That honesty is what you want; treat any AI that confidently picks one without that caveat as guessing.
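The 128 + signal arithmetic mentioned above is easy to script for a quick sanity check before you paste anything. This is an illustrative sketch only; explain_exit_code is a made-up name, and the hints are deliberately coarse. Note that 137 on its own only says SIGKILL, and it is the Reason: OOMKilled field that ties it to memory.

```shell
# explain_exit_code CODE: describe a container exit code.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case $sig in
      9)  echo "exit $code: signal 9 (SIGKILL), check Last State for Reason: OOMKilled" ;;
      11) echo "exit $code: signal 11 (SIGSEGV)" ;;
      15) echo "exit $code: signal 15 (SIGTERM)" ;;
      *)  echo "exit $code: signal $sig" ;;
    esac
  else
    echo "exit $code: application error, read the --previous logs"
  fi
}
```

For example, explain_exit_code 137 points you at the OOMKilled check, while explain_exit_code 1 sends you back to the previous container's logs.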

How do you diagnose a Pending or unschedulable pod?

A pod stuck in Pending never started, so there are no application logs to read. The answer lives entirely in events and the scheduler’s verdict.

Capture:

kubectl -n production describe pod ml-trainer-0 | grep -A10 Events
kubectl get nodes -o wide
kubectl describe nodes | grep -A6 "Allocated resources"

Prompt:

This pod is stuck Pending. Here are the scheduling events and the
Allocated resources summary for each node. Why won't it schedule, and
what's the smallest change to fix it?

[paste Events]
[paste Allocated resources]

Example AI answer:

The event 0/5 nodes are available: 5 Insufficient memory is decisive — no node has enough free memory for this pod’s requests.memory: 16Gi. Your largest node shows ~12Gi allocatable. Options, smallest first: (1) lower the pod’s memory request if 16Gi is overstated; (2) add a node large enough; (3) if this is a node pool with autoscaling, confirm the autoscaler is healthy because it should have added a node. Check kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50 for scale-up errors.

Pending pods are where AI saves the most time per incident, because the scheduler’s messages (Insufficient cpu, node(s) had untolerated taint, node(s) had volume node affinity conflict) are precise but easy to misread when you’re tired. The model translates the message into the specific config field to change. If taints are involved, the Kubernetes Prompt Pack includes a taint/toleration and bin-packing diagnosis prompt that pairs well here.
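A narrower capture for Pending pods can pull only the scheduler's verdict plus each node's allocatable CPU and memory. This is a sketch under assumptions: capture_pending is a hypothetical name, and the column headers in the custom-columns spec are arbitrary labels chosen for this example.

```shell
# capture_pending NAMESPACE POD: print FailedScheduling events for the pod,
# then each node's allocatable CPU and memory. Read-only.
capture_pending() {
  ns=$1
  pod=$2

  printf '%s\n' "--- FailedScheduling events ---"
  kubectl -n "$ns" get events \
    --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
    --sort-by=.lastTimestamp

  printf '%s\n' "--- Node allocatable ---"
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```

Allocatable is what the scheduler can hand out on an empty node; comparing it with the pod's requests shows whether any node could ever fit the pod, separately from how full the nodes are right now.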

How do you fix ImagePullBackOff with AI?

ImagePullBackOff and ErrImagePull are infrastructure problems, not application ones — the kubelet couldn’t get the image. The event message names the exact failure.

Capture:

kubectl -n production describe pod payments-api-5f7-zzzzz | grep -A8 Events

Paste the events and ask:

ImagePullBackOff. Here are the events. What's the precise cause and fix?

[paste Events]

The AI reads the event string and maps it for you:

The event reads Failed to pull image "registry.internal/payments:v2.3.1": ... unauthorized: authentication required. This is a registry-auth failure, not a missing tag. The pod’s ServiceAccount has no valid imagePullSecret for registry.internal, or the secret expired. Confirm with kubectl -n production get sa default -o jsonpath='{.imagePullSecrets}'. If empty, that’s your fix: attach a working pull secret.

Each ImagePullBackOff message — manifest unknown (bad tag), unauthorized (bad credentials), no such host (bad registry name or DNS), i/o timeout (network egress) — maps to a different fix, and the model knows the mapping cold. This is exactly the kind of lookup-heavy reasoning where a general assistant like Claude or ChatGPT shines, because the failure modes are well documented in its training data.
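The message-to-cause mapping in the paragraph above is small enough to keep as a lookup of your own. A minimal sketch: classify_pull_error is a hypothetical helper, the four patterns are exactly the ones listed above, and real event text varies by container runtime and registry, so treat a miss as "read the full message" and not as "no problem".

```shell
# classify_pull_error MESSAGE: map an image-pull event message to a likely cause.
classify_pull_error() {
  case $1 in
    *"manifest unknown"*) echo "bad tag" ;;
    *unauthorized*)       echo "bad credentials" ;;
    *"no such host"*)     echo "bad registry name or DNS" ;;
    *"i/o timeout"*)      echo "network egress" ;;
    *)                    echo "unrecognized, read the full event message" ;;
  esac
}
```

Passing it the event string from the example, classify_pull_error 'Failed to pull image "registry.internal/payments:v2.3.1": ... unauthorized: authentication required', gives the same first-pass verdict the model did: bad credentials.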

How do you debug Service connectivity and failed rollouts?

These two are the multi-step cases where the loop earns its keep.

For Service connectivity (“connection refused” between pods), the path is Service → endpoints → pod readiness → NetworkPolicy → app listening. Capture the chain:

kubectl -n production get svc checkout
kubectl -n production get endpoints checkout
kubectl -n production get pods -l app=checkout -o wide
kubectl -n production get networkpolicy

Ask the AI to walk the path and tell you where it breaks. A common answer: “get endpoints checkout returns <none> while pods exist — your Service selector doesn’t match the pod labels. Compare spec.selector on the Service to the pod labels.” That one mismatch is responsible for an embarrassing share of “the network is broken” tickets. For a full treatment of this path, see debugging Kubernetes service connectivity with AI.
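The selector-versus-labels comparison can be captured in one go. This is a rough sketch, with compare_selector as a made-up name; it lists every pod in the namespace so that a pod whose labels do not match the selector still shows up.

```shell
# compare_selector NAMESPACE SERVICE: show the Service's spec.selector next to
# the labels on the pods in the namespace. Read-only.
compare_selector() {
  ns=$1
  svc=$2

  printf '%s\n' "--- Service selector ---"
  kubectl -n "$ns" get svc "$svc" -o jsonpath='{.spec.selector}{"\n"}'

  printf '%s\n' "--- Pod labels ---"
  kubectl -n "$ns" get pods --show-labels
}
```

With compare_selector production checkout, a mismatch such as a selector on app=checkout against pods labelled app=checkout-api is visible at a glance, and the output is small enough to paste whole.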

For a failed rollout, capture the deployment’s status and the new ReplicaSet:

kubectl -n production rollout status deploy/checkout --timeout=10s
kubectl -n production describe deploy checkout | grep -A8 Conditions
kubectl -n production get rs -l app=checkout
kubectl -n production describe pod <newest-checkout-pod>

Paste those and the model will tell you whether the new pods are failing readiness probes, crash-looping on the new image, or whether the rollout is simply blocked waiting on maxUnavailable. The answer usually points back to one of the single-pod cases above — which is why getting fast at pod-level triage makes you fast at rollout triage too.
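The <newest-checkout-pod> placeholder in the capture above can be filled in by sorting pods on creation time. A small sketch, assuming the app=checkout label from the earlier commands; newest_pod is a hypothetical helper name.

```shell
# newest_pod NAMESPACE LABEL_SELECTOR: print the most recently created pod
# matching the selector, in resource/name form (for example pod/checkout-abc).
newest_pod() {
  kubectl -n "$1" get pods -l "$2" \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1
}
```

The result can be fed back into describe, as in kubectl -n production describe "$(newest_pod production app=checkout)", since describe accepts the resource/name form.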

What tools should you use?

There are two families, and they’re complementary.

General-purpose assistants (Claude, ChatGPT, or any capable chat model) are what I reach for first. You paste kubectl output, they reason over it, and there’s no setup. They’re unbeatable for the open-ended “I have no idea what this means” moments and for reading dense describe output. The cost is that you’re the integration: you copy output in, copy commands out.

Purpose-built tooling — the k8sgpt-style approach — runs a deterministic rule engine against your cluster, collects the same evidence automatically, and only then sends a focused, redacted slice to a model for a plain-English explanation. This is the right call for repeatable, scoped diagnosis and for wiring AI triage into an on-call flow. The same two-layer pattern (deterministic capture, then AI explanation) powers the incident-response workflow on this site.

My rule of thumb: reach for a general assistant when you’re exploring an unfamiliar failure, and for k8sgpt-style tooling when you’ve standardized a diagnosis and want it to run consistently.

A copy-paste troubleshooting prompt template

This is the prompt I keep pinned. It works across every scenario above because it forces ranked hypotheses and a single next step instead of a wall of speculation:

You are an SRE pair-debugging a Kubernetes issue with me. You are
READ-ONLY: you may only suggest kubectl get/describe/logs/top commands,
never anything that mutates the cluster, and you never ask for my
kubeconfig or credentials.

Symptom: <one line, e.g. "checkout-api pod in CrashLoopBackOff">
Namespace: <ns>
What I've tried: <brief>

Evidence (secrets redacted):
--- kubectl get pod -o wide ---
<paste>
--- kubectl describe pod (Last State + Events) ---
<paste>
--- kubectl logs --previous --tail=100 ---
<paste>

Respond with:
1. Top 3 root-cause hypotheses, ranked, each with a one-line reason
   tied to the evidence.
2. The SINGLE next read-only command to confirm hypothesis #1.
3. What output would confirm vs. rule it out.
Do not propose a fix until I confirm the cause.

Save it. Swap the evidence each loop. The structure is what produces a fast, useful answer instead of a generic checklist. You’ll find variations of this and scenario-specific versions in the Kubernetes Prompt Pack.
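If you would rather fill the template from the command line than edit it by hand each round, a here-document does the job. This is one possible sketch: build_prompt is a hypothetical function, it reads already-redacted evidence on stdin, and it prints the prompt to stdout without sending it anywhere.

```shell
# build_prompt SYMPTOM NAMESPACE TRIED: read redacted evidence on stdin and
# print the filled-in prompt from the template above.
build_prompt() {
  symptom=$1
  ns=$2
  tried=$3
  evidence=$(cat)   # read stdin before the here-document is opened

  cat <<EOF
You are an SRE pair-debugging a Kubernetes issue with me. You are
READ-ONLY: you may only suggest kubectl get/describe/logs/top commands,
never anything that mutates the cluster, and you never ask for my
kubeconfig or credentials.

Symptom: $symptom
Namespace: $ns
What I've tried: $tried

Evidence (secrets redacted):
$evidence

Respond with:
1. Top 3 root-cause hypotheses, ranked, each with a one-line reason
   tied to the evidence.
2. The SINGLE next read-only command to confirm hypothesis #1.
3. What output would confirm vs. rule it out.
Do not propose a fix until I confirm the cause.
EOF
}
```

A typical call would be build_prompt "checkout-api pod in CrashLoopBackOff" production "restarted once" < evidence.txt, where evidence.txt is an example file holding the redacted capture.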

How do you do this without leaking secrets?

This is the part nobody can skip. kubectl output is full of things you should not paste into a chat box, and the boundary is simple: the model gets evidence, never access.

  • Never give the model your kubeconfig, tokens, or cluster credentials. It does not need them. It reads output and suggests commands; you run them. If a tool asks to “connect to your cluster,” that tool — not the model — holds the access, and you vet it like any other cluster integration.

  • Redact before pasting. kubectl get secret -o yaml and kubectl describe can surface base64 secrets, connection strings, tokens, and internal hostnames. Strip them. A quick filter, which drops every line that mentions one of these words (so skim the result for anything it missed):

    kubectl -n production describe pod mypod \
      | grep -viE 'token|password|secret|key|authorization'

  • Prefer status over data. You almost never need the secret’s value to diagnose; you need to know whether it exists and is mounted. kubectl get secret db-creds (without -o yaml) tells you that safely.

  • Watch internal topology. Internal DNS names, IP ranges, and namespace structure are mild intel leakage. For most teams that’s acceptable; for regulated environments, use self-hosted tooling that keeps evidence inside your boundary.

The redaction step costs ten seconds and is non-negotiable. The whole value proposition collapses if the convenience of pasting output becomes the mechanism that leaks your database password.
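Dropping whole lines can also throw away context the model needs, so an alternative is to keep each line and mask only the value. The following is a rough sketch, not a complete secret scanner: redact is a hypothetical name, the three patterns (keyword-named fields, credentials embedded in URLs, Bearer tokens) are assumptions about what your output contains, and only all-uppercase or all-lowercase keywords are matched. Read the result before pasting it.

```shell
# redact: read text on stdin, mask likely secret values, keep every line.
redact() {
  sed -E \
    -e 's/([A-Za-z0-9_]*(PASSWORD|TOKEN|SECRET|KEY|password|token|secret|key)[A-Za-z0-9_]*[:=][[:space:]]*)[^[:space:]]+/\1<REDACTED>/g' \
    -e 's#([A-Za-z][A-Za-z0-9+.-]*://)[^/@[:space:]]+@#\1<REDACTED>@#g' \
    -e 's#(Bearer[[:space:]]+)[A-Za-z0-9._~+/=-]+#\1<REDACTED>#g'
}
```

Used as kubectl -n production describe pod mypod | redact, a line such as DB_PASSWORD:  hunter2 stays in place with its value masked, so the model still sees that the variable is set without seeing what it is set to.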

Keep a human in the loop

The model is a fast, well-read junior engineer who has seen this failure a thousand times — and who occasionally states a wrong answer with total confidence. That combination is exactly why the loop ends with you, not with it.

Concretely: the AI proposes hypotheses and read-only commands. You run every command. You confirm the root cause against real output. You make the change — the kubectl edit, the rollout restart, the scale. Never let an AI agent hold write access to a production cluster and act on its own conclusions. A wrong hypothesis that only costs you a get command is a free guess; a wrong hypothesis wired to kubectl delete is an outage. Keep the destructive verbs on the human side of the boundary and you get all of the speed with none of the catastrophic downside.

That discipline is also what makes this sustainable. Because you verify every step, you stay sharp on the cluster instead of outsourcing your judgment. The AI compresses the reading; it doesn’t replace the operator.

FAQ

Can AI read kubectl describe and events output? Yes — this is its strongest use. kubectl describe and get events are verbose, structured text, and reading them is pattern-matching the AI does well. Paste the Events and Last State sections and it will correlate timestamps, exit codes, and messages faster than scrolling by hand. Just redact any secrets or tokens first.

Should I give an AI agent access to my Kubernetes cluster? For diagnosis, no — there’s no need. The high-value pattern is read-only: you run kubectl, paste output, the model interprets. If you use tooling that connects to the cluster (like k8sgpt-style agents), keep it scoped to read-only RBAC and never grant write or delete permissions to anything an AI drives autonomously.

Is a general assistant or a purpose-built tool better for Kubernetes troubleshooting? Both, for different jobs. Reach for a general assistant like Claude or ChatGPT when you’re exploring an unfamiliar failure and want open-ended reasoning over pasted output. Use k8sgpt-style tooling when you’ve standardized a diagnosis and want it to run automatically and consistently in your on-call flow.

How accurate is AI at finding Kubernetes root causes? Accurate at reading evidence, fallible at single-snapshot judgment. For deterministic signals — exit code 137 is OOMKilled, Insufficient memory is a scheduling shortfall — it’s reliable. For cases needing trend data (a memory leak vs. an undersized limit), a good answer tells you which observation distinguishes them rather than guessing. Always verify with a real command before acting.

What’s the single biggest time-saver? Capturing --previous container logs and the Last State block before you ask anything. For crashing pods, that’s where the actual cause lives, and handing it to the model up front usually saves at least one round of back-and-forth.

Conclusion

Using AI to troubleshoot Kubernetes faster isn’t about a smarter tool replacing you — it’s about putting the tedious reading where it belongs. Capture deterministic evidence with kubectl, hand the AI a focused and redacted slice, get ranked hypotheses and one next command, verify it yourself, then act. Keep the model read-only, keep the credentials out of the chat box, and keep yourself as the one who decides. Do that and the three-AM scroll-through-events ritual turns into a tight two-minute loop. Start with the prompt template above, browse the rest of the Kubernetes and Helm guides, and pick up the Kubernetes Prompt Pack if you want the scenario-specific prompts ready to paste.
