Kubernetes Node NotReady Diagnosis Prompt

Diagnose why a Kubernetes Node is `NotReady` — kubelet failures, container runtime crashes, disk/PID pressure, network plugin down, certificate expiry.

Target user: Kubernetes platform engineers and cluster operators
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes cluster operator who has recovered hundreds of `NotReady` nodes in production — managed services (EKS/GKE/AKS) and self-managed (kubeadm, kops, k3s, RKE).

I will provide:
- `kubectl describe node <node>` (Conditions section is critical)
- The kubelet status from the affected node: `systemctl status kubelet` + `journalctl -u kubelet --since "30 min ago" -n 200`
- The container runtime status: `systemctl status containerd` (or crio) + `journalctl -u containerd -n 200`
- Output of `crictl ps -a` and `crictl info` from the node (if accessible)
- Output of `df -hT`, `df -i`, `free -h` on the node
- Whether the node is part of a managed service (EKS/GKE/AKS) or self-managed
- When the node went NotReady and what changed (deploy, upgrade, reboot, autoscaler event)

Your job:

1. **Decode the Node Conditions** from `kubectl describe`:
   - `Ready` status and reason (the most informative field)
     - `Ready=False`, `KubeletNotReady`: container runtime issue, network plugin not ready, or kubelet just started
     - `Ready=Unknown`, `NodeStatusUnknown`: kubelet stopped reporting (heartbeat lost; node may have rebooted or crashed)
   - `DiskPressure=True`: imagefs or rootfs above eviction threshold
   - `MemoryPressure=True`: node memory below `--eviction-hard` threshold
   - `PIDPressure=True`: kernel PID exhaustion
   - `NetworkUnavailable=True`: CNI not configured (common right after node creation)
2. **For `Ready=False, KubeletNotReady`**:
   - Container runtime crashed/stopped? `systemctl status containerd`
   - CNI plugin missing/crashed? `ls /etc/cni/net.d/` (should have a `*.conflist` or `*.conf` file); `crictl info` shows `RuntimeReady` as `true` and `NetworkReady` as `false` under `status.conditions` when CNI is the cause
   - Kubelet can't reach API server? `journalctl -u kubelet | grep -i "unable\|refused\|x509"`
   - Kubelet client cert expired? Common on long-lived clusters; symptom is `x509: certificate has expired`
3. **For `NodeStatusUnknown`**:
   - Node entirely unreachable? Try SSH, console, cloud provider status
   - Network partition? Other nodes Ready?
   - Kernel panic / hard hang? Check console / serial output
4. **For Pressure conditions**:
   - `DiskPressure`: `df -hT` shows what's full; common causes are container logs under `/var/log/pods` (or `/var/lib/docker/containers` on Docker-based nodes), image layers in `/var/lib/containerd`, kubelet's image GC failing, journald unbounded
   - `MemoryPressure`: `free -h`, then look at top RSS consumers — often a single misbehaving pod with no memory limit
   - `PIDPressure`: `ps -eLf | wc -l` (threads also consume PIDs) and `cat /proc/sys/kernel/pid_max`; usually a fork bomb pod
5. **Walk the kubelet failure modes** if logs show issues:
   - `failed to start ContainerManager: failed to get rootfs info`: runtime/storage issue
   - `Error getting node`: API server connectivity
   - `Unable to register node`: TLS bootstrap or token issue
   - `network plugin is not ready: cni config uninitialized`: CNI install in progress or failed
   - `OOMKilled` of kubelet itself: rare but possible; kubelet cgroup pressure
6. **For managed services (EKS/GKE/AKS)**:
   - Node may be replaced automatically by the node group; don't fight the autoscaler
   - Drain + delete is often the right answer over manual repair
   - Cloud-side events (instance termination, AZ issue) are out-of-band
7. **Recommend safest recovery in order**:
   - **If pods are still running and healthy**: prefer `kubectl cordon` to stop new schedules, then investigate without panic
   - **If pods are unhealthy**: `kubectl drain` to evict (respecting PDBs)
   - **If unrecoverable**: terminate the node and let the autoscaler / DaemonSet recreate
8. Mark DESTRUCTIVE clearly: `kubectl delete node` (loses node object; pods reschedule but state lost), force-restart kubelet during active workload, `crictl rmi` to free disk (removes images other pods may need).

---

Node name + role: [worker / control-plane]
Cluster type: [EKS / GKE / AKS / kubeadm / k3s / RKE / OpenShift]
Symptom timeline: [when did it go NotReady; what changed before]
`kubectl describe node <node>` (especially Conditions):
```
[PASTE]
```
`systemctl status kubelet` + `journalctl -u kubelet -n 200`:
```
[PASTE]
```
`systemctl status containerd` (or crio):
```
[PASTE]
```
`crictl ps -a` and `crictl info` (if accessible):
```
[PASTE]
```
Resource state: `df -hT`, `df -i`, `free -h`:
```
[PASTE]
```

Safety notes

`kubectl delete node <node>` removes the Node object but does NOT terminate the underlying VM. The kubelet will re-register the node the next time it restarts (including on reboot). To truly remove it, terminate the cloud instance OR remove it from the node group.
Restarting kubelet on a working node is mostly safe: running containers keep running, BUT node heartbeats and probe handling pause briefly, so expect a short NotReady blip in the cluster's view.
`crictl rmi --prune` frees disk by deleting unreferenced images. Running pods are fine, but new pods may need to re-pull. Don't do this if image pull rate is limited.
Draining a control-plane node is usually the first step toward removing it, and removal requires manual etcd member handling on self-managed clusters. Don't drain a control-plane node casually.
Force-removing finalizers on Node objects (`kubectl patch node ... -p '{"metadata":{"finalizers":null}}'`) is for stuck-Terminating nodes ONLY, after confirming the underlying VM is gone.
Don't bypass the autoscaler by manually creating nodes on managed services; the next scale event will tear them down.
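
The control-plane caution above can be turned into a small pre-drain guard. This is a minimal sketch: the function name `is_control_plane` is hypothetical, and it assumes the node carries one of the standard role labels (`node-role.kubernetes.io/control-plane`, or the legacy `node-role.kubernetes.io/master`), which not every distribution sets.

```shell
# Return 0 if a label dump mentions a control-plane role label, 1 otherwise.
# Input: the node's labels as a single string (any format that contains the keys).
is_control_plane() {
  case "$1" in
    *node-role.kubernetes.io/control-plane*|*node-role.kubernetes.io/master*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not run here):
#   labels=$(kubectl get node "$node" -o jsonpath='{.metadata.labels}')
#   if is_control_plane "$labels"; then
#     echo "refusing to drain control-plane node $node" >&2
#   else
#     kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
#   fi
```

The guard only inspects labels; it does not check etcd membership, so treat a "not control-plane" answer as a hint rather than proof.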

Why this prompt works

Node NotReady is a category label, not a diagnosis. It can mean kubelet crashed, runtime crashed, CNI broke, disk full, PID exhausted, certificate expired, or network partition. Each has a different recovery path. This prompt forces a Conditions-first walk so you don’t restart kubelet hoping it fixes a CA expiry.

How to use it

Always include the Conditions section from kubectl describe node. The Reason and Message fields tell you which subsystem is unhealthy.
Capture the kubelet journal log around the failure time. Just --since "30 min ago" is fine; the first error in that window is usually the real one.
For multi-node failures, capture from at least two nodes — common to find a control-plane or CNI control issue rather than per-node.
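
The "first error in the window" advice can be sketched as a small filter over the journal output. The function name `first_kubelet_error` and the keyword list are assumptions for illustration; kubelet messages vary by version, so extend the pattern for your environment.

```shell
# Print the first line on stdin that looks like a kubelet failure, or nothing.
# The keyword list is an assumption, not an exhaustive match for kubelet logs.
first_kubelet_error() {
  grep -m1 -iE 'x509|refused|unable to|failed|not ready' || true
}

# Usage on the node (not run here):
#   sudo journalctl -u kubelet --since "30 min ago" --no-pager | first_kubelet_error
```

Reading from the start of the window (rather than the tail) matters here, because later errors are often consequences of the first one.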

Useful commands

# Cluster + node view
kubectl get nodes
kubectl get nodes -o wide
kubectl describe node <node> | head -60
kubectl top node

# Node conditions only
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{.reason}{" "}{.message}{"\n"}{end}'

# Events related to the node
kubectl get events --field-selector involvedObject.name=<node>,involvedObject.kind=Node

# Pods on the affected node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

# Cordon / drain (safest first)
kubectl cordon <node>                      # stop new schedules
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=10m

# On the node itself (SSH or console)
sudo systemctl status kubelet
sudo systemctl status containerd            # or crio
sudo journalctl -u kubelet --since "30 min ago" -n 200
sudo journalctl -u containerd --since "30 min ago" -n 200

sudo crictl info
sudo crictl ps -a
sudo crictl images
sudo crictl logs <container-id>

# Disk / inodes / memory / PID
df -hT
df -i
free -h
ps -eLf | wc -l                             # -L lists threads, which also consume PIDs
cat /proc/sys/kernel/pid_max

# CNI sanity
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist

# Kubelet client cert (expiry check)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Recover (in order of preference)
sudo systemctl restart kubelet                          # safe; brief blip
sudo systemctl restart containerd && sudo systemctl restart kubelet   # if runtime stuck
sudo crictl rmi --prune                                 # free disk; safe but cache loss
# Last resort: cordon, drain, terminate node, let autoscaler replace
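
The task count and `pid_max` from the commands above can be combined into a single headroom figure. A minimal sketch; the function name `pid_usage_pct` is hypothetical, and the result is an integer percentage (rounded down).

```shell
# Percentage of the PID space in use, given a task count and pid_max.
pid_usage_pct() {
  used=$1
  max=$2
  echo $(( used * 100 / max ))
}

# Usage on the node (not run here):
#   pid_usage_pct "$(ps -eLf --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```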

Condition decoder

| Condition / Reason | Most likely cause | First fix |
| --- | --- | --- |
| `KubeletNotReady` + `runtime: ...not ready` | Container runtime down | `systemctl restart containerd` |
| `KubeletNotReady` + `network plugin is not ready` | CNI install incomplete or DaemonSet crashing | Check CNI pods in `kube-system` |
| `KubeletNotReady` + `unable to register` | TLS bootstrap / token issue (new node) | Check bootstrap token, CA |
| `NodeStatusUnknown` | Kubelet stopped reporting | SSH/console; check whether the node is alive at all |
| `DiskPressure=True` | imagefs/rootfs above eviction threshold | `crictl rmi --prune`; clean logs |
| `MemoryPressure=True` | Available memory below the `memory.available` eviction threshold (`--eviction-hard`) | Identify the top-RSS pod; the condition clears once it is evicted or killed |
| `PIDPressure=True` | Kernel PID space exhausted | Find runaway-fork pod; reboot if the kernel won't recover |
| `Ready=Unknown` + recent reboot | Normal during reboot | Wait 2 min |
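
The decoder above can be sketched as a filter over the "Node conditions only" jsonpath output, which prints one `Type=Status Reason Message` line per condition. This is an illustrative sketch: the function name `decode_conditions` is hypothetical, and the message substrings it matches are assumptions that may differ between kubelet versions.

```shell
# Reads "Type=Status Reason Message..." lines on stdin and prints one hint
# per unhealthy condition. Input is lowercased so matching is case-insensitive.
decode_conditions() {
  tr '[:upper:]' '[:lower:]' | while read -r cond reason message; do
    case "$cond" in
      ready=true) ;;  # healthy, nothing to report
      ready=*)
        # CNI is checked before the runtime because CNI messages also
        # contain the word "runtime".
        case "$reason $message" in
          nodestatusunknown*)
            echo "Ready: heartbeat lost, check the node is alive (SSH/console)" ;;
          *"network plugin"*|*networkready=false*|*cni*)
            echo "Ready: CNI not ready, check CNI pods in kube-system" ;;
          *runtime*)
            echo "Ready: container runtime not ready, check containerd" ;;
          *"unable to register"*)
            echo "Ready: registration failing, check bootstrap token and CA" ;;
          *)
            echo "Ready: unclassified, read the kubelet journal" ;;
        esac ;;
      diskpressure=true)       echo "DiskPressure: check df -hT and df -i" ;;
      memorypressure=true)     echo "MemoryPressure: find the top-RSS pod" ;;
      pidpressure=true)        echo "PIDPressure: find the runaway-fork pod" ;;
      networkunavailable=true) echo "NetworkUnavailable: CNI not configured" ;;
    esac
  done
}

# Usage (not run here):
#   kubectl get node "$node" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{.reason}{" "}{.message}{"\n"}{end}' | decode_conditions
```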

Kubelet cert expiry (common silent killer)

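# Check expiry
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Auto-rotation should handle this. If the cert is still valid but close to expiry,
# a restart can trigger rotation:
sudo systemctl restart kubelet
# Watch for rotation in logs:
sudo journalctl -u kubelet -f | grep -i "certificate\|csr"

# On kubeadm clusters:
sudo kubeadm certs check-expiration        # control-plane certs only; kubelet client certs are not listed
# `kubeadm certs renew` has no kubelet-client target. If the kubelet client cert has
# already expired, the kubelet cannot request a new one. Regenerate kubelet.conf on a
# control-plane node, copy it to /etc/kubernetes/kubelet.conf on the affected node,
# then restart kubelet:
sudo kubeadm kubeconfig user --org system:nodes --client-name system:node:<node> > kubelet.conf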
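
The expiry check above prints dates that still have to be compared by eye. A minimal sketch that converts the `notAfter=` line into days remaining follows; the function name `cert_days_left` is hypothetical, and it assumes GNU `date` (for `-d`), which is typical on Linux nodes but not universal.

```shell
# Days until a certificate expires, from an openssl "notAfter=..." line.
# $1: the notAfter line; $2 (optional): "now" as epoch seconds.
# Zero or negative means the cert has expired or expires within a day.
cert_days_left() {
  not_after=${1#notAfter=}
  now=${2:-$(date +%s)}
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Usage on the node (not run here):
#   cert_days_left "$(sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate)"
```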

DiskPressure recovery (most common cause)

# What's filling /var/lib?
sudo du -shx /var/lib/* | sort -h | tail
# Common culprits:
#   /var/lib/containerd: image / overlay layers
#   /var/lib/kubelet/pods/*/volumes: emptyDir, downwardAPI, etc.
#   /var/log/pods: container logs (per-pod dirs)

# Image GC
sudo crictl rmi --prune                          # safe
# (kubelet's own image GC runs on configured thresholds)

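# Container log size: with CRI runtimes (containerd, CRI-O) the kubelet rotates container logs
# (kubeadm default config path shown; it differs on managed services)
sudo grep -iE 'containerLogMax(Size|Files)' /var/lib/kubelet/config.yaml
# Defaults are containerLogMaxSize: 10Mi and containerLogMaxFiles: 5; if they were raised too far, lower them and restart kubelet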

# Journald
sudo journalctl --vacuum-size=500M
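
To spot the full filesystem quickly, the `df` output can be filtered by usage. A minimal sketch: the function name `disk_over_threshold` is hypothetical, the default of 85% is an assumption to tune against your eviction thresholds, and mount points containing spaces are not handled.

```shell
# Print "mountpoint usage%" for every filesystem at or above a threshold,
# reading POSIX-format `df -P` (or `df -Pi` for inodes) output on stdin.
disk_over_threshold() {
  threshold=${1:-85}
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign before comparing
    if (use + 0 >= t) print $6, $5
  }'
}

# Usage on the node (not run here):
#   df -P  | disk_over_threshold 85    # blocks
#   df -Pi | disk_over_threshold 85    # inodes
```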

Common findings this catches

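DiskPressure due to oversized container logs under /var/log/pods (or /var/lib/docker/containers/<id>/<id>-json.log on Docker-based nodes). Set log rotation: containerLogMaxSize and containerLogMaxFiles in the kubelet config, or the json-file log options on Docker.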
MemoryPressure from one pod without memory limit consuming the node — set a default LimitRange in the namespace.
CNI pod CrashLoopBackOff (e.g., calico-node) blocking all node-readiness — check the CNI pod, not the node directly.
Kubelet TLS expired on a long-lived cluster that never restarted — auto-rotation requires a kubelet restart in some versions.
PID exhaustion from a buggy app fork-storming in one pod: kill the pod, raise pid_max, and set podPidsLimit in the kubelet config (flag: --pod-max-pids).
Cloud node terminated by autoscaler while still showing in kubectl get nodes: the cloud controller manager normally removes the Node object once the instance is gone; if it lingers past the default 5-minute pod eviction window, run kubectl delete node.

When to escalate

Multi-node simultaneous NotReady → cluster-wide issue (CNI control plane, API server, networking) — fix at that layer, not per-node.
Control-plane node NotReady → engage cluster admin; etcd implications.
Recurring NotReady on the same node → hardware issue or persistent config drift; investigate or replace the node.
Managed service nodes — most providers prefer “terminate and replace” over manual repair.
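
The single-node versus multi-node distinction above can be checked before any per-node work starts. A minimal sketch, assuming the default column layout of `kubectl get nodes --no-headers` (name, then status); the function name `notready_scope` and the cut-off of "more than one NotReady node" are assumptions.

```shell
# Classify the blast radius from `kubectl get nodes --no-headers` output on stdin.
# Prints one of: none, single-node, multi-node.
notready_scope() {
  awk '$2 ~ /^NotReady/ { n++ }
       END {
         if (n + 0 == 0)  print "none"
         else if (n == 1) print "single-node"
         else             print "multi-node"
       }'
}

# Usage (not run here):
#   kubectl get nodes --no-headers | notready_scope
```

A `multi-node` result suggests looking at the CNI control plane, API server, or network first, as described in the escalation list.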
