Kubernetes Node NotReady Diagnosis Prompt
Diagnose why a Kubernetes Node is `NotReady` — kubelet failures, container runtime crashes, disk/PID pressure, network plugin down, certificate expiry.
- Target user: Kubernetes platform engineers and cluster operators
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes cluster operator who has recovered hundreds of `NotReady` nodes in production — managed services (EKS/GKE/AKS) and self-managed (kubeadm, kops, k3s, RKE).
I will provide:
- `kubectl describe node <node>` (Conditions section is critical)
- The kubelet status from the affected node: `systemctl status kubelet` + `journalctl -u kubelet --since "30 min ago" -n 200`
- The container runtime status: `systemctl status containerd` (or crio) + `journalctl -u containerd -n 200`
- Output of `crictl ps -a` and `crictl info` from the node (if accessible)
- Output of `df -hT`, `df -i`, `free -h` on the node
- Whether the node is part of a managed service (EKS/GKE/AKS) or self-managed
- When the node went NotReady and what changed (deploy, upgrade, reboot, autoscaler event)
Your job:
1. **Decode the Node Conditions** from `kubectl describe`:
- `Ready=False` reason — the most informative field
- `KubeletNotReady`: container runtime issue, network plugin not ready, or kubelet just started
- `NodeStatusUnknown`: kubelet stopped reporting (heartbeat lost; node may have rebooted or crashed)
- `DiskPressure=True`: imagefs or rootfs above eviction threshold
- `MemoryPressure=True`: node memory below `--eviction-hard` threshold
- `PIDPressure=True`: kernel PID exhaustion
- `NetworkUnavailable=True`: CNI not configured (common right after node creation)
2. **For `Ready=False, KubeletNotReady`**:
- Container runtime crashed/stopped? `systemctl status containerd`
- CNI plugin missing/crashed? `ls /etc/cni/net.d/` (should have a `*.conflist` or `*.conf` file); in `crictl info`, the `RuntimeReady` condition is `true` but `NetworkReady` is `false` when CNI is the cause
- Kubelet can't reach API server? `journalctl -u kubelet | grep -i "unable\|refused\|x509"`
- Kubelet client cert expired? Common on long-lived clusters; symptom is `x509: certificate has expired`
3. **For `NodeStatusUnknown`**:
- Node entirely unreachable? Try SSH, console, cloud provider status
- Network partition? Other nodes Ready?
- Kernel panic / hard hang? Check console / serial output
4. **For Pressure conditions**:
- `DiskPressure`: `df -hT` shows what's full; common causes are container logs (`/var/log/pods` with containerd or CRI-O, `/var/lib/docker/containers` with Docker), image layers in `/var/lib/containerd`, kubelet's image GC failing, unbounded journald
- `MemoryPressure`: `free -h`, then look at top RSS consumers — often a single misbehaving pod with no memory limit
- `PIDPressure`: compare `ps -eLf | wc -l` (threads count against the PID limit, not only processes) with `cat /proc/sys/kernel/pid_max`; usually a fork bomb or thread-leaking pod
5. **Walk the kubelet failure modes** if logs show issues:
- `failed to start ContainerManager: failed to get rootfs info`: runtime/storage issue
- `Error getting node`: API server connectivity
- `Unable to register node`: TLS bootstrap or token issue
- `network plugin is not ready: cni config uninitialized`: CNI install in progress or failed
- `OOMKilled` of kubelet itself: rare but possible; kubelet cgroup pressure
6. **For managed services (EKS/GKE/AKS)**:
- Node may be replaced automatically by the node group; don't fight the autoscaler
- Drain + delete is often the right answer over manual repair
- Cloud-side events (instance termination, AZ issue) are out-of-band
7. **Recommend safest recovery in order**:
- **If pods are still running and healthy**: prefer `kubectl cordon` to stop new schedules, then investigate without panic
- **If pods are unhealthy**: `kubectl drain` to evict (respecting PDBs)
- **If unrecoverable**: terminate the node and let the autoscaler / DaemonSet recreate
8. Mark DESTRUCTIVE clearly: `kubectl delete node` (loses node object; pods reschedule but state lost), force-restart kubelet during active workload, `crictl rmi` to free disk (removes images other pods may need).
---
Node name + role: [worker / control-plane]
Cluster type: [EKS / GKE / AKS / kubeadm / k3s / RKE / OpenShift]
Symptom timeline: [when did it go NotReady; what changed before]
`kubectl describe node <node>` (especially Conditions):
```
[PASTE]
```
`systemctl status kubelet` + `journalctl -u kubelet -n 200`:
```
[PASTE]
```
`systemctl status containerd` (or crio):
```
[PASTE]
```
`crictl ps -a` and `crictl info` (if accessible):
```
[PASTE]
```
Resource state: `df -hT`, `df -i`, `free -h`:
```
[PASTE]
```
Why this prompt works
Node NotReady is a category label, not a diagnosis. It can mean kubelet crashed, runtime crashed, CNI broke, disk full, PID exhausted, certificate expired, or network partition. Each has a different recovery path. This prompt forces a Conditions-first walk so you don't restart kubelet hoping it fixes an expired certificate.
How to use it
- Always include the Conditions section from `kubectl describe node`. The Reason and Message fields tell you which subsystem is unhealthy.
- Capture the kubelet journal log around the failure time. Just `--since "30 min ago"` is fine; the first error in that window is usually the real one.
- For multi-node failures, capture from at least two nodes; the cause is often a control-plane or CNI control-plane issue rather than a per-node fault.
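The "first error in the window" advice can be sketched as a small filter. This is a minimal sketch that assumes the journal text arrives on stdin; the function name `first_kubelet_error` and the keyword list are illustrative assumptions, not an exhaustive match for kubelet's messages.

```shell
# Hypothetical helper: print the first line of a kubelet log excerpt that looks
# like an error. Matches klog error lines (E followed by MMDD) and a few common
# failure keywords; the pattern list is an assumption and can be extended.
first_kubelet_error() {
  grep -m1 -iE '(^|[: ])E[0-9]{4} |x509|refused|unable to|failed to|not ready' \
    || echo "no obvious error line found"
}
```

Possible usage on the node: `sudo journalctl -u kubelet --since "30 min ago" --no-pager | first_kubelet_error`.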
Useful commands
# Cluster + node view
kubectl get nodes
kubectl get nodes -o wide
kubectl describe node <node> | head -60
kubectl top node
# Node conditions only
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{.reason}{" "}{.message}{"\n"}{end}'
# Events related to the node
kubectl get events -A --field-selector involvedObject.name=<node>,involvedObject.kind=Node
# Pods on the affected node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
# Cordon / drain (safest first)
kubectl cordon <node> # stop new schedules
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=10m
# On the node itself (SSH or console)
sudo systemctl status kubelet
sudo systemctl status containerd # or crio
sudo journalctl -u kubelet --since "30 min ago" -n 200
sudo journalctl -u containerd --since "30 min ago" -n 200
sudo crictl info
sudo crictl ps -a
sudo crictl images
sudo crictl logs <container-id>
# Disk / inodes / memory / PID
df -hT
df -i
free -h
ps -eLf | wc -l # -L lists threads, which also count toward the PID limit
cat /proc/sys/kernel/pid_max
# CNI sanity
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist
# Kubelet client cert (expiry check)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# Recover (in order of preference)
sudo systemctl restart kubelet # safe; brief blip
sudo systemctl restart containerd && sudo systemctl restart kubelet # if runtime stuck
sudo crictl rmi --prune # free disk; safe but cache loss
# Last resort: cordon, drain, terminate node, let autoscaler replace
Condition decoder
| Condition / Reason | Most likely cause | First fix |
|---|---|---|
| `KubeletNotReady` + `runtime: ...not ready` | Container runtime down | `systemctl restart containerd` |
| `KubeletNotReady` + `network plugin is not ready` | CNI install incomplete or DaemonSet crashing | Check CNI pods in `kube-system` |
| `KubeletNotReady` + `unable to register` | TLS bootstrap / token issue (new node) | Check bootstrap token, CA |
| `NodeStatusUnknown` | Kubelet stopped reporting | SSH/console; check the node is alive at all |
| `DiskPressure=True` | imagefs/rootfs above eviction threshold | `crictl rmi --prune`; clean logs |
| `MemoryPressure=True` | `memory.available` below `--eviction-hard` | Identify the hog pod; once it is evicted the condition should clear (restart kubelet if it does not) |
| `PIDPressure=True` | Kernel PID exhausted | Find runaway-fork pod; reboot if kernel won't recover |
| `Ready=Unknown` + recent reboot | Normal during reboot | Wait 2 min |
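The decoder table can be approximated as a small triage filter. This is a rough sketch, assuming input lines in the `Type=Status Reason Message` shape printed by the jsonpath query under "Useful commands"; the function name `triage_conditions` and the wording of each hint are illustrative.

```shell
# Hypothetical helper: read "Type=Status Reason Message" lines on stdin and
# print a suggested first check for each unhealthy condition. Healthy
# conditions (e.g. DiskPressure=False) produce no output.
triage_conditions() {
  while read -r cond reason rest; do
    case "$cond" in
      Ready=True) ;;  # healthy, nothing to report
      Ready=Unknown) echo "Ready: kubelet stopped reporting ($reason); check the node is alive via SSH/console" ;;
      Ready=False) echo "Ready: $reason; check container runtime, CNI, and kubelet logs" ;;
      DiskPressure=True) echo "DiskPressure: run df -hT and df -i; prune unused images and logs" ;;
      MemoryPressure=True) echo "MemoryPressure: find the top memory consumer pod" ;;
      PIDPressure=True) echo "PIDPressure: find the pod spawning processes or threads" ;;
      NetworkUnavailable=True) echo "NetworkUnavailable: check the CNI pods and /etc/cni/net.d/" ;;
    esac
  done
}
```

Possible usage: pipe the output of the "Node conditions only" jsonpath command into `triage_conditions`. The hints only point at a starting place; the Message field still needs to be read.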
Kubelet cert expiry (common silent killer)
# Check expiry
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
# Auto-rotation should handle this, but if it didn't:
sudo systemctl restart kubelet
# Watch for rotation in logs:
sudo journalctl -u kubelet -f | grep -i "certificate\|csr"
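# On kubeadm clusters: `kubeadm certs` covers control-plane certificates only;
# `kubeadm certs renew` has no `kubelet-client` target.
sudo kubeadm certs check-expiration # control-plane certs, not the kubelet client cert
# If the kubelet client cert has already expired, rotation cannot authenticate:
# rejoin the node (kubeadm reset, then kubeadm join) or regenerate kubelet.conf per the kubeadm docs.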
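To turn the `-dates` output into a number that can be alerted on, the expiry line can be converted to days remaining. This is a minimal sketch, assuming GNU `date` (for `-d`); the function name `days_until_expiry` and the optional "now" argument are illustrative.

```shell
# Hypothetical helper: days until expiry from an openssl "notAfter=..." line.
# Assumes GNU date. The optional second argument is "now" in epoch seconds
# (defaults to the current time), which makes the function easy to test.
# A zero or negative result means the certificate expires today or has expired.
days_until_expiry() {
  local not_after="${1#notAfter=}"
  local now="${2:-$(date +%s)}"
  local end
  end=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (end - now) / 86400 ))
}
```

Possible usage: `days_until_expiry "$(sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate)"`.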
DiskPressure recovery (most common cause)
# What's filling /var/lib?
sudo du -shx /var/lib/* | sort -h | tail
# Common culprits:
# /var/lib/containerd — image / overlay layers
# /var/lib/kubelet/pods/*/volumes — emptyDir, downwardAPI, etc.
# /var/log — pod logs (per-pod dirs)
# Image GC
sudo crictl rmi --prune # safe
# (kubelet's own image GC runs on configured thresholds)
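# Container log size: with CRI runtimes the kubelet rotates container logs
# (KubeletConfiguration fields containerLogMaxSize and containerLogMaxFiles)
sudo grep -i 'containerLogMax' /var/lib/kubelet/config.yaml # kubeadm default path; may differ
# If unset, kubelet defaults apply (10Mi per file, 5 files); lower them and restart kubelet if logs still fill the disk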
# Journald
sudo journalctl --vacuum-size=500M
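Before cleaning anything, it helps to list which filesystems are actually near the limit. This is a small sketch that filters `df -P` output; the function name `full_filesystems` and the default threshold of 85% are assumptions, not kubelet's configured eviction values.

```shell
# Hypothetical helper: read `df -P` (or `df -Pi`) output on stdin and print the
# mount point and usage of every filesystem at or above a threshold percentage.
# Default threshold is 85, an illustrative value. Mount points containing
# spaces are not handled.
full_filesystems() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # strip the percent sign from the Capacity column
    if (use + 0 >= t + 0) print $6, $5
  }'
}
```

Possible usage: `df -P | full_filesystems 85` for space and `df -Pi | full_filesystems 90` for inodes.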
Common findings this catches
- DiskPressure due to unbounded container logs (`/var/log/pods/` with containerd, `/var/lib/docker/containers/<id>/<id>-json.log` with Docker). Set log rotation.
- MemoryPressure from one pod without a memory limit consuming the node; set a default LimitRange in the namespace.
- CNI pod CrashLoopBackOff (e.g., calico-node) blocking all node readiness; check the CNI pod, not the node directly.
- Kubelet TLS expired on a long-lived cluster that never restarted; auto-rotation requires a kubelet restart in some versions.
- PID exhaustion from a buggy app fork-storming in one pod; kill the pod, raise `pid_max`, set `--pod-max-pids` in kubelet config.
- Cloud node terminated by autoscaler while still showing in `kubectl get nodes`; the node controller normally removes the Node object within a few minutes. If it lingers, `kubectl delete node`.
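For the PID exhaustion finding, a quick way to see how close a node is to the limit is to compare the number of existing tasks with `pid_max`. This sketch reads the fourth field of `/proc/loadavg`, which reports runnable/total kernel scheduling entities (processes and threads); the function name `pid_usage_pct` is hypothetical.

```shell
# Hypothetical helper: percentage of kernel.pid_max currently in use.
# $1 is a /proc/loadavg line, whose 4th field is "runnable/total" scheduling
# entities; $2 is the value of /proc/sys/kernel/pid_max.
pid_usage_pct() {
  local loadavg_line="$1" pid_max="$2"
  local entities total
  entities=$(echo "$loadavg_line" | awk '{ print $4 }')
  total="${entities#*/}"         # keep the number after the slash
  echo $(( total * 100 / pid_max ))
}
```

Possible usage on the node: `pid_usage_pct "$(cat /proc/loadavg)" "$(cat /proc/sys/kernel/pid_max)"`.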
When to escalate
- Multi-node simultaneous NotReady → cluster-wide issue (CNI control plane, API server, networking) — fix at that layer, not per-node.
- Control-plane node NotReady → engage cluster admin; etcd implications.
- Recurring NotReady on the same node → hardware issue or persistent config drift; investigate or replace the node.
- Managed service nodes — most providers prefer “terminate and replace” over manual repair.
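To tell a single-node fault from a cluster-wide one quickly, count how many nodes are NotReady. This is a minimal sketch, assuming the default column layout of `kubectl get nodes --no-headers` (name first, status second); the function name `count_not_ready` is illustrative.

```shell
# Hypothetical helper: count nodes whose STATUS column starts with NotReady
# (this also catches "NotReady,SchedulingDisabled"). Expects the output of
# `kubectl get nodes --no-headers` on stdin.
count_not_ready() {
  awk '$2 ~ /^NotReady/ { n++ } END { print n + 0 }'
}
```

Possible usage: `kubectl get nodes --no-headers | count_not_ready`. More than one NotReady node at the same time suggests looking at the shared layer first.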
Related prompts
- Kubernetes Pod Troubleshooting Prompt: Diagnose any misbehaving pod (pending, evicted, networking-broken, storage-stuck, or just plain slow) with a structured AI walkthrough.
- Linux Disk Full / Inode Exhaustion Diagnosis Prompt: Diagnose why a Linux filesystem is full or out of inodes, including deleted-but-held files, journal bloat, reserved blocks, and hidden mount-shadowed data.
- Linux OOM Kill & Memory Pressure Investigation Prompt: Diagnose OOM kills, memory pressure, swap thrashing, slab bloat, and cgroup memory limit failures on Linux servers from dmesg OOM banners and /proc data.