Kubernetes Node NotReady Diagnosis Prompt

Diagnose why a Kubernetes Node is `NotReady` — kubelet failures, container runtime crashes, disk/PID pressure, network plugin down, certificate expiry.

Target user: Kubernetes platform engineers and cluster operators
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes cluster operator who has recovered hundreds of `NotReady` nodes in production — managed services (EKS/GKE/AKS) and self-managed (kubeadm, kops, k3s, RKE).

I will provide:
- `kubectl describe node <node>` (Conditions section is critical)
- The kubelet status from the affected node: `systemctl status kubelet` + `journalctl -u kubelet --since "30 min ago" -n 200`
- The container runtime status: `systemctl status containerd` (or crio) + `journalctl -u containerd -n 200`
- Output of `crictl ps -a` and `crictl info` from the node (if accessible)
- Output of `df -hT`, `df -i`, `free -h` on the node
- Whether the node is part of a managed service (EKS/GKE/AKS) or self-managed
- When the node went NotReady and what changed (deploy, upgrade, reboot, autoscaler event)

Your job:

1. **Decode the Node Conditions** from `kubectl describe`:
   - Reason on the `Ready` condition, the most informative field
     - `KubeletNotReady` (`Ready=False`): container runtime issue, network plugin not ready, or kubelet just started
     - `NodeStatusUnknown` (`Ready=Unknown`): kubelet stopped reporting (heartbeat lost; node may have rebooted or crashed)
   - `DiskPressure=True`: imagefs or rootfs above eviction threshold
   - `MemoryPressure=True`: node memory below `--eviction-hard` threshold
   - `PIDPressure=True`: kernel PID exhaustion
   - `NetworkUnavailable=True`: CNI not configured (common right after node creation)
2. **For `Ready=False, KubeletNotReady`**:
   - Container runtime crashed/stopped? `systemctl status containerd`
   - CNI plugin missing/crashed? `ls /etc/cni/net.d/` (should have a `*.conflist` or `*.conf` file); `crictl info` reports the `RuntimeReady` condition as true and `NetworkReady` as false when CNI is the cause
   - Kubelet can't reach API server? `journalctl -u kubelet | grep -i "unable\|refused\|x509"`
   - Kubelet client cert expired? Common on long-lived clusters; symptom is `x509: certificate has expired`
3. **For `NodeStatusUnknown`**:
   - Node entirely unreachable? Try SSH, console, cloud provider status
   - Network partition? Other nodes Ready?
   - Kernel panic / hard hang? Check console / serial output
4. **For Pressure conditions**:
   - `DiskPressure`: `df -hT` shows what's full; common causes are container logs under `/var/log/pods` (or `/var/lib/docker/containers` on Docker-based nodes), image layers in `/var/lib/containerd` that kubelet's image GC failed to reclaim, and unbounded journald
   - `MemoryPressure`: `free -h`, then look at top RSS consumers — often a single misbehaving pod with no memory limit
   - `PIDPressure`: `ps -eLf | wc -l` (threads count against the PID limit, so count them too) and `cat /proc/sys/kernel/pid_max`; usually a fork bomb pod
5. **Walk the kubelet failure modes** if logs show issues:
   - `failed to start ContainerManager: failed to get rootfs info`: runtime/storage issue
   - `Error getting node`: API server connectivity
   - `Unable to register node`: TLS bootstrap or token issue
   - `network plugin is not ready: cni config uninitialized`: CNI install in progress or failed
   - `OOMKilled` of kubelet itself: rare but possible; kubelet cgroup pressure
6. **For managed services (EKS/GKE/AKS)**:
   - Node may be replaced automatically by the node group; don't fight the autoscaler
   - Drain + delete is often the right answer over manual repair
   - Cloud-side events (instance termination, AZ issue) are out-of-band
7. **Recommend safest recovery in order**:
   - **If pods are still running and healthy**: prefer `kubectl cordon` to stop new schedules, then investigate without panic
   - **If pods are unhealthy**: `kubectl drain` to evict (respecting PDBs)
   - **If unrecoverable**: terminate the node and let the autoscaler / node group recreate it
8. Mark DESTRUCTIVE clearly: `kubectl delete node` (loses node object; pods reschedule but state lost), force-restart kubelet during active workload, `crictl rmi` to free disk (removes cached images; pods that need them must pull again).

---

Node name + role: [worker / control-plane]
Cluster type: [EKS / GKE / AKS / kubeadm / k3s / RKE / OpenShift]
Symptom timeline: [when did it go NotReady; what changed before]
`kubectl describe node <node>` (especially Conditions):
```
[PASTE]
```
`systemctl status kubelet` + `journalctl -u kubelet -n 200`:
```
[PASTE]
```
`systemctl status containerd` (or crio):
```
[PASTE]
```
`crictl ps -a` and `crictl info` (if accessible):
```
[PASTE]
```
Resource state: `df -hT`, `df -i`, `free -h`:
```
[PASTE]
```

Why this prompt works

Node NotReady is a category label, not a diagnosis. It can mean kubelet crashed, runtime crashed, CNI broke, disk full, PID exhausted, certificate expired, or network partition. Each has a different recovery path. This prompt forces a Conditions-first walk so you don’t restart kubelet hoping it fixes a CA expiry.

How to use it

  1. Always include the Conditions section from kubectl describe node. The Reason and Message fields tell you which subsystem is unhealthy.
  2. Capture the kubelet journal log around the failure time. Just --since "30 min ago" is fine; the first error in that window is usually the real one.
  3. For multi-node failures, capture from at least two nodes — common to find a control-plane or CNI control issue rather than per-node.
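For the multi-node case in step 3, a small filter can pick out which nodes to capture from. This is a minimal sketch: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
# Reads `kubectl get nodes --no-headers` output on stdin and prints the names of
# nodes whose STATUS column does not start with "Ready".
# "Ready,SchedulingDisabled" (cordoned but healthy) is therefore not listed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Example use on a workstation with cluster access (not run here):
#   for n in $(kubectl get nodes --no-headers | not_ready_nodes); do
#     kubectl describe node "$n" > "describe-$n.txt"
#   done
```

If the list contains more than one node, compare the `Ready` reason and message across the captured files before touching any single node.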

Useful commands

# Cluster + node view
kubectl get nodes
kubectl get nodes -o wide
kubectl describe node <node> | head -60
kubectl top node

# Node conditions only
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{.reason}{" "}{.message}{"\n"}{end}'

# Events related to the node
kubectl get events --field-selector involvedObject.name=<node>,involvedObject.kind=Node

# Pods on the affected node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

# Cordon / drain (safest first)
kubectl cordon <node>                      # stop new schedules
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=10m

# On the node itself (SSH or console)
sudo systemctl status kubelet
sudo systemctl status containerd            # or crio
sudo journalctl -u kubelet --since "30 min ago" -n 200
sudo journalctl -u containerd --since "30 min ago" -n 200

sudo crictl info
sudo crictl ps -a
sudo crictl images
sudo crictl logs <container-id>

# Disk / inodes / memory / PID
df -hT
df -i
free -h
ps -eLf | wc -l                            # -L lists threads, which also consume PIDs
cat /proc/sys/kernel/pid_max

# CNI sanity
ls /etc/cni/net.d/
cat /etc/cni/net.d/*.conflist

# Kubelet client cert (expiry check)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Recover (in order of preference)
sudo systemctl restart kubelet                          # safe; brief blip
sudo systemctl restart containerd && sudo systemctl restart kubelet   # if runtime stuck
sudo crictl rmi --prune                                 # free disk; removes unused images only, so later pulls are slower
# Last resort: cordon, drain, terminate node, let autoscaler replace

Condition decoder

| Condition / Reason | Most likely cause | First fix |
|---|---|---|
| `KubeletNotReady` + `runtime: ...not ready` | Container runtime down | `systemctl restart containerd` |
| `KubeletNotReady` + `network plugin is not ready` | CNI install incomplete or daemonset crashing | Check CNI pods in `kube-system` |
| `KubeletNotReady` + `unable to register` | TLS bootstrap / token issue (new node) | Check bootstrap token, CA |
| `NodeStatusUnknown` | Kubelet stopped reporting | SSH/console; check node alive at all |
| `DiskPressure=True` | imagefs/rootfs above eviction threshold | `crictl rmi --prune`; clean logs |
| `MemoryPressure=True` | `memory.available` below `--eviction-hard` | Identify hog pod; restarting kubelet may release the condition once the pod is evicted |
| `PIDPressure=True` | Kernel PID exhausted | Find runaway-fork pod; reboot if kernel won’t recover |
| `Ready=Unknown` + recent reboot | Normal during reboot | Wait 2 min |
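
The decoder table above can be approximated as a lookup on the `Ready` condition's reason and message. This is only a sketch: `classify_not_ready` and its output labels are hypothetical, and the message patterns are a small sample, not an exhaustive list of what the kubelet emits.

```shell
# Map a Ready-condition reason and message to the subsystem to inspect first.
# Hypothetical helper; patterns are illustrative and checked in order.
classify_not_ready() {
  reason="$1"
  message="$2"
  if [ "$reason" = "NodeStatusUnknown" ]; then
    echo "heartbeat-lost"
    return 0
  fi
  case "$message" in
    # CNI patterns come first because these messages also mention "runtime".
    *"NetworkReady=false"*|*"network plugin is not ready"*|*"cni config uninitialized"*)
      echo "cni" ;;
    *runtime*"not ready"*)
      echo "container-runtime" ;;
    *"nable to register"*)
      echo "bootstrap" ;;
    *x509*)
      echo "certificate" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example inputs, taken from the live object (not run here):
#   reason=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}')
#   message=$(kubectl get node "$NODE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}')
#   classify_not_ready "$reason" "$message"
```

An `unknown` result means the message did not match any sample pattern, so fall back to reading the kubelet journal directly.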

Kubelet cert expiry (common silent killer)

# Check expiry
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates

# Auto-rotation should handle this, but if it didn't:
sudo systemctl restart kubelet
# Watch for rotation in logs:
sudo journalctl -u kubelet -f | grep -i "certificate\|csr"

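# On kubeadm control-plane nodes (lists control-plane certs, not the kubelet client cert):
sudo kubeadm certs check-expiration
# kubeadm has no 'kubelet-client' renew target; the kubelet rotates its own client cert.
# Control-plane certs only: sudo kubeadm certs renew apiserver-kubelet-client    # or 'all', with care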
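
The expiry check above prints raw dates; turning them into "days left" makes the result easier to alert on. A minimal sketch, assuming GNU `date` (for `date -d`) on the node; `cert_days_left` is a hypothetical helper name.

```shell
# Reads the `notAfter=...` line printed by `openssl x509 -noout -enddate` on stdin
# and prints the whole number of days until expiry (negative if already expired).
# Assumes GNU date; returns 2 if the date cannot be parsed.
cert_days_left() {
  read -r line                                   # e.g. "notAfter=Jan  1 00:00:00 2030 GMT"
  end_epoch=$(date -d "${line#notAfter=}" +%s) || return 2
  now_epoch=$(date +%s)
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Example use on the node (not run here):
#   sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate | cert_days_left
```

A negative or single-digit result on a node that is already `NotReady` points at the certificate before anything else.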

DiskPressure recovery (most common cause)

# What's filling /var/lib and /var/log?
sudo du -shx /var/lib/* /var/log/* | sort -h | tail
# Common culprits:
#   /var/lib/containerd: image / overlay layers
#   /var/lib/kubelet/pods/*/volumes: emptyDir, downwardAPI, etc.
#   /var/log/pods: container logs (per-pod dirs)

# Image GC
sudo crictl rmi --prune                          # safe
# (kubelet's own image GC runs on configured thresholds)

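# Container log size: with CRI runtimes (containerd, CRI-O) the kubelet rotates container logs
sudo grep -iE 'containerLogMax(Size|Files)' /var/lib/kubelet/config.yaml   # path assumes kubeadm defaults
# If unset, kubelet defaults apply (10Mi per file, 5 files); tune containerLogMaxSize / containerLogMaxFiles and restart kubelet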

# Journald
sudo journalctl --vacuum-size=500M
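
To see quickly which filesystem is closest to an eviction threshold, the `df` output can be filtered by usage. A sketch only: `full_mounts` is a hypothetical helper, it expects the POSIX layout from `df -P` (not `df -hT`, which adds a Type column), and the default of 85% is an assumption to be adjusted to the node's configured eviction thresholds.

```shell
# Reads `df -P` output on stdin and prints "<mount point> <use%>" for every
# filesystem at or above the given usage threshold (default 85, an assumed value).
# Mount points containing spaces are not handled.
full_mounts() {
  threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign from the Capacity column
    if (use + 0 >= t) print $6, $5
  }'
}

# Example use on the node (not run here):
#   df -P | full_mounts 85      # block usage
#   df -Pi | full_mounts 90     # inode usage
```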

Common findings this catches

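  • DiskPressure due to oversized container logs under /var/log/pods (or /var/lib/docker/containers/<id>/<id>-json.log on Docker-based nodes). Set log rotation: kubelet containerLogMaxSize / containerLogMaxFiles for CRI runtimes, json-file log options for Docker.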
  • MemoryPressure from one pod without memory limit consuming the node — set a default LimitRange in the namespace.
  • CNI pod CrashLoopBackOff (e.g., calico-node) blocking all node-readiness — check the CNI pod, not the node directly.
  • Kubelet client certificate expired on a long-lived cluster whose kubelets were never restarted; in some versions auto-rotation only takes effect after a kubelet restart.
  • PID exhaustion from a buggy app fork-storming in one pod: kill the pod, raise pid_max, and set podPidsLimit in the kubelet config (flag form: --pod-max-pids).
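  • Cloud node terminated by autoscaler while still showing in kubectl get nodes: the cloud controller manager normally removes the Node object once the instance is gone, and pods are evicted after the default 5-minute toleration; if the object lingers, kubectl delete node.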
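
For the PID exhaustion finding above, comparing the current thread count against `pid_max` shows how close the node is to the limit. A minimal sketch; `pid_usage_pct` is a hypothetical helper that only does the arithmetic, so the inputs are gathered separately.

```shell
# Prints used/max as an integer percentage (rounded down).
# Returns 1 if max is zero or negative.
pid_usage_pct() {
  used="$1"
  max="$2"
  [ "$max" -gt 0 ] || return 1
  echo $(( used * 100 / max ))
}

# Example use on the node (not run here); -L counts threads, which also consume PIDs:
#   pid_usage_pct "$(ps -eLf --no-headers | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```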

When to escalate

  • Multi-node simultaneous NotReady → cluster-wide issue (CNI control plane, API server, networking) — fix at that layer, not per-node.
  • Control-plane node NotReady → engage cluster admin; etcd implications.
  • Recurring NotReady on the same node → hardware issue or persistent config drift; investigate or replace the node.
  • Managed service nodes — most providers prefer “terminate and replace” over manual repair.
