AI for Kubernetes & Helm · By James Joyner IV · 9 min read

Kubernetes Error Guide: 'node is unreachable' Taint and Pods Stuck Terminating

Fix Kubernetes node.kubernetes.io/unreachable: kubelet-to-apiserver heartbeat loss, NotReady nodes, pods stuck Terminating/Unknown, and network partitions.


Exact Error Message

When the control plane stops hearing from a node, the node-lifecycle controller marks it unreachable and applies a taint; its pods drift into Terminating or Unknown:

$ kubectl get nodes
NAME       STATUS     ROLES    AGE   VERSION
worker-4   NotReady   <none>   91d   v1.30.2

$ kubectl describe node worker-4 | grep -A3 Taints
Taints:  node.kubernetes.io/unreachable:NoExecute
         node.kubernetes.io/unreachable:NoSchedule

$ kubectl get pods -o wide | grep worker-4
api-66d   1/1   Terminating   0   2d   10.0.4.7   worker-4
job-9f2   1/1   Unknown       0   2d   10.0.4.8   worker-4

The node event log shows the heartbeat loss:

  Warning  NodeNotReady  4m  node-controller  Node worker-4 status is now: NodeNotReady

What the Error Means

Each node’s kubelet sends a heartbeat to the API server by renewing its Lease object in the kube-node-lease namespace (and by updating the Node status). The node-lifecycle controller in kube-controller-manager watches those leases. If a node's lease is not renewed within --node-monitor-grace-period (40s by default on the v1.30 release shown above), the controller sets the node's Ready condition to Unknown, kubectl reports the node as NotReady, and the controller applies the taint node.kubernetes.io/unreachable.

That taint carries two effects: NoSchedule (no new pods) and NoExecute (existing pods are evicted after their toleration window, default 300s). But here is the catch: if the node is truly partitioned, the kubelet there cannot confirm the pods are gone, so the API objects stay Terminating/Unknown indefinitely. The control plane cannot prove the workloads stopped, only that it can no longer talk to the node.
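The timing above reduces to simple arithmetic. The sketch below assumes the defaults quoted in the text (40s grace period, 300s toleration); eviction_delay_seconds is a hypothetical helper name, and the result is only a rough lower bound, since controller sync intervals add a few seconds.

```shell
# Rough lower bound on how long after the last heartbeat a pod on a silent
# node is marked for deletion: monitor grace period plus the pod's
# tolerationSeconds for node.kubernetes.io/unreachable.
# eviction_delay_seconds is a hypothetical helper, not a kubectl feature.
eviction_delay_seconds() (
  grace="${1:-40}"        # --node-monitor-grace-period, per the text above
  toleration="${2:-300}"  # default tolerationSeconds, per the text above
  echo $((grace + toleration))
)

# To read the toleration a real pod carries (placeholders, not real names):
#   kubectl get pod <POD> -n <NAMESPACE> \
#     -o jsonpath='{.spec.tolerations[?(@.key=="node.kubernetes.io/unreachable")].tolerationSeconds}'

eviction_delay_seconds          # defaults: prints 340
eviction_delay_seconds 40 60    # a 60s toleration: prints 100
```

With the defaults, a pod is marked for deletion roughly 340 seconds after the last heartbeat. Whether the container has actually stopped is a separate question the control plane cannot answer.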

Common Causes

  • Network partition — the node lost connectivity to the API server (security group, route, VPN, or NIC failure).
  • Kubelet crashed or hung — the kubelet process died, froze, or its certificate expired, so no lease updates are posted.
  • Node powered off / terminated — a VM was stopped or reclaimed (spot interruption) without draining.
  • Control-plane endpoint unreachable — DNS or load-balancer issues between node and apiserver.
  • Clock skew / expired credentials — kubelet auth fails, heartbeats rejected.
  • Resource starvation — the node is so overloaded the kubelet cannot post in time.

How to Reproduce the Error

On a disposable worker, stop the kubelet (or sever its path to the API server) and watch the node go unreachable:

# On a test node only — stops heartbeats
sudo systemctl stop kubelet

# From a management host, observe the transition
kubectl get nodes -w

After the grace period the node flips to NotReady and gains the node.kubernetes.io/unreachable taint. Once the pods' toleration window expires (about five minutes by default), pods on it enter Terminating/Unknown. Restarting the kubelet restores the heartbeat and clears the taint.
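To script the observation instead of watching by eye, a small polling loop can wait for the taint to appear. This is a minimal sketch: wait_for_unreachable_taint is a hypothetical helper name, and the attempt count and interval are arbitrary defaults chosen for illustration.

```shell
# Poll until the node carries the node.kubernetes.io/unreachable taint,
# or give up after a fixed number of attempts.
# wait_for_unreachable_taint is a hypothetical helper name.
wait_for_unreachable_taint() (
  node="$1"
  attempts="${2:-30}"   # how many times to poll (assumed default)
  interval="${3:-5}"    # seconds between polls (assumed default)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Prints the node's taint keys separated by spaces
    keys="$(kubectl get node "$node" -o jsonpath='{.spec.taints[*].key}')"
    case " $keys " in
      *" node.kubernetes.io/unreachable "*) echo "tainted"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timeout"
  return 1
)

# Usage on a test cluster:
#   wait_for_unreachable_taint worker-4 30 5
```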

Diagnostic Commands

# Node status, last heartbeat, and conditions
kubectl get nodes -o wide
kubectl describe node <NODE> | grep -A8 Conditions

# Inspect the node Lease (last renew time tells you when heartbeats stopped)
kubectl get lease <NODE> -n kube-node-lease -o yaml | grep renewTime

# Pods stranded on the node
kubectl get pods -A -o wide --field-selector spec.nodeName=<NODE>

# From the node itself (if reachable): is the kubelet alive and talking to apiserver?
ssh <NODE> 'systemctl status kubelet --no-pager; journalctl -u kubelet --no-pager -n 50'

# Test connectivity from node to the API server endpoint
# (any HTTP response, even 401/403, shows the network path works)
ssh <NODE> 'curl -sk https://<APISERVER>:6443/healthz; echo'

The Lease renewTime records the last successful heartbeat, so it bounds when the heartbeats stopped. The kubelet journal tells you whether the kubelet is down, partitioned, or rejected.
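Turning renewTime into "seconds since last heartbeat" makes it easier to compare against the grace period. The sketch below assumes GNU date (for the -d flag) and an RFC 3339 timestamp as printed by kubectl; lease_age_seconds is a hypothetical helper name.

```shell
# Convert a Lease renewTime (e.g. 2024-06-01T12:00:00.000000Z) into the
# number of seconds since that heartbeat. Assumes GNU date (-d).
# lease_age_seconds is a hypothetical helper name.
lease_age_seconds() (
  renew="$1"
  now="${2:-$(date -u +%s)}"   # optional: epoch seconds to compare against
  # Drop the trailing Z and any fractional seconds, then re-add Z for UTC.
  base="${renew%Z}"
  base="${base%%.*}"
  then_epoch="$(date -u -d "${base}Z" +%s)" || return 1
  echo $((now - then_epoch))
)

# Usage against a live cluster (placeholder as in the commands above):
#   renew="$(kubectl get lease <NODE> -n kube-node-lease -o jsonpath='{.spec.renewTime}')"
#   lease_age_seconds "$renew"
```

An age well beyond the grace period, on a node you also cannot reach over SSH, points toward a dead or partitioned node rather than a brief blip.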

Step-by-Step Resolution

1. Decide: transient or dead. Check the Lease renewTime and try to reach the node. A brief network blip self-heals once heartbeats resume. A node that is off or permanently partitioned will not recover on its own.

2. Restore the kubelet/connectivity if recoverable. SSH in (if possible) and inspect systemctl status kubelet and journalctl -u kubelet. Restart a crashed kubelet, renew an expired kubelet cert, or fix the network path. Once the lease updates again, the controller removes the taint and the node returns to Ready.

3. Confirm whether workloads actually moved. Controllers (Deployments/ReplicaSets) recreate pods elsewhere after the NoExecute toleration expires (~5 min). StatefulSet pods do not — Kubernetes will not start a replacement while the old pod might still be running, to avoid split-brain.

4. For a permanently dead node, delete it cleanly. Removing the Node object forces the control plane to give up on its pods so they can reschedule:

kubectl delete node <NODE>

Only do this once you are certain the node is truly gone, or you risk two instances of a stateful pod running at once.

5. Clear stranded pods if needed. Pods stuck Terminating because their node never confirmed deletion are released when the Node object is deleted (or the node returns). Avoid kubectl delete pod --force --grace-period=0 on stateful workloads unless you have confirmed the node is dead: a forced delete removes only the API object and does not stop a container that is still running on the node.

6. Address the root cause. Fix the spot-interruption handling, network path, or resource pressure so the node does not silently drop again.
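Steps 3 and 4 can be sketched as two small helpers: one lists the StatefulSet-owned pods stranded on a node, the other refuses to delete the Node object unless its Ready condition is Unknown and the operator has explicitly confirmed the machine is gone. Both function names and the CONFIRM_NODE_DEAD variable are made up for this sketch; Kubernetes itself cannot verify that a node is really powered off.

```shell
# List pods on a node that are owned by a StatefulSet. These are the pods
# that will not be replaced until the node returns or is removed.
# statefulset_pods_on_node is a hypothetical helper name.
statefulset_pods_on_node() (
  node="$1"
  kubectl get pods -A --no-headers --field-selector "spec.nodeName=$node" \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind' |
    awk '$3 == "StatefulSet" { print $1 "/" $2 }'
)

# Delete the Node object only if Ready=Unknown AND the operator has asserted
# the machine is gone. CONFIRM_NODE_DEAD is an invented variable for this
# sketch, not something kubectl reads.
delete_dead_node() (
  node="$1"
  ready="$(kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  if [ "$ready" != "Unknown" ]; then
    echo "refusing: Ready=$ready" >&2
    return 1
  fi
  if [ "${CONFIRM_NODE_DEAD:-}" != "yes" ]; then
    echo "refusing: set CONFIRM_NODE_DEAD=yes once the machine is verified off" >&2
    return 1
  fi
  kubectl delete node "$node"
)

# Usage:
#   statefulset_pods_on_node worker-4
#   CONFIRM_NODE_DEAD=yes delete_dead_node worker-4
```

The explicit confirmation is deliberate: Ready=Unknown only means the control plane cannot hear the node, which is equally true of a partitioned node that is still running its pods.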

Prevention and Best Practices

  • Run a node-problem-detector and alert on NotReady/Unknown nodes and stale kube-node-lease renew times.
  • Use graceful node shutdown and cloud-provider drain hooks (especially for spot/preemptible nodes) so pods are evicted cleanly before power-off.
  • Keep tolerationSeconds for the unreachable taint tuned to your failover SLA — shorter for fast rescheduling, but not so short that brief blips cause churn.
  • Spread replicas across nodes and zones with topology constraints so one unreachable node never takes a whole service down.
  • Monitor kubelet certificate expiry and clock sync; both silently kill heartbeats.
  • For StatefulSets, automate fencing so a confirmed-dead node’s pods can be safely replaced.
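As one way to tune tolerationSeconds for the unreachable taint, the sketch below generates a merge patch for a workload's pod template. The helper name unreachable_toleration_patch is hypothetical and the 60-second value is only an example; the JSON fields are the standard pod toleration fields.

```shell
# Emit a merge patch that sets tolerationSeconds for the unreachable taint
# on a workload's pod template.
# unreachable_toleration_patch is a hypothetical helper name.
unreachable_toleration_patch() (
  seconds="$1"
  printf '{"spec":{"template":{"spec":{"tolerations":[{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":%d}]}}}}\n' "$seconds"
)

unreachable_toleration_patch 60

# Apply it (placeholders; this replaces the template's whole tolerations list,
# so include any other tolerations the workload already needs):
#   kubectl patch deployment <DEPLOYMENT> -n <NAMESPACE> --type=merge \
#     -p "$(unreachable_toleration_patch 60)"
```

Patching the pod template triggers a rollout, so try it on a non-critical workload first and pick a value that comfortably exceeds the network blips you normally see.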

Frequently Asked Questions

Why are my pods stuck Terminating forever on an unreachable node? Because the kubelet on that node must confirm the pod is gone before the API object is removed, and a partitioned/dead node can never send that confirmation. The objects clear only when the node returns or you delete the Node object.

What is the difference between NotReady and the unreachable taint? NotReady is the node’s status (Ready condition is False or Unknown). The node.kubernetes.io/unreachable taint is what the node-lifecycle controller adds when Ready=Unknown specifically because heartbeats stopped, triggering NoExecute eviction of pods after their toleration window.
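The distinction can be written down as a small lookup. This is an illustrative sketch, not controller code: taint_for_ready_status is a hypothetical helper, and the not-ready taint key shown for Ready=False is the companion taint Kubernetes uses when the kubelet itself reports the node unhealthy.

```shell
# Map a node's Ready condition status to the taint key the node-lifecycle
# controller applies for it.
# taint_for_ready_status is a hypothetical helper name.
taint_for_ready_status() (
  case "$1" in
    True)    echo "" ;;                                 # healthy: no taint
    False)   echo "node.kubernetes.io/not-ready" ;;     # kubelet reports unhealthy
    Unknown) echo "node.kubernetes.io/unreachable" ;;   # heartbeats stopped
    *)       echo "unexpected Ready status: $1" >&2; return 1 ;;
  esac
)

taint_for_ready_status Unknown   # prints node.kubernetes.io/unreachable

# Feed it a live value (placeholder as in the commands above):
#   taint_for_ready_status "$(kubectl get node <NODE> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
```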

Is it safe to kubectl delete node to recover faster? Only when you are certain the node is truly dead. Deleting the Node object lets stateful pods reschedule, but if the original node is merely partitioned and still running, you can end up with two live instances of the same pod: a split-brain that can corrupt data.

Why did my Deployment recover but my StatefulSet did not? Deployment pods are interchangeable, so the controller recreates them on healthy nodes after the eviction timeout. StatefulSet pods have stable identity and at-most-one semantics; Kubernetes refuses to start a replacement while the original might still be running on the unreachable node, until that node is removed or fenced.
