Kubernetes Error Guide: 'PLEG is not healthy' Node NotReady from Kubelet
Fix 'PLEG is not healthy: pleg was last seen active ... ago' in Kubernetes: diagnose hung containerd/docker, slow runtime relisting, and node NotReady flapping.
Exact Error Message
The kubelet logs this when its Pod Lifecycle Event Generator cannot reach the container runtime in time, and the node is marked NotReady:
I0628 11:47:03.882190 1823 kubelet.go:1820] "skipping pod synchronization" err="PLEG is not healthy: pleg was last seen active 3m12.45s ago; threshold is 3m0s"
E0628 11:47:03.901774 1823 kubelet_node_status.go:447] "Error updating node status, will retry" err="PLEG is not healthy"
The corresponding node condition appears in kubectl describe node:
Conditions:
  Type    Status  Reason           Message
  ----    ------  ------           -------
  Ready   False   KubeletNotReady  PLEG is not healthy: pleg was last seen active 3m12.45s ago; threshold is 3m0s
The key phrase pleg was last seen active 3m12.45s ago; threshold is 3m0s means the runtime listing loop has not completed a cycle within the 3-minute health threshold.
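To compare the two durations in that phrase numerically, for example in a log-scanning script, the following is a minimal sketch. The helper names go_duration_to_seconds and pleg_overrun_seconds are hypothetical, and the parser only handles the h, m, and s units that appear in this message.

```shell
# Convert a Go-style duration such as "3m12.45s" or "1h2m3s" to seconds.
# Sketch only: handles h, m, and s; sub-second units (ms, us) are NOT handled.
go_duration_to_seconds() {
  awk -v d="$1" 'BEGIN {
    total = 0
    while (match(d, /^[0-9.]+[hms]/)) {
      num  = substr(d, 1, RLENGTH - 1)   # numeric part
      unit = substr(d, RLENGTH, 1)       # trailing unit letter
      if (unit == "h") total += num * 3600
      else if (unit == "m") total += num * 60
      else total += num
      d = substr(d, RLENGTH + 1)         # drop the consumed piece
    }
    printf "%.2f\n", total
  }'
}

# Print how many seconds past the threshold a PLEG message is.
# Expects the "last seen active X ago; threshold is Y" wording shown above.
pleg_overrun_seconds() {
  seen=$(printf '%s\n' "$1" | sed -n 's/.*last seen active \([0-9hms.]*\) ago.*/\1/p')
  threshold=$(printf '%s\n' "$1" | sed -n 's/.*threshold is \([0-9hms.]*\).*/\1/p')
  awk -v a="$(go_duration_to_seconds "$seen")" \
      -v b="$(go_duration_to_seconds "$threshold")" \
      'BEGIN { printf "%.2f\n", a - b }'
}
```

Passing the message from the log line above to pleg_overrun_seconds reports how far past the threshold the last successful relist was.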
What the Error Means
The PLEG (Pod Lifecycle Event Generator) is the kubelet component that tracks container state. On a timer (every 1s by default) it calls the container runtime (containerd, CRI-O, or Docker via dockershim/cri-dockerd) to relist all pods and containers, computes what changed, and emits lifecycle events that drive pod sync.
PLEG records a timestamp each time a relist completes successfully. A separate Healthy() check compares “now” against that timestamp; if the gap exceeds the 3-minute threshold (relistThreshold), PLEG reports unhealthy. The kubelet then refuses to sync pods and flips the node to NotReady, which can evict workloads.
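The timestamp comparison described above can be sketched in a few lines of shell. This is an illustration of the logic only, not the kubelet's code: the function name pleg_is_healthy and the variable RELIST_THRESHOLD_SECONDS are hypothetical stand-ins, and times are plain epoch seconds.

```shell
# Hypothetical stand-in for the kubelet's 3-minute relistThreshold.
RELIST_THRESHOLD_SECONDS=180

# pleg_is_healthy LAST_RELIST_EPOCH [NOW_EPOCH]
# Exits 0 when the last successful relist is within the threshold, 1 otherwise.
pleg_is_healthy() {
  last_relist=$1
  now=${2:-$(date +%s)}            # default to the current time
  elapsed=$((now - last_relist))
  if [ "$elapsed" -gt "$RELIST_THRESHOLD_SECONDS" ]; then
    echo "PLEG is not healthy: last seen active ${elapsed}s ago; threshold is ${RELIST_THRESHOLD_SECONDS}s"
    return 1
  fi
  echo "PLEG healthy: last relist ${elapsed}s ago"
}
```

The point of the sketch is that health is derived purely from the age of the last completed relist, so a single hung runtime call is enough to trip it once the gap passes the threshold.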
The root meaning is almost always the same: a call into the container runtime hung or got very slow, so the relist loop stalled. PLEG itself is rarely the bug — it is the messenger reporting that containerd/dockerd did not answer in time. This is why “PLEG is not healthy” is really a symptom of a sick or overloaded runtime, slow disk, or too many containers per node.
Common Causes
- Hung container runtime: containerd or dockerd is deadlocked, restarting, or blocked on a stuck shim/container.
- Slow disk I/O: runtime image/snapshotter operations stall on a saturated or failing disk, so relist calls block.
- Node resource exhaustion: CPU/memory pressure starves the kubelet/runtime threads.
- Too many pods/containers: relisting hundreds of containers per cycle exceeds the threshold on a busy node.
- Runtime API latency: CRI calls (ListPodSandbox, ListContainers) are slow due to backend issues.
- Network/registry stalls inside CRI: image operations holding runtime locks.
- Kernel/cgroup issues: D-state (uninterruptible) processes hanging the runtime.
How to Reproduce the Error
The reliable trigger is a stalled runtime. On a test node you can pause the runtime to simulate a hang:
# Freeze the containerd process to stall CRI relist calls (test node only)
sudo kill -STOP $(pgrep -x containerd)
# Wait past the 3-minute threshold, then watch the node go NotReady
kubectl get node <NODE> -w
NAME STATUS ROLES AGE VERSION
worker-2 Ready <none> 40d v1.29.4
worker-2 NotReady <none> 40d v1.29.4
Resume with sudo kill -CONT $(pgrep -x containerd) and the node returns to Ready after the next successful relist. In production this happens on its own when the runtime is genuinely stuck.
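When running this experiment it is easy to forget the resume step. The sketch below wraps the freeze in a helper that always sends SIGCONT when it exits, even if interrupted. The name freeze_for is hypothetical; the function body runs in a subshell so its traps do not leak into the calling shell.

```shell
# freeze_for PID SECONDS
# Send SIGSTOP to PID, wait SECONDS, then resume it. Test nodes only.
# The body is a subshell, so the traps below apply only to this call.
freeze_for() (
  pid=$1
  seconds=$2
  kill -STOP "$pid" 2>/dev/null || { echo "cannot signal PID $pid" >&2; exit 1; }
  # Always resume the process when this subshell exits, however it exits.
  trap 'kill -CONT "$pid" 2>/dev/null' EXIT
  # Turn Ctrl-C or a TERM signal into a normal exit so the EXIT trap runs.
  trap 'exit 130' INT TERM
  sleep "$seconds"
)
```

As an example, running sudo sh -c with this function defined and calling freeze_for "$(pgrep -x containerd)" 200 would hold the runtime just past the 3-minute threshold, assuming pgrep returns a single PID.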
Diagnostic Commands
# Confirm the node condition and reason
kubectl describe node <NODE> | grep -A8 '^Conditions:'
# Kubelet logs around the PLEG threshold breach
journalctl -u kubelet --no-pager | grep -iE 'PLEG|relist|skipping pod synchronization'
# Is the runtime alive and answering? Check service state and CRI responsiveness
systemctl status containerd --no-pager
journalctl -u containerd --no-pager | tail -50
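# PLEG relist latency from kubelet metrics (a p99 rising steadily above its normal baseline is the warning sign)
# Port 10255 is the kubelet read-only port and is disabled on many clusters; if so, try:
#   kubectl get --raw "/api/v1/nodes/<NODE>/proxy/metrics" | grep kubelet_pleg_relist_duration_seconds
curl -s http://127.0.0.1:10255/metrics 2>/dev/null | grep kubelet_pleg_relist_duration_seconds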
# Disk and process state on the node — look for high iowait and D-state procs
iostat -x 2 3
ps -eo pid,stat,wchan,cmd | awk 'NR == 1 || $2 ~ /^D/'
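kubelet_pleg_relist_duration_seconds is the most direct signal: when its p99 climbs well above the node's normal baseline, relists are slowing and a NotReady flip becomes more likely. A relist that hangs outright never completes, so it may not show up in this histogram at all; correlate with containerd log gaps and disk iowait.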
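The raw histogram lines are hard to read at a glance. As a rough sketch, the helper below reads the metrics text on stdin and prints how many relists took longer than a given bucket bound. The name slow_relists is hypothetical, and the bound you pass must match an le value your kubelet actually exports (bucket bounds can differ between versions).

```shell
# slow_relists LE
# Read Prometheus text-format kubelet metrics on stdin and print the number of
# relists that took longer than LE seconds: total count minus the le=LE bucket.
slow_relists() {
  awk -v le="$1" '
    index($0, "kubelet_pleg_relist_duration_seconds_bucket{le=\"" le "\"}") == 1 {
      under = $NF; found = 1
    }
    index($0, "kubelet_pleg_relist_duration_seconds_count") == 1 { total = $NF }
    END {
      if (!found) { print "no bucket with le=" le > "/dev/stderr"; exit 1 }
      printf "%.0f\n", total - under
    }'
}
```

For example, piping the curl output from the diagnostic commands into slow_relists 1 prints the number of relists slower than one second since the kubelet started. Sampling it twice a minute apart shows whether that number is still growing.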
Step-by-Step Resolution
1. Check the runtime first. Confirm containerd/dockerd is running and responsive. If crictl ps or systemctl status hangs, the runtime is your problem, not the kubelet.
2. Look for a stuck container or shim. A single wedged container can block relist. Identify D-state processes (ps -eo stat) and the offending sandbox; killing the stuck shim often unblocks the relist loop.
3. Inspect disk health. Run iostat -x. Sustained high %util/await means the runtime’s image/snapshot operations are stalling. Move to faster storage or relieve the I/O hog.
4. Relieve node pressure. If CPU/memory is exhausted, the runtime cannot service CRI calls promptly. Reduce pod density or scale the node up; ensure kubelet has reserved resources (--kube-reserved, --system-reserved).
5. Restart the runtime as a recovery step. If it is genuinely hung and unrecoverable, restarting containerd clears the stall; the kubelet reconnects and PLEG recovers on the next successful relist. Drain the node first when possible.
6. Reduce container count per node. If relist is simply slow because hundreds of containers are listed each cycle, cap pods per node and spread workloads across more nodes.
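Step 1 depends on noticing that a command hangs rather than fails. One way to make that explicit is to run each check under a deadline; the sketch below assumes GNU coreutils timeout, which exits with status 124 when the deadline is hit. The helper name probe is hypothetical.

```shell
# probe SECONDS COMMAND [ARGS...]
# Run COMMAND with a deadline and report whether it answered, failed, or hung.
# Assumes GNU coreutils `timeout` (exit status 124 means the deadline expired).
probe() {
  limit=$1
  shift
  timeout "$limit" "$@" >/dev/null 2>&1 && status=0 || status=$?
  case "$status" in
    0)   echo "ok: $*" ;;
    124) echo "HUNG (no answer in ${limit}s): $*" ;;
    *)   echo "failed (exit $status): $*" ;;
  esac
  return "$status"
}
```

On a node you might run probe 10 crictl ps and probe 10 systemctl status containerd --no-pager. A HUNG result for the runtime while other commands answer promptly points at the runtime, not the kubelet.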
Prevention and Best Practices
- Alert on kubelet_pleg_relist_duration_seconds p99 and on runtime restart counts; both are leading indicators of PLEG-driven NotReady flips.
- Keep the container runtime patched; many PLEG incidents trace to known containerd/runc deadlock bugs.
- Use fast, dedicated disks for the runtime's data dir and alert on disk iowait.
- Reserve resources for the kubelet and system daemons so the runtime is never starved.
- Cap pods per node sensibly; very high density lengthens every relist cycle.
- Drain nodes before runtime maintenance so a stall does not evict live workloads.
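To see whether pod density is uneven before deciding on a cap, a small tally helps. This is a sketch under the assumption that you feed it one node name per line; the helper name pods_per_node is hypothetical.

```shell
# pods_per_node
# Read one node name per line on stdin and print "COUNT NODE", busiest first.
# Lines reading "<none>" (pods not yet scheduled to a node) are skipped.
pods_per_node() {
  grep -v '^<none>$' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

A possible way to feed it is kubectl get pods -A --field-selector=status.phase=Running -o custom-columns=NODE:.spec.nodeName --no-headers | pods_per_node. Nodes near the top of the list are the first candidates for rebalancing.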
Related Errors
- node not ready — the broader NotReady condition PLEG is one cause of.
- failed to sync pod — pod sync failures that share runtime root causes.
- Evicted: the node was low on resource — what can follow when a stressed node sheds pods.
Frequently Asked Questions
Is PLEG itself the bug? Almost never. PLEG just reports that a relist did not complete in time. The actual fault is in the container runtime, the disk, or node resource pressure that prevented the runtime from answering CRI calls promptly.
Why is the threshold 3 minutes? That is the kubelet’s relistThreshold — a deliberately generous window so brief runtime hiccups do not flap the node. If you breach 3 minutes, the runtime was unresponsive for a long time, which is genuinely unhealthy.
The node went NotReady then recovered on its own — what happened? The runtime stalled long enough to breach the threshold, then recovered (a slow operation finished, or the runtime restarted). The next successful relist updated PLEG’s timestamp and the kubelet marked the node Ready again.
Does restarting the kubelet fix it? Usually not, because the kubelet is not the stuck component. Restarting the container runtime is the effective recovery action when it is hung. Restart the kubelet only after the runtime is healthy.
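How do I stop pods being evicted during these flaps? You can lengthen tolerationSeconds on the pods' node.kubernetes.io/not-ready and node.kubernetes.io/unreachable tolerations, or raise the kube-controller-manager's --node-monitor-grace-period, but the durable fix is preventing the stall: a healthy runtime, fast disk, and reserved resources so relist never approaches the threshold.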