Kubernetes Error Guide: 'etcdserver: leader changed' API Server Write Failures
Fix 'rpc error: code = Unavailable desc = etcdserver: leader changed' in Kubernetes: decode etcd Raft elections caused by slow disks, network flaps, and overload.
Exact Error Message
When the kube-apiserver issues a write to etcd while the etcd cluster is electing a new leader, the gRPC call fails and the apiserver surfaces it to the client or logs it:
rpc error: code = Unavailable desc = etcdserver: leader changed
In the apiserver log it usually appears wrapped around a storage operation:
E0628 14:02:11.554321 1 status.go:71] apiserver received an error that is not an metav1.Status: rpcError{code:14, msg:"etcdserver: leader changed"}
W0628 14:02:11.554390 1 storage.go:128] etcdserver: leader changed; retrying Create on /registry/pods/default/web-7c9
code = Unavailable (gRPC code 14) is the headline: etcd told the apiserver the request could not be served right now because leadership moved mid-flight. The write was not committed and must be retried against the new leader.
What the Error Means
etcd uses the Raft consensus protocol. At any moment at most one member is the leader and handles all writes; followers replicate the leader’s log. The leader holds its position by sending a heartbeat every --heartbeat-interval. If a follower hears nothing from the leader for longer than the --election-timeout, it starts a new election, and a different member may become leader.
Any client request (a linearizable read or a mutation) that is in flight when leadership transfers is aborted with etcdserver: leader changed. This is expected, transient behavior: the apiserver’s etcd3 storage layer retries automatically and most writes succeed on the next attempt. The error only becomes a problem when elections happen frequently, which points to an unhealthy cluster, typically a slow leader that cannot send heartbeats in time or followers that cannot acknowledge writes fast enough.
A single occurrence is noise. A storm of them means etcd is unstable, and you will see correlated symptoms: API latency spikes, context deadline exceeded, and controllers logging conflicts.
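For scripts that call kubectl directly, the same retry idea can be applied on the client side. The sketch below is a minimal illustration, not a kubectl feature: retry_on_leader_change, MAX_ATTEMPTS, and INITIAL_DELAY are hypothetical names, and the delay is assumed to be a whole number of seconds. It retries only when stderr contains the leader-changed message and returns any other failure immediately.

```shell
# Hypothetical helper: rerun a command while it fails with
# "etcdserver: leader changed", doubling the wait between attempts.
retry_on_leader_change() {
  local max_attempts="${MAX_ATTEMPTS:-5}"   # assumed default
  local delay="${INITIAL_DELAY:-1}"         # whole seconds, assumed default
  local attempt=1 status err_file
  err_file="$(mktemp)"
  while :; do
    status=0
    "$@" 2>"$err_file" || status=$?
    cat "$err_file" >&2                     # replay the command's stderr
    if [ "$status" -eq 0 ]; then
      rm -f "$err_file"
      return 0
    fi
    # Give up on unrelated errors or once the attempt budget is spent.
    if ! grep -q 'etcdserver: leader changed' "$err_file" ||
       [ "$attempt" -ge "$max_attempts" ]; then
      rm -f "$err_file"
      return "$status"
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

A typical call would be retry_on_leader_change kubectl apply -f manifest.yaml. Idempotent commands such as kubectl apply are the safest fit, since a rerun cannot create duplicates.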
Common Causes
- Slow disk (fsync latency): etcd fsyncs every Raft log entry. A wal_fsync p99 above ~10ms, the ceiling etcd's documentation recommends (HDD, throttled EBS, noisy neighbor), stalls heartbeats and triggers elections.
- Network flaps / high latency: packet loss or latency between etcd peers exceeds the election timeout, so followers declare the leader dead.
- CPU starvation: the etcd process is throttled or competing with other control-plane workloads on an undersized node.
- Overload: too many writes (large clusters, hot controllers, runaway kubectl apply loops) saturate the leader.
- Mis-tuned timeouts: --heartbeat-interval/--election-timeout set too aggressively for the actual RTT between members.
- Clock issues / GC pauses: large etcd heap GC pauses or VM pauses make the leader miss its heartbeat window.
How to Reproduce the Error
On a multi-member etcd cluster, force a leadership transfer and watch a concurrent client:
# Identify the current leader and pick a different member's ID as the target
etcdctl endpoint status --cluster -w table
# move-leader must be sent to the current leader's endpoint (set --endpoints accordingly)
etcdctl move-leader <TARGET_MEMBER_ID>
While that runs, a tight write loop against the apiserver will occasionally fail:
while true; do kubectl create configmap probe-$RANDOM \
--from-literal=k=v -o name || break; done
error: failed to create configmap: rpc error: code = Unavailable desc = etcdserver: leader changed
In production you do not reproduce this deliberately — you observe it under disk or network pressure. Throttling the etcd data disk (low IOPS volume) reliably produces election storms.
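The loop above stops at the first failure. To get a rough sense of how many writes a leadership transfer actually disrupts, a variant can keep going and tally the hits. This is a sketch only: probe_writes is a hypothetical helper, the ConfigMap names are illustrative, and the probe objects it creates are left behind and need deleting afterwards.

```shell
# Hypothetical helper: attempt N ConfigMap creates and count how many
# fail specifically with "etcdserver: leader changed".
probe_writes() {
  local total="$1" i=0 hits=0 out
  while [ "$i" -lt "$total" ]; do
    i=$((i + 1))
    # Capture stdout and stderr so the error text can be inspected.
    if ! out="$(kubectl create configmap "probe-$$-$i" \
        --from-literal=k=v -o name 2>&1)"; then
      case "$out" in
        *"etcdserver: leader changed"*) hits=$((hits + 1)) ;;
      esac
    fi
  done
  echo "$hits/$total writes failed with leader changed"
}
```

Running probe_writes 200 in one terminal while issuing move-leader in another gives a count rather than a single failure; other kinds of errors are ignored by the tally.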
Diagnostic Commands
# Cluster health and which member is leader (look for IS LEADER=true flipping)
etcdctl endpoint status --cluster -w table
etcdctl endpoint health --cluster
# Raft term goes up with every election; a value that keeps climbing between runs means instability
etcdctl endpoint status -w json | grep -o '"raftTerm":[0-9]*'
# Disk fsync and backend commit latency from etcd's own metrics
curl -s http://127.0.0.1:2381/metrics | grep -E 'etcd_disk_wal_fsync_duration|etcd_disk_backend_commit_duration|etcd_server_leader_changes_seen_total'
# Peer round-trip and proposal health
curl -s http://127.0.0.1:2381/metrics | grep -E 'etcd_network_peer_round_trip|etcd_server_proposals_failed_total'
# etcd logs for elections and slow apply warnings
journalctl -u etcd --no-pager | grep -iE 'elected|lost leader|took too long|slow'
The single most useful signal is etcd_server_leader_changes_seen_total — if it increments more than a couple of times per hour, the cluster is unstable. Pair it with etcd_disk_wal_fsync_duration_seconds p99.
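The raw metrics expose the fsync histogram as cumulative buckets, not as a ready-made p99. The sketch below shows one way to read a p99 off those buckets: it prints the upper bound of the first bucket that covers at least 99% of observations. p99_upper_bound is a hypothetical helper; it assumes the histogram has no labels other than le (true for the WAL fsync metric, not for the per-peer round-trip metric), and because the buckets count everything since the process started, the result is a lifetime figure, not a recent one.

```shell
# Hypothetical helper: read Prometheus text from stdin and print the
# smallest bucket bound (in seconds) covering >= 99% of observations.
# $1: histogram base name, e.g. etcd_disk_wal_fsync_duration_seconds
p99_upper_bound() {
  awk -v prefix="${1}_bucket{le=\"" '
    index($0, prefix) == 1 {
      le = substr($0, length(prefix) + 1)
      sub(/".*/, "", le)               # keep only the bucket bound
      n++; bound[n] = le; count[n] = $NF + 0
    }
    END {
      if (n == 0 || count[n] == 0) exit 1
      target = 0.99 * count[n]         # last bucket (+Inf) holds the total
      for (i = 1; i <= n; i++)
        if (count[i] >= target) { print bound[i]; exit 0 }
    }'
}
```

Usage, under the same endpoint assumption as the commands above: curl -s http://127.0.0.1:2381/metrics | p99_upper_bound etcd_disk_wal_fsync_duration_seconds. For a recent-window p99, a Prometheus query over the bucket rates is the better tool; this is only a quick check from the node.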
Step-by-Step Resolution
1. Confirm whether it is transient or a storm. Check etcd_server_leader_changes_seen_total. A flat or rarely-incrementing counter means the occasional error is harmless retry noise — no action needed.
2. Measure disk fsync latency. This is the most common root cause:
curl -s http://127.0.0.1:2381/metrics | grep wal_fsync_duration
If the p99 derived from the histogram buckets exceeds ~10ms (the ceiling etcd's documentation recommends for WAL fsync), etcd is disk-bound. Move etcd to dedicated low-latency SSD/NVMe (provisioned IOPS, not burst), and never share the volume with other I/O.
3. Check peer network. Inspect etcd_network_peer_round_trip_time_seconds. RTT spikes or packet loss between members cause missed heartbeats. Co-locate etcd members within a single low-latency region/AZ-set and verify no firewall or MTU issues on the peer port (2380).
4. Relieve CPU/memory pressure. Ensure etcd runs on a dedicated control-plane node with guaranteed CPU. Watch for GC pauses in logs; size --quota-backend-bytes and keep the DB compacted so the heap stays small.
5. Tune timeouts only if RTT genuinely warrants it. For higher-latency links, raise --heartbeat-interval (e.g. 250; the flag takes milliseconds) and keep --election-timeout at roughly 10x the heartbeat. Never set the election timeout below 10x your measured peer RTT, the floor etcd's tuning guidance gives.
6. Shed write load. Find hot writers (controllers in crash loops, frequent apply jobs, oversized resources) and throttle them. Reducing write QPS reduces leader stress directly.
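The timeout arithmetic from step 5 can be sketched as a small helper. This is an illustration under stated assumptions, not an official sizing tool: suggest_etcd_timeouts is a hypothetical name, the input is the worst peer RTT you measured in whole milliseconds, the heartbeat is set to that RTT but never below etcd's 100 ms default, and the election timeout is 10x the heartbeat.

```shell
# Hypothetical helper: derive etcd timeout flags from a measured peer RTT.
# $1: worst observed peer round-trip time, in whole milliseconds
suggest_etcd_timeouts() {
  local rtt_ms="$1" heartbeat election
  heartbeat="$rtt_ms"
  # Do not go below etcd's default heartbeat of 100 ms.
  if [ "$heartbeat" -lt 100 ]; then
    heartbeat=100
  fi
  # Election timeout at 10x the heartbeat, per the rule of thumb in step 5.
  election=$((heartbeat * 10))
  echo "--heartbeat-interval=${heartbeat} --election-timeout=${election}"
}
```

For example, suggest_etcd_timeouts 250 prints --heartbeat-interval=250 --election-timeout=2500, while any RTT under 100 ms yields the defaults. Whatever values you settle on should be applied identically to every member of the cluster.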
Prevention and Best Practices
- Put etcd on dedicated, consistently fast storage and alert when wal_fsync p99 crosses ~10ms, the ceiling etcd's documentation recommends; disk is the number-one cause of elections.
- Run etcd on dedicated nodes; do not co-schedule it with the apiserver under load or with general workloads.
- Alert on the rate of etcd_server_leader_changes_seen_total; a healthy cluster changes leaders only on planned maintenance.
- Keep members within low-latency network proximity and monitor peer round-trip time.
- Keep the etcd database small via compaction and defrag so GC pauses do not stall heartbeats.
- Size timeouts for your real RTT rather than copying defaults from a different topology.
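Where a full Prometheus alerting setup is not yet in place, the leader-change rate can be spot-checked from a shell. The sketch below samples the counter twice and prints the difference. Both function names are hypothetical, the metrics URL is whatever your etcd exposes (the examples in this guide assume http://127.0.0.1:2381/metrics), and a negative result simply means etcd restarted between samples and the counter reset.

```shell
# Hypothetical helper: read Prometheus text from stdin and print the
# current value of the leader-change counter.
leader_changes() {
  awk '$1 == "etcd_server_leader_changes_seen_total" { print $2 + 0; found = 1 }
       END { if (!found) exit 1 }'
}

# Hypothetical helper: leader changes seen during a sampling window.
# $1: metrics URL, $2: seconds to wait between the two samples
leader_change_delta() {
  local url="$1" window="$2" before after
  before="$(curl -fsS "$url" | leader_changes)" || return 1
  sleep "$window"
  after="$(curl -fsS "$url" | leader_changes)" || return 1
  echo $((after - before))
}
```

For instance, leader_change_delta http://127.0.0.1:2381/metrics 3600 reports how many leader changes this member saw over an hour, which can then be compared against whatever threshold you alert on.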
Related Errors
- etcd request timed out — the slow-disk/overload cousin where the request never completes.
- context deadline exceeded — apiserver-to-etcd calls that time out, often during election storms.
- mvcc: database space exceeded — a full backend that also destabilizes writes.
Frequently Asked Questions
Is etcdserver: leader changed always a problem? No. A single occurrence during planned maintenance, a rolling restart, or an automatic election is normal and the apiserver retries transparently. Only a high rate of these errors signals an unhealthy cluster.
Why did my kubectl apply fail with this instead of being retried? The apiserver retries internally, but if the election lasts longer than the request’s deadline the error propagates to your client. Re-running the command almost always succeeds once a new leader is stable.
Will adding more etcd members make this better? Not usually. More members mean a larger majority that must acknowledge each write, which adds write latency and can increase election sensitivity. Stick to 3 or 5 members and fix the underlying disk/network instead.
How do I tell disk from network as the cause? Compare etcd_disk_wal_fsync_duration_seconds against etcd_network_peer_round_trip_time_seconds. Whichever shows elevated p99 at the moments leader changes increment is your culprit; disk is the more common of the two.
Can I just raise the election timeout to stop the errors? It can mask mild instability, but if disk fsync is the real cause, a longer timeout only delays elections while write latency stays bad. Fix the storage first; tune timeouts only for genuine RTT.