Kafka Error Guide: 'NOT_LEADER_OR_FOLLOWER' stale partition metadata on clients
Fix Kafka clients hitting NOT_LEADER_OR_FOLLOWER after a leader moves. Understand metadata refresh, retries, advertised.listeners, and why it self-heals.
- #kafka
- #troubleshooting
- #errors
- #partitions
When a partition leader moves, in-flight clients briefly keep talking to the old broker until they refresh their cached metadata. During that window you see NOT_LEADER_OR_FOLLOWER errors in producer and consumer logs. This is usually a transient, self-healing condition, but the same symptom can mask a real misconfiguration. This guide explains the metadata refresh cycle and how to tell a harmless blip from a genuine problem.
Exact Error Message
The client logs the rejection and announces that it will refresh metadata and retry. On a producer you will see something like this from NetworkClient and the Sender:
[2026-06-24 09:41:22,118] WARN [Producer clientId=payments-svc] Received invalid metadata error in produce request on partition payments-7 due to org.apache.kafka.common.errors.NotLeaderOrFollowerException; will refresh metadata and retry (org.apache.kafka.clients.producer.internals.Sender)
[2026-06-24 09:41:22,119] WARN [Producer clientId=payments-svc] Got error produce response with correlation id 5521 on topic-partition payments-7, retrying (1 attempts left). Error: NOT_LEADER_OR_FOLLOWER (org.apache.kafka.clients.producer.internals.Sender)
[2026-06-24 09:41:22,260] DEBUG [Producer clientId=payments-svc] Updating last seen epoch for partition payments-7, refreshing metadata (org.apache.kafka.clients.NetworkClient)
A consumer shows the same root error during fetch:
[2026-06-24 09:41:25,004] INFO [Consumer clientId=ledger-1 groupId=ledger] Error while fetching from partition payments-7: NOT_LEADER_OR_FOLLOWER. Will rediscover the leader (org.apache.kafka.clients.consumer.internals.Fetcher)
What the Error Means
NOT_LEADER_OR_FOLLOWER is a retriable broker response. It means the broker the client contacted is no longer the leader (and is not a follower configured to serve that request) for the partition. Crucially, a leader does exist somewhere in the cluster; the client simply has stale routing information.
Clients cache topic metadata, including which broker leads each partition, and refresh it on a schedule controlled by metadata.max.age.ms (default 300000 ms, five minutes). When a leader migrates, the client’s cache is briefly wrong. Receiving NOT_LEADER_OR_FOLLOWER forces an immediate metadata refresh out of band, after which the retried request lands on the correct broker. This is why the condition normally clears on its own within a few hundred milliseconds.
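The two refresh paths described above (scheduled by age, and forced by the error) can be illustrated with a toy model. This is a minimal sketch, not Kafka client code: `ToyCluster` and `ToyClient` are hypothetical names invented for illustration, and the only value taken from the guide is the 300000 ms default for metadata.max.age.ms.

```python
"""Toy model of a client's leader cache: refresh on age or on error."""

NOT_LEADER_OR_FOLLOWER = "NOT_LEADER_OR_FOLLOWER"


class ToyCluster:
    """Hypothetical stand-in for a cluster: tracks the real leader of one partition."""

    def __init__(self, leader):
        self.leader = leader

    def send(self, broker_id):
        # A broker that is not the leader rejects the request with a retriable error.
        return "OK" if broker_id == self.leader else NOT_LEADER_OR_FOLLOWER


class ToyClient:
    """Hypothetical client that caches the leader and refreshes on error."""

    def __init__(self, cluster, max_age_ms=300_000):
        self.cluster = cluster
        self.max_age_ms = max_age_ms
        self.cached_leader = cluster.leader
        self.fetched_at_ms = 0
        self.refreshes = 0

    def refresh(self, now_ms):
        self.cached_leader = self.cluster.leader
        self.fetched_at_ms = now_ms
        self.refreshes += 1

    def produce(self, now_ms, retries=3):
        # Scheduled refresh: only when the cache is older than max_age_ms.
        if now_ms - self.fetched_at_ms >= self.max_age_ms:
            self.refresh(now_ms)
        for attempt in range(1 + retries):
            if self.cluster.send(self.cached_leader) == "OK":
                return attempt  # number of retries that were needed
            # Error-triggered refresh: does not wait for max_age_ms.
            self.refresh(now_ms)
        raise RuntimeError("retries exhausted")


cluster = ToyCluster(leader=1)
client = ToyClient(cluster)
first = client.produce(now_ms=1_000)   # cache is correct, no retry
cluster.leader = 2                     # the leader moves
second = client.produce(now_ms=2_000)  # stale cache, refreshed on error
print(first, second, client.cached_leader)
```

In this model the second send succeeds after a single retry even though the cache is only two seconds old, which is why the five-minute max age does not delay recovery from a leader move.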
Common Causes
- Leader changed and client metadata is not yet refreshed. A broker restart, partition reassignment, or preferred-leader election moved the leader. The client’s cached leader is momentarily stale. This is the benign, expected case.
- metadata.max.age.ms set too high. If you have disabled the error-triggered refresh path or are reasoning about steady-state staleness, an excessive max age keeps clients pointing at old leaders longer than necessary.
- Transient network blip. A short partition between client and broker can make the client see a leader change it would otherwise have tracked smoothly.
- Stale DNS for the bootstrap or broker hosts. If hostnames resolve to old IPs after infrastructure changes, the client keeps reaching the wrong endpoints.
- Client cannot reach the new leader due to advertised.listeners. If brokers advertise addresses the client cannot route to, the refreshed metadata points at an unreachable leader and the errors persist instead of clearing.
- Retries or delivery timeout too low. If retries, delivery.timeout.ms, or the consumer’s retry budget are too small, requests fail and surface as errors before the refresh-and-retry cycle completes.
How to Reproduce the Error
Trigger a controlled leader move and watch a running client:
- Start a producer writing steadily to a multi-partition topic with replication factor 3.
- Note the current leader of one partition with kafka-topics.sh --describe.
- Restart that leader broker (or run a preferred-leader election from an admin host).
- The leader migrates to a follower. The producer logs NOT_LEADER_OR_FOLLOWER, refreshes metadata, and resumes within a second.
To reproduce the persistent (non-self-healing) variant, misconfigure advertised.listeners so the broker advertises an address the client host cannot reach, then move the leader. The client refreshes but cannot connect to the new leader, so the errors continue.
Diagnostic Commands
Confirm the current leader for the affected partition:
kafka-topics.sh --bootstrap-server broker1:9092 \
--describe --topic payments
Topic: payments Partition: 7 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1
A healthy Leader and full Isr here mean the cluster side is fine and the problem is client routing or timing. Check consumer group assignment and lag to see whether the client recovered:
kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
--describe --group ledger
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
ledger payments 7 1048221 1048221 0 ledger-1-abc
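The leader and ISR check on the kafka-topics.sh --describe output above can be scripted when many partitions need review. The sketch below assumes partition lines shaped like the sample in this guide (`Leader:`, `Replicas:`, `Isr:` fields); `classify_partition` and its three labels are hypothetical names chosen for illustration, and real output may carry extra fields depending on the Kafka version.

```python
import re


def classify_partition(line):
    """Classify one partition line from `kafka-topics.sh --describe` output."""
    # Collect "Name: value" pairs such as "Leader: 2" or "Isr: 2,3,1".
    fields = dict(re.findall(r"(\w+):\s*(\S+)", line))
    leader = fields.get("Leader", "none")
    replicas = set(fields.get("Replicas", "").split(",")) - {""}
    isr = set(fields.get("Isr", "").split(",")) - {""}

    if leader.lower() in ("none", "-1"):
        # No leader at all: retries cannot help, this is not the self-healing case.
        return "no-leader"
    if isr != replicas:
        # A leader exists, but some replicas have fallen out of sync.
        return "under-replicated"
    # Leader present and full ISR: look at client routing or timing instead.
    return "healthy"


sample = "Topic: payments Partition: 7 Leader: 2 Replicas: 2,3,1 Isr: 2,3,1"
print(classify_partition(sample))
```

A "healthy" result matches the case described above, where the cluster side is fine and attention should move to the client.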
Grep the client logs to see whether the errors are clearing or repeating:
grep -E "NOT_LEADER_OR_FOLLOWER|refresh metadata|rediscover the leader" \
/var/log/payments-svc/app.log | tail -n 20
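Whether the errors are clearing or repeating can also be decided mechanically from the timestamps. The following is a rough sketch that assumes log lines start with a `[YYYY-MM-DD HH:MM:SS,mmm]` timestamp as in the samples in this guide; the function names and the 30-second quiet period are assumptions for illustration, not Kafka settings.

```python
import re
from datetime import datetime, timedelta

TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\]")
ERROR = re.compile(r"NOT_LEADER_OR_FOLLOWER|NotLeaderOrFollowerException")


def error_times(lines):
    """Return the timestamps of all lines that mention the error."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and ERROR.search(line):
            base = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            times.append(base + timedelta(milliseconds=int(match.group(2))))
    return times


def classify(lines, now, quiet=timedelta(seconds=30)):
    """Label the log 'cleared' if no error was seen within the quiet period."""
    times = error_times(lines)
    if not times:
        return "no-errors"
    return "cleared" if now - max(times) >= quiet else "still-occurring"


sample = [
    "[2026-06-24 09:41:22,119] WARN [Producer clientId=payments-svc] Got error "
    "produce response with correlation id 5521 on topic-partition payments-7, "
    "retrying (1 attempts left). Error: NOT_LEADER_OR_FOLLOWER",
    "[2026-06-24 09:41:25,004] INFO [Consumer clientId=ledger-1 groupId=ledger] "
    "Error while fetching from partition payments-7: NOT_LEADER_OR_FOLLOWER. "
    "Will rediscover the leader",
]
print(classify(sample, now=datetime(2026, 6, 24, 9, 42, 0)))
```

The quiet period should be longer than a normal leader move in your environment; errors that keep landing inside it point to one of the persistent causes.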
Verify what each broker advertises and confirm reachability from the client host:
grep -E "^advertised.listeners|^listeners" /etc/kafka/server.properties
advertised.listeners=PLAINTEXT://broker2.internal:9092
ss -tnp | grep 9092
journalctl -u kafka --since "10 min ago" | grep -Ei "leader|reassign|restart"
Resolve the advertised hostname from the client side to catch stale DNS, then confirm a TCP path to the new leader’s port using ss after a connection attempt. If the advertised name does not resolve or the port is unreachable from the client, you have found a persistent cause.
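The resolve-then-connect check described above can be run from the client host with a few lines of standard-library Python. This is a sketch under assumptions: `check_endpoint` is a hypothetical helper name, the 3-second timeout is arbitrary, and a successful TCP connect only proves the port is reachable, not that the listener accepts the client's security protocol.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Resolve host, then attempt a TCP connection to host:port."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # The advertised name does not resolve from this host.
        return {"resolved": [], "reachable": False, "error": f"DNS: {exc}"}

    # Compare these addresses against the broker's real IP to spot stale DNS.
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return {"resolved": addresses, "reachable": True, "error": None}
    except OSError as exc:
        # Name resolves, but the port is blocked, closed, or unroutable.
        return {"resolved": addresses, "reachable": False, "error": f"TCP: {exc}"}


# Example call, using the advertised address from this guide:
#   check_endpoint("broker2.internal", 9092)
```

A `DNS:` error points at name resolution, while a `TCP:` error with a resolved address points at routing, firewalls, or a wrong advertised.listeners value.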
Step-by-Step Resolution
A worked example for payments-7 after a broker restart:
- Confirm a leader exists. kafka-topics.sh --describe --topic payments shows Leader: 2, full ISR. The cluster is healthy, so this is a client-side timing or routing issue.
- Check whether it is self-healing. Tail the client log. If NOT_LEADER_OR_FOLLOWER is followed by refresh metadata and then successful produces, the condition has already cleared. No action needed beyond confirming lag returned to normal via kafka-consumer-groups.sh --describe.
- If errors persist, test reachability to the new leader. From the client host, resolve broker2.internal and check the route to port 9092. A failure points to advertised.listeners or DNS.
- Fix advertised.listeners. If brokers advertise an unroutable address, set advertised.listeners to a name or IP the clients can actually reach, then restart the affected brokers. This is the most common cause of a stuck, non-healing version of this error.
- Fix stale DNS. If the bootstrap or broker hostnames resolve to old IPs, update or flush DNS so clients reach the live endpoints.
- Right-size client retry budget. Ensure retries and delivery.timeout.ms on producers are high enough that the refresh-and-retry cycle completes before the request is failed. The defaults (delivery.timeout.ms=120000) are usually sufficient; do not lower them aggressively.
- Re-verify. Watch the client log; errors should stop and lag should hold at zero.
Lowering metadata.max.age.ms is rarely the right fix, because the error itself already forces an immediate refresh. Reach for it only if you have a specific reason clients are not refreshing on error.
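The retry-budget step above can be sanity-checked before deploying a producer configuration. The sketch below is illustrative only: `review_producer_config` is a hypothetical helper, and the default values in `DEFAULTS` are assumptions based on commonly documented producer defaults that can differ between client versions, so confirm them against your own client. The one rule it encodes beyond the guide's advice is that the producer expects delivery.timeout.ms to be at least linger.ms plus request.timeout.ms.

```python
# Assumed producer defaults; verify against the documentation for your client version.
DEFAULTS = {
    "retries": 2147483647,
    "delivery.timeout.ms": 120000,
    "request.timeout.ms": 30000,
    "linger.ms": 0,
}


def review_producer_config(overrides):
    """Return warnings for settings that shrink the refresh-and-retry budget."""
    cfg = {**DEFAULTS, **overrides}
    warnings = []
    if cfg["retries"] == 0:
        warnings.append("retries=0: a single NOT_LEADER_OR_FOLLOWER fails the send")
    if cfg["delivery.timeout.ms"] < cfg["linger.ms"] + cfg["request.timeout.ms"]:
        warnings.append("delivery.timeout.ms is below linger.ms + request.timeout.ms")
    if cfg["delivery.timeout.ms"] < DEFAULTS["delivery.timeout.ms"]:
        warnings.append("delivery.timeout.ms lowered below the 120000 ms default")
    return warnings


for warning in review_producer_config({"delivery.timeout.ms": 10000}):
    print(warning)
```

An empty list means none of these particular reductions were found; it does not prove the budget is adequate for every leader move.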
Prevention and Best Practices
- Treat NOT_LEADER_OR_FOLLOWER during planned restarts, reassignments, and preferred elections as expected and transient; alert only when it persists.
- Keep producer retries and delivery.timeout.ms at sane, generous values so brief leader moves never surface as application errors.
- Set advertised.listeners to addresses every client network can route to, and validate it after any infrastructure change.
- Use stable DNS names rather than hard-coded IPs for bootstrap and broker addresses.
- Roll broker restarts one at a time and use rack-aware placement so leader moves are smooth and bounded.
- Surface client-side error rates alongside cluster metrics, for example via /dashboard/incident-response/, so a persistent version stands out from normal churn.
Related Errors
- LEADER_NOT_AVAILABLE: Returned when no leader currently exists (often mid-election or during an offline partition), versus NOT_LEADER_OR_FOLLOWER where a leader exists but the client targeted the wrong broker.
- Offline partitions / OfflinePartitionsCount > 0: A genuinely leaderless state that retries alone cannot fix; distinct from this self-healing case.
- TimeoutException on producer send: Often the downstream result when retries or delivery timeout are too low to outlast the metadata refresh window.
- UnknownTopicOrPartitionException: Another metadata-related client error, usually from a topic not yet propagated rather than a moved leader.
More Kafka guides live at /categories/kafka/.
Frequently Asked Questions
Q: Is NOT_LEADER_OR_FOLLOWER something I need to fix?
Usually not. It is a retriable error that the client resolves automatically by refreshing metadata and retrying against the new leader. You only need to act if it persists beyond normal leader-move windows, which points to advertised.listeners, DNS, or retry settings.
Q: Should I lower metadata.max.age.ms to fix it?
Almost never. The error already forces an out-of-band metadata refresh immediately, so the five-minute default does not delay recovery during a leader move. Lower it only if you have evidence clients are not refreshing on error.
Q: Why do the errors keep coming back instead of clearing?
Persistent NOT_LEADER_OR_FOLLOWER means the client refreshes metadata but still cannot reach the new leader. The usual culprits are advertised.listeners pointing at an unroutable address, stale DNS, or a network path that blocks the new leader’s port from the client host.
Q: What is the difference between this and an offline partition?
With NOT_LEADER_OR_FOLLOWER a leader exists; the client just cached the wrong one. With an offline partition there is no leader at all, so kafka-topics.sh --describe shows Leader: none and retries cannot help until a replica is restored.