Kafka Error Guide: 'CoordinatorNotAvailableException' Group Coordinator Down
Fix Kafka CoordinatorNotAvailableException: resolve __consumer_offsets unavailability, coordinator load-in-progress, offline partitions, and under-replicated offsets topic.
Exact Error Message
A consumer or admin operation that cannot reach its group coordinator fails like this:
org.apache.kafka.common.errors.CoordinatorNotAvailableException: The coordinator is not available.
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$FindCoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:920)
The lead-up in the client log shows the consumer repeatedly trying to discover the coordinator:
[Consumer clientId=consumer-1, groupId=orders-service] Group coordinator broker-2:9092 (id: 2147483645) is unavailable or invalid due to cause: coordinator unavailable. Rediscovery will be attempted.
[Consumer clientId=consumer-1, groupId=orders-service] FindCoordinator request failed: COORDINATOR_NOT_AVAILABLE
[Consumer clientId=consumer-1, groupId=orders-service] Coordinator unavailable; discovering new coordinator
You may also see the closely related COORDINATOR_LOAD_IN_PROGRESS while a coordinator is loading offsets.
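The large id in the first log line (2147483645) is not a broker id. The Java client appears to label its dedicated coordinator connection as Integer.MAX_VALUE minus the broker's node id, so the real broker can be recovered by subtraction. A minimal sketch, assuming that convention; the function name is illustrative only:

```python
import re
from typing import Optional

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

def coordinator_broker_id(log_line: str) -> Optional[int]:
    """Recover the broker id from a 'Group coordinator ... (id: N ...)' log line."""
    match = re.search(r"\(id: (\d+)", log_line)
    if match is None:
        return None  # line does not carry a coordinator id
    return INT_MAX - int(match.group(1))

line = ("Group coordinator broker-2:9092 (id: 2147483645) is unavailable "
        "or invalid due to cause: coordinator unavailable.")
print(coordinator_broker_id(line))  # 2
```

Here the result, 2, matches the broker-2 host name in the log line, which tells you which broker to inspect first.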
What the Error Means
Every consumer group is managed by a group coordinator — a specific broker that owns the __consumer_offsets partition for that group (the partition is chosen by hashing the group id). The coordinator handles join/sync (rebalancing), heartbeats, and offset commits. To use a group, a client first sends a FindCoordinator request to learn which broker is the coordinator, then talks to that broker.
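To see which offsets partition a group maps to, the hashing step can be approximated outside the broker. The sketch below assumes the broker computes a non-negative Java String.hashCode() of the group id modulo the offsets topic partition count (50 by default, per offsets.topic.num.partitions); the function names are illustrative and not part of any Kafka API:

```python
def java_string_hashcode(s: str) -> int:
    """Mimic Java's String.hashCode(): a base-31 rolling hash over UTF-16
    code units, wrapped to a signed 32-bit integer."""
    h = 0
    data = s.encode("utf-16-be")
    for i in range(0, len(data), 2):
        code_unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + code_unit) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def kafka_abs(n: int) -> int:
    """Absolute value that maps Integer.MIN_VALUE to 0 instead of overflowing."""
    return 0 if n == -2**31 else abs(n)

def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Partition of __consumer_offsets assumed to own this group."""
    return kafka_abs(java_string_hashcode(group_id)) % num_partitions

print(offsets_partition_for("orders-service"))
```

The leader of the printed partition is the broker you would expect to act as coordinator for orders-service, so that is the partition to look for in the topic description.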
CoordinatorNotAvailableException means the broker that should be the coordinator for this group is not currently able to serve that role. The most common reasons: the __consumer_offsets partition for the group has no available leader, the coordinator broker is restarting or has just taken over and is still loading offsets into memory (COORDINATOR_LOAD_IN_PROGRESS), or the offsets topic is under-replicated/offline. It is usually transient and resolves once the partition has a healthy leader and finishes loading — but a persistent occurrence points at a real availability problem with __consumer_offsets.
Common Causes
- __consumer_offsets partition offline or leaderless: The partition that maps to the group has no in-sync leader, so no broker can act as coordinator.
- Coordinator broker restarting: During a rolling restart or crash recovery, the new coordinator loads offsets before serving; clients see COORDINATOR_LOAD_IN_PROGRESS then transient unavailability.
- Under-replicated offsets topic: __consumer_offsets has too few in-sync replicas (e.g., after broker loss), preventing leadership/availability.
- Offsets topic misconfigured at first start: offsets.topic.replication.factor set higher than available brokers means the topic never fully creates, so coordination never works.
- Broker overload: A coordinator broker under heavy load is slow to respond to FindCoordinator, surfacing as intermittent unavailability.
- Network partition: The client can reach bootstrap but not the specific coordinator broker.
How to Reproduce the Error
On a single-broker test cluster, set the offsets topic to require more replicas than exist, then start a consumer:
# server.properties on a 1-broker cluster
offsets.topic.replication.factor=3
offsets.topic.num.partitions=50
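With only one broker, __consumer_offsets cannot be created with replication factor 3 (the setting is applied when the topic is first created), so its partitions never become available and every consumer's FindCoordinator request returns COORDINATOR_NOT_AVAILABLE. To reproduce the transient variant on a healthy multi-broker cluster, restart the broker that leads the group's offsets partition while a consumer is active: clients briefly see COORDINATOR_LOAD_IN_PROGRESS and coordinator unavailability until the new coordinator finishes loading.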
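The misconfiguration above can be caught before a broker is started. A minimal sketch, assuming you have the server.properties text and know how many brokers will be live; the helper names are hypothetical, and the fallback of 3 reflects Kafka's documented default for this setting:

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props

def offsets_topic_can_be_created(properties_text: str, live_brokers: int) -> bool:
    """True when the configured replication factor fits the broker count."""
    props = parse_properties(properties_text)
    # Falls back to 3 when the property is absent.
    factor = int(props.get("offsets.topic.replication.factor", "3"))
    return factor <= live_brokers

config = """
# server.properties on a 1-broker cluster
offsets.topic.replication.factor=3
offsets.topic.num.partitions=50
"""
print(offsets_topic_can_be_created(config, live_brokers=1))  # False
print(offsets_topic_can_be_created(config, live_brokers=3))  # True
```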
Diagnostic Commands
Look for coordinator-load and availability events on the brokers:
grep -nE "COORDINATOR_NOT_AVAILABLE|COORDINATOR_LOAD_IN_PROGRESS|Loading group metadata|Finished loading offsets" /var/log/kafka/server.log | tail -40
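Inspect the health of the internal offsets topic, which is the key check. The filter prints only partitions without a leader (shown as "none" or "-1" depending on the tool version), so empty output means every offsets partition has a leader:
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic __consumer_offsets | grep -E "Leader: (none|-1)" | head -30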
List under-replicated partitions across the cluster (includes __consumer_offsets):
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions
Check the group’s state and which broker is its coordinator:
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group orders-service --state
Confirm all expected brokers are alive and reachable:
kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | grep -E "^[^ ]+:[0-9]+ \(id: "
In KRaft mode, confirm the metadata quorum is healthy (an unstable controller delays leadership for offsets partitions):
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
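For scripted checks, the status output can be reduced to a yes/no answer. This is a rough sketch that assumes the output consists of "Key: value" lines including LeaderId and MaxFollowerLag; the sample values and the lag threshold are arbitrary assumptions, not recommendations:

```python
def quorum_looks_healthy(status_output: str, max_lag: int = 100) -> bool:
    """Return True when a leader is reported and follower lag is small."""
    fields = {}
    for line in status_output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    try:
        leader_id = int(fields["LeaderId"])
        lag = int(fields["MaxFollowerLag"])
    except (KeyError, ValueError):
        return False  # unexpected format: treat as unhealthy and inspect by hand
    return leader_id >= 0 and lag <= max_lag

sample_status = """\
ClusterId:              example-cluster-id
LeaderId:               3002
LeaderEpoch:            2
HighWatermark:          10
MaxFollowerLag:         0
MaxFollowerLagTimeMs:   -1
CurrentVoters:          [3000,3001,3002]
CurrentObservers:       [0,1,2]
"""
print(quorum_looks_healthy(sample_status))  # True
```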
Step-by-Step Resolution
- Check if it’s transient. During a restart or election the error self-heals within seconds once the coordinator finishes loading. If clients recover on retry, no action is needed beyond confirming the restart completed.
- Inspect __consumer_offsets. If any partition shows Leader: -1 or a shrunken ISR, that is why no coordinator is available. Bring the responsible broker(s) back so the offsets partitions regain a leader.
- Fix under-replication. If the offsets topic is under-replicated after broker loss, restore the missing brokers; leadership and ISR recovery makes the coordinator available again.
- Fix first-start misconfiguration. If offsets.topic.replication.factor exceeds your broker count, reduce it to a value the cluster can satisfy so the topic creates healthily.
- Relieve coordinator overload. If the coordinator broker is saturated, reduce load or rebalance partitions so it can answer FindCoordinator promptly.
- Verify network path. Ensure clients can reach the specific coordinator broker, not just bootstrap.
- Confirm recovery with kafka-consumer-groups.sh --describe --state showing the group Stable and committing offsets again.
Prevention and Best Practices
- Set offsets.topic.replication.factor to at least 3 in production (and never above your broker count) so the coordinator survives a broker loss.
- Monitor under-replicated partitions, and alert specifically when __consumer_offsets is affected, since it impacts every consumer group.
- Pace rolling restarts so only one broker is down at a time and offsets partitions always retain a leader.
- Keep the controller/metadata quorum healthy; slow leadership election delays coordinator availability after restarts.
- Avoid overloading brokers that host many __consumer_offsets partitions; spread leadership evenly.
- For transient blips, ensure clients use sane retry/backoff so brief unavailability during elections does not surface as application errors.
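The retry/backoff advice can be sketched for application-level calls that sit outside the client's own retry loop (the Java consumer already retries coordinator discovery internally, paced by retry.backoff.ms). The exception class below is a hypothetical stand-in for whatever retriable coordinator error your client library raises, and the delays are illustrative:

```python
import random
import time

class CoordinatorUnavailable(Exception):
    """Hypothetical stand-in for a client's retriable coordinator error."""

def call_with_backoff(operation, attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Run operation(), retrying coordinator errors with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except CoordinatorUnavailable:
            if attempt == attempts - 1:
                raise  # still failing: surface it so the outage gets investigated
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids synchronized retries

calls = {"count": 0}

def flaky_commit():
    """Fails twice, then succeeds, like a commit during a leader election."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise CoordinatorUnavailable("The coordinator is not available.")
    return "committed"

print(call_with_backoff(flaky_commit, base_delay=0.01))  # committed
```

Capping the number of attempts matters: a short blip is absorbed silently, while a persistent failure still propagates and points at the offsets topic.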
Related Errors
- COORDINATOR_LOAD_IN_PROGRESS / CoordinatorLoadInProgressException: the coordinator is loading offsets; retry shortly.
- NotCoordinatorException: the client contacted a broker that is no longer the coordinator (rediscovery needed).
- session timeout expired / member left group: a related group-stability failure.
- Failed to update metadata: broader metadata-time failure that can accompany coordinator discovery problems.
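The three coordinator-related errors differ mainly in what the client should do next. The sketch below maps the numeric codes from the Kafka protocol error table to a reaction; the Reaction enum and function name are illustrative, and the mapping is a simplification of what real clients do:

```python
from enum import Enum

class Reaction(Enum):
    RETRY_SAME_COORDINATOR = "back off and retry the same broker"
    REDISCOVER_COORDINATOR = "send FindCoordinator again, then retry"
    NOT_COORDINATOR_RELATED = "handle as a different kind of error"

# Error codes as listed in the Kafka protocol documentation.
COORDINATOR_ERRORS = {
    14: ("COORDINATOR_LOAD_IN_PROGRESS", Reaction.RETRY_SAME_COORDINATOR),
    15: ("COORDINATOR_NOT_AVAILABLE", Reaction.REDISCOVER_COORDINATOR),
    16: ("NOT_COORDINATOR", Reaction.REDISCOVER_COORDINATOR),
}

def reaction_for(error_code: int) -> Reaction:
    """Pick a client reaction for a protocol error code."""
    _name, reaction = COORDINATOR_ERRORS.get(
        error_code, ("UNKNOWN", Reaction.NOT_COORDINATOR_RELATED))
    return reaction

for code in (14, 15, 16):
    name, reaction = COORDINATOR_ERRORS[code]
    print(f"{code} {name}: {reaction.value}")
```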
Frequently Asked Questions
Is CoordinatorNotAvailableException always a serious problem? Often not. During elections, restarts, and offsets loading it is expected and transient; well-behaved clients retry and recover within seconds. It becomes serious when it persists, which signals a genuinely unavailable or under-replicated __consumer_offsets topic.
Why is __consumer_offsets so important here? That internal topic stores committed offsets and group metadata, and its partitions determine which broker is each group’s coordinator. If a group’s offsets partition has no leader, no broker can coordinate that group, producing this error.
What’s the difference from COORDINATOR_LOAD_IN_PROGRESS? COORDINATOR_LOAD_IN_PROGRESS means the correct coordinator is known but still reading offsets into memory — retry shortly. COORDINATOR_NOT_AVAILABLE means no coordinator can currently be designated, usually because the offsets partition lacks a healthy leader.
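I set offsets.topic.replication.factor=3 on a 1-broker cluster and consumers can’t start. Why? The offsets topic can’t be created with replication factor 3 on one broker, so its partitions never become healthy and no coordinator is available. Set the factor to a value your broker count can satisfy. The setting is only applied when the topic is first created, so after adding brokers you raise the replication of __consumer_offsets with a partition reassignment rather than by editing the property.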
Should my application crash on this error? No. Treat it as retryable. Configure reasonable retry/backoff so consumers ride out transient coordinator unavailability during normal cluster events instead of failing the application.