Kafka Error Guide: 'Partition reassignment for topic-0 failed' Stuck and Failed Reassignments
Why kafka-reassign-partitions.sh reports a reassignment as still in progress or failed, how to diagnose throttles, dead brokers, and disk, and how to recover.
- #kafka
- #troubleshooting
- #errors
- #partitions
Partition reassignment is how Kafka moves replicas between brokers to rebalance load, drain a node, or expand a cluster. When it stalls or fails, partitions can sit half-moved for hours, throttle bandwidth is wasted, and you cannot start a new plan because the old one is still registered. This guide explains what the failure means, why it happens, and how to recover using read-only diagnostics.
Exact Error Message
The most common signal comes from kafka-reassign-partitions.sh --verify, which reports each partition as completed, still in progress, or failed:
Status of partition reassignment:
Reassignment of partition orders-0 is still in progress.
Reassignment of partition orders-3 failed.
Reassignment of partition payments-1 completed successfully.
On the controller, the broker server.log and controller.log show the underlying problem:
[2026-06-29 14:02:11,883] WARN [Controller id=1] Partition reassignment for orders-0 failed: target replica 5 is not alive (kafka.controller.KafkaController)
[2026-06-29 14:02:11,901] ERROR [Controller id=1] Replica assignment failed for partition orders-3; new replicas [4,5,6] not all reachable (kafka.controller.KafkaController)
[2026-06-29 14:02:12,044] INFO [ReassignPartitionsCommand] There is an existing assignment running; cannot start a new reassignment until it completes (kafka.admin.ReassignPartitionsCommand)
The three phrases you will see most often are “is still in progress”, “failed”, and “There is an existing assignment running”.
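When --verify is run from a script, the per-partition lines can be bucketed by the phrases above. The following is a minimal sketch, assuming the output wording matches the sample shown here; other Kafka versions may phrase these lines differently, and the function name summarize_verify is illustrative only.

```python
import re

# Patterns mirror the sample --verify output in this guide (assumption:
# your Kafka version prints the same wording).
_STATUS_PATTERNS = {
    "in_progress": re.compile(r"^Reassignment of partition (\S+) is still in progress\.?$"),
    "failed": re.compile(r"^Reassignment of partition (\S+) failed\.?$"),
    "completed": re.compile(r"^Reassignment of partition (\S+) completed successfully\.?$"),
}


def summarize_verify(output: str) -> dict:
    """Group partition names by the state reported on each --verify line."""
    summary = {state: [] for state in _STATUS_PATTERNS}
    for raw_line in output.splitlines():
        line = raw_line.strip()
        for state, pattern in _STATUS_PATTERNS.items():
            match = pattern.match(line)
            if match:
                summary[state].append(match.group(1))
                break  # a line describes exactly one state
    return summary


sample = """Status of partition reassignment:
Reassignment of partition orders-0 is still in progress.
Reassignment of partition orders-3 failed.
Reassignment of partition payments-1 completed successfully."""

summary = summarize_verify(sample)
print(summary)
```

Lines that match none of the patterns, such as the header or throttle messages, are ignored rather than treated as errors.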
What the Error Means
A reassignment tells the controller to change the replica set for a partition. The controller adds the new replicas as followers, waits for them to catch up and join the in-sync replica set (ISR), then removes the old replicas. “Still in progress” means the new replicas have not yet caught up to the leader. “Failed” means the controller could not complete the move at all, usually because a target replica is unreachable or a log directory is offline. “There is an existing assignment running” means a previous plan is still registered in cluster metadata, so a new plan is rejected until the old one finishes or is cancelled.
A reassignment is not atomic and has no built-in timeout. If replication never catches up, it stays in progress indefinitely.
Common Causes
- Target broker down or unreachable mid-move. A destination broker crashed or lost network connectivity after the plan started, so its replica can never fetch and join the ISR.
- Replication throttle set too low. A --throttle value that is smaller than the incoming write rate means the new replica falls further behind instead of catching up, so the move never completes.
- Insufficient disk on the target broker. The destination log directory fills up, replica creation fails, and the partition stalls.
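- An existing in-progress reassignment blocking a new one. By default the tool refuses to submit a plan while another is still registered, and reports “There is an existing assignment running”. Recent Kafka releases add an --additional flag to kafka-reassign-partitions.sh for submitting a plan alongside an active one; older releases allow only one at a time.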
- Invalid JSON or bad broker IDs in the plan. A malformed reassignment file, or broker IDs that do not exist, causes “Replica assignment failed”.
- Controller failover during reassignment. If the active controller changes while a move is underway, the new controller must rebuild state; transient failures can surface in --verify output.
- Log directory offline on the destination. With JBOD, a failed disk takes one log dir offline and any replica assigned there cannot be created.
How to Reproduce the Error
In a non-production cluster you can recreate the “still in progress” state reliably:
- Create a topic with replication factor 3 across brokers 1, 2, 3.
- Generate continuous load against the topic so the leader has a high write rate.
- Submit a reassignment that moves a partition to broker 5 with a very low throttle (for example a few KB/s).
- Run --verify: the new replica cannot keep up, so it never enters the ISR and the partition reports as still in progress.
To reproduce “failed”, submit a plan that targets a broker ID that is currently stopped. The controller logs “target replica is not alive” and --verify reports the partition as failed.
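The plan file passed to --reassignment-json-file is a small JSON document listing the target replica set for each partition. As a rough sketch, the plan for the reproduction above could be generated like this; the topic name, partition number, and broker IDs are the illustrative values from this section, and build_plan is a hypothetical helper, not part of Kafka.

```python
import json


def build_plan(topic: str, partition: int, replicas: list) -> str:
    """Return a single-partition reassignment plan as a JSON string."""
    plan = {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": replicas}
        ],
    }
    return json.dumps(plan, indent=2)


# Move orders-0 from brokers [1, 2, 3] to [1, 2, 5], as in the steps above.
plan_json = build_plan("orders", 0, [1, 2, 5])
print(plan_json)
```

Writing the printed text to plan.json gives a file in the shape the tool expects. To reproduce the "failed" case, the same helper can be pointed at a broker ID that is stopped.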
Diagnostic Commands
Start with the reassignment status itself. The --verify action reports per-partition state without moving data or altering the plan; its only side effect is clearing replication throttles left behind by a completed reassignment:
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
--reassignment-json-file plan.json --verify
Status of partition reassignment:
Reassignment of partition orders-0 is still in progress.
Reassignment of partition orders-3 failed.
Clearing broker-level throttles on brokers 1,2,3
Compare the current replica list against the ISR to see which new replicas have not caught up:
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders
Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,5 Isr: 1,2
Topic: orders Partition: 3 Leader: 2 Replicas: 4,5,6 Isr: 2
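The Replicas-versus-Isr comparison can also be scripted when a topic has many partitions. This is a sketch only, assuming the --describe lines follow the field order shown above; lagging_replicas is a hypothetical helper name.

```python
import re

# Matches the per-partition lines of `kafka-topics.sh --describe` as shown
# in this guide. Extra trailing fields, if any, are ignored.
_PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>\S+)\s+Replicas:\s*(?P<replicas>[\d,]*)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def _broker_ids(csv: str) -> list:
    """Turn '1,2,5' into [1, 2, 5]; an empty string gives an empty list."""
    return [int(item) for item in csv.split(",") if item]


def lagging_replicas(describe_output: str) -> dict:
    """Map 'topic-partition' to the replicas that are not in the ISR."""
    result = {}
    for line in describe_output.splitlines():
        match = _PARTITION_LINE.search(line)
        if not match:
            continue  # summary lines and blanks are skipped
        isr = set(_broker_ids(match["isr"]))
        missing = [b for b in _broker_ids(match["replicas"]) if b not in isr]
        if missing:
            result[f'{match["topic"]}-{match["partition"]}'] = missing
    return result


sample = """Topic: orders Partition: 0 Leader: 1 Replicas: 1,2,5 Isr: 1,2
Topic: orders Partition: 3 Leader: 2 Replicas: 4,5,6 Isr: 2"""

lagging = lagging_replicas(sample)
print(lagging)
```

Partitions whose replicas are all in sync are left out of the result, so an empty dictionary means nothing is lagging.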
Here partition 0 is moving to broker 5, but 5 is not yet in the ISR. Check disk usage on the target broker’s log directories:
kafka-log-dirs.sh --bootstrap-server localhost:9092 \
--broker-list 5 --describe
{"brokers":[{"broker":5,"logDirs":[{"logDir":"/var/lib/kafka/data",
"error":null,"partitions":[{"partition":"orders-0","size":0,"offsetLag":841233,"isFuture":true}]}]}]}
A large offsetLag confirms the replica is far behind. Inspect the controller’s logs for the root cause:
journalctl -u kafka --since "1 hour ago" | grep -iE "reassign|not alive|offline"
grep -iE "reassign|Replica assignment failed|existing assignment" \
/var/lib/kafka/logs/controller.log
Confirm the target broker is actually listening. Run this on the target broker’s host, since ss only inspects local sockets:
ss -ltnp | grep 9092
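The kafka-log-dirs.sh JSON shown earlier in this section is easier to scan once flattened into one row per replica. The sketch below assumes the field names from that sample (brokers, logDirs, partitions, offsetLag); replica_lags is a hypothetical helper.

```python
import json


def replica_lags(log_dirs_output: str) -> list:
    """Return (broker, log_dir, partition, offset_lag) rows, largest lag first."""
    # Skip any status text printed before the JSON document and tolerate
    # trailing text after it by decoding from the first opening brace.
    start = log_dirs_output.index("{")
    doc, _end = json.JSONDecoder().raw_decode(log_dirs_output[start:])
    rows = []
    for broker in doc.get("brokers", []):
        for log_dir in broker.get("logDirs", []):
            for part in log_dir.get("partitions", []):
                rows.append(
                    (broker["broker"], log_dir["logDir"],
                     part["partition"], part["offsetLag"])
                )
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Sample copied from the kafka-log-dirs.sh output above.
sample = '''{"brokers":[{"broker":5,"logDirs":[{"logDir":"/var/lib/kafka/data",
"error":null,"partitions":[{"partition":"orders-0","size":0,"offsetLag":841233,"isFuture":true}]}]}]}'''

rows = replica_lags(sample)
for row in rows:
    print(row)
```

Sorting by lag puts the replicas that are furthest behind at the top, which is usually where to look first.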
Step-by-Step Resolution
Consider a drain of broker 3 onto broker 5 that has been stuck in progress for an hour.
- Run --verify and confirm which partitions are still in progress versus failed. In this case orders-0 is still in progress.
- Describe the topic and note broker 5 is in the replica list but not the ISR, with a growing offset lag: the replica is falling behind, not catching up.
- Check the throttle. The plan was submitted with a 5 MB/s throttle while the topic ingests 20 MB/s. The follower can never close the gap. The fix is to raise the throttle well above the write rate (or remove it) so the replica can catch up; this is done by re-running the reassignment tool with a higher --throttle, described here rather than executed.
- Verify target disk with kafka-log-dirs.sh. If the destination is near full, free space or pick a different target broker before retrying.
- Confirm target broker health with ss and journalctl. If broker 5 was down, bring it back online; the controller resumes the move automatically once the replica can fetch.
- Wait and re-verify. With an adequate throttle and a healthy target, the replica catches up, joins the ISR, and --verify reports “completed successfully”, which also clears the throttle.
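The throttle check in the steps above comes down to simple arithmetic: the backlog only shrinks at the rate by which the throttle exceeds the write rate. The sketch below uses a deliberately simplified single-partition model. The 50 GB backlog and the 60 MB/s throttle are assumed values chosen for illustration, and catch_up_seconds is a hypothetical helper.

```python
from typing import Optional


def catch_up_seconds(backlog_bytes: float, throttle_bps: float,
                     write_bps: float) -> Optional[float]:
    """Estimate seconds for a new replica to close its backlog.

    Returns None when the throttle does not exceed the write rate,
    because the backlog then never shrinks.
    """
    net_rate = throttle_bps - write_bps
    if net_rate <= 0:
        return None
    return backlog_bytes / net_rate


MB = 1_000_000
backlog = 50_000 * MB  # assumed 50 GB still to copy

# 5 MB/s throttle against 20 MB/s of writes, as in the scenario above.
stuck = catch_up_seconds(backlog, 5 * MB, 20 * MB)
# Hypothetical raised throttle of 60 MB/s against the same write rate.
raised = catch_up_seconds(backlog, 60 * MB, 20 * MB)

print(stuck)   # None
print(raised)  # 1250.0
```

Under these assumptions the raised throttle leaves 40 MB/s of headroom, so the backlog clears in about 21 minutes. Real moves will differ because write rates fluctuate and several partitions usually share one throttle.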
For a genuinely failed plan that points at a dead or nonexistent broker, the move cannot proceed. You must cancel the registered reassignment before submitting a corrected plan; cancellation is a separate administrative step and is not shown here. Always let one reassignment finish or be cancelled before starting another to avoid “There is an existing assignment running”.
Prevention and Best Practices
Size the throttle above the topic’s peak write rate so replicas can always catch up, and remove the throttle once the move completes. Check target disk headroom with kafka-log-dirs.sh before submitting a plan, and confirm every broker ID in the plan is alive. Validate the JSON plan offline first. Move a small number of partitions at a time rather than rebalancing the whole cluster in one plan, and never start a second reassignment until --verify reports the first as complete. Keep the controller healthy and avoid rolling restarts during an active move.
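Validating the plan offline can be as simple as parsing the JSON and checking each broker ID against a list of brokers known to be up. This is one possible sketch, not a Kafka feature: validate_plan is a hypothetical helper, and the set of live broker IDs is an input you would have to supply from your own inventory or monitoring.

```python
import json


def validate_plan(plan_text: str, live_brokers: set) -> list:
    """Return a list of problems found in a plan; an empty list means none found."""
    try:
        plan = json.loads(plan_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(plan, dict) or not isinstance(plan.get("partitions"), list):
        return ["plan must be an object with a 'partitions' list"]

    problems = []
    for entry in plan["partitions"]:
        if not isinstance(entry, dict):
            problems.append(f"unexpected partition entry: {entry!r}")
            continue
        name = f"{entry.get('topic')}-{entry.get('partition')}"
        replicas = entry.get("replicas")
        if not isinstance(replicas, list) or not replicas:
            problems.append(f"{name}: missing or empty replicas list")
            continue
        if len(set(replicas)) != len(replicas):
            problems.append(f"{name}: duplicate broker IDs {replicas}")
        unknown = sorted(set(replicas) - live_brokers)
        if unknown:
            problems.append(f"{name}: brokers not in live set {unknown}")
    return problems


plan_text = '{"version":1,"partitions":[{"topic":"orders","partition":3,"replicas":[4,5,6]}]}'
# Assumed live set in which broker 5 is down.
problems = validate_plan(plan_text, {1, 2, 3, 4, 6})
print(problems)
```

An empty result only means this sketch found nothing wrong; it does not check disk headroom or whether the topic and partition exist.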
Related Errors
- Leader election failed / No leader for partition: when a move leaves no in-sync replica eligible to lead.
- NotEnoughReplicasException: producers with acks=all fail while replicas are out of the ISR during a move.
- UnknownTopicOrPartitionException: a plan references a topic or partition that no longer exists.
- KafkaStorageException: a log directory on the destination broker is offline.
Frequently Asked Questions
Q: Does kafka-reassign-partitions.sh --verify change anything?
No. --verify only reports per-partition status and clears throttles that were previously set by a completed reassignment. It does not move data or alter the plan, which makes it safe to run repeatedly while you diagnose.
Q: Why is my reassignment stuck “in progress” forever?
The new replica is not catching up. The usual causes are a throttle set below the topic’s write rate, a slow or full target disk, or a target broker that is down. Compare Replicas to Isr with --describe and watch offsetLag in kafka-log-dirs.sh.
Q: Can I start a new reassignment while one is running?
Not by default. The tool rejects a new plan with “There is an existing assignment running” while another is still registered. Wait for --verify to report completion, or cancel the existing reassignment first. Recent Kafka releases also accept an --additional flag that submits a plan alongside the active one.
Q: A partition shows “failed”. What now?
Read the controller log for the reason, usually “target replica is not alive” or an offline log dir. Fix the underlying problem (bring the broker up, free disk, correct the broker ID), cancel the failed plan, and resubmit a corrected one.