Kafka Partition Rebalancing Strategies

There are two completely different things people mean by “Kafka rebalancing,” and conflating them causes endless confusion. One is partition reassignment: moving partition replicas between brokers to balance load, drain a node, or expand a cluster. The other is consumer group rebalancing: redistributing partitions among the consumers in a group when membership changes. Both can hurt you if handled carelessly — an unthrottled reassignment can saturate your network and take down production, and a stop-the-world consumer rebalance can stall processing for seconds on every deploy. This guide covers strategies for both: reassignment with throttles, automated balancing with Cruise Control, and cooperative rebalancing for consumer groups, all on Kafka 3.x.

Partition reassignment: moving replicas between brokers

Kafka does not automatically move existing partitions when you add a broker. A fresh broker sits idle until you explicitly reassign partitions onto it. The tool for this is kafka-reassign-partitions, and the workflow has three phases: generate a plan, execute it, and verify.

First, describe which topics you want to rebalance in a JSON file:

{
  "version": 1,
  "topics": [
    {"topic": "orders"},
    {"topic": "events"}
  ]
}

Generate a proposed reassignment across the target broker set:

kafka-reassign-partitions --bootstrap-server broker0:9092 \
  --topics-to-move-json-file topics.json \
  --broker-list "0,1,2,3" \
  --generate

This prints a current assignment (save it — it is your rollback) and a proposed assignment. Save the proposed plan to reassignment.json, review it, then execute:

kafka-reassign-partitions --bootstrap-server broker0:9092 \
  --reassignment-json-file reassignment.json \
  --execute

Reassignment is asynchronous. The brokers begin replicating partition data to their new homes in the background while continuing to serve traffic. Check progress:

kafka-reassign-partitions --bootstrap-server broker0:9092 \
  --reassignment-json-file reassignment.json \
  --verify

Pro Tip: Always save the “Current partition replica assignment” that --generate prints before you execute anything. It is your only clean rollback path. If a reassignment goes wrong mid-flight, you feed that saved JSON back through --execute to put replicas back where they started.
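One way to make that habit automatic is to capture the --generate output and split it into two files before anything is executed. The sketch below assumes the tool prints exactly two JSON documents, each on a single line, current assignment first and proposed second; the function names and the rollback.json file name are made up for illustration.

```python
import json


def split_generate_output(text):
    """Return (current, proposed) assignments parsed from --generate output.

    Assumption: each assignment is a JSON document on a single line,
    with the current assignment printed before the proposed one.
    """
    plans = [json.loads(line) for line in text.splitlines()
             if line.strip().startswith("{")]
    if len(plans) != 2:
        raise ValueError(f"expected 2 JSON documents, found {len(plans)}")
    return plans[0], plans[1]


def save_plans(text, rollback_path="rollback.json", plan_path="reassignment.json"):
    """Write the rollback and the proposed plan to separate files."""
    current, proposed = split_generate_output(text)
    for path, plan in ((rollback_path, current), (plan_path, proposed)):
        with open(path, "w") as f:
            json.dump(plan, f)
    return current, proposed


# Hand-written sample in the shape the tool prints (not captured from a real cluster).
sample = """Current partition replica assignment
{"version":1,"partitions":[{"topic":"orders","partition":0,"replicas":[0,1]}]}

Proposed partition reassignment configuration
{"version":1,"partitions":[{"topic":"orders","partition":0,"replicas":[2,3]}]}
"""
current, proposed = split_generate_output(sample)
print(current["partitions"][0]["replicas"])   # [0, 1]
print(proposed["partitions"][0]["replicas"])  # [2, 3]
```

Refusing to continue unless exactly two documents are found is deliberate: if the output format differs from the assumption, it is better to stop than to execute a plan with no rollback saved.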

Throttling: the setting that prevents self-inflicted outages

Here is the mistake that takes down clusters. You execute a large reassignment, and the brokers replicate the partition data as fast as the network allows. That replication competes directly with the produce and fetch traffic your applications depend on. Suddenly your p99 latency triples, consumers fall behind, and you have caused an incident by trying to balance the cluster.

The fix is a throttle. The --throttle flag caps replication bandwidth in bytes per second per broker:

kafka-reassign-partitions --bootstrap-server broker0:9092 \
  --reassignment-json-file reassignment.json \
  --execute --throttle 50000000

That limits reassignment replication to 50 MB/s per broker, leaving the rest of the network for live traffic. The right number depends on your hardware and headroom; a common approach is to start conservative, watch the impact on request latency, and raise it if there is slack.

A few throttle mechanics worth knowing:

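The throttle is applied as a dynamic config on the brokers, so you can adjust it mid-reassignment without restarting anything. On Kafka 3.x, rerun --execute with the same JSON file plus --additional and --throttle <new-value>; without --additional the tool refuses to run while a reassignment is already in progress.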
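When --verify reports the reassignment complete, it also removes the throttle. If the reassignment ends any other way (for example, it is cancelled, or nobody ever runs --verify), the throttle config can linger and silently cap normal replication. Always run --verify to clean up, or manually remove the broker-level leader.replication.throttled.rate and follower.replication.throttled.rate configs, along with the topic-level leader.replication.throttled.replicas and follower.replication.throttled.replicas lists.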
Throttle too aggressively and a large reassignment can take days, during which the cluster stays unbalanced and any partitions that were already under-replicated stay that way. Balance speed against impact rather than just minimizing impact.
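To see how quickly "conservative" turns into "days", a back-of-the-envelope estimate helps. The sketch below is a lower bound only: it assumes the data is spread evenly over the receiving brokers and that each one replicates at the full throttle rate the whole time, which real clusters rarely manage. The function name and the 2 TB figure are illustrative.

```python
def estimate_reassignment_hours(bytes_to_move, throttle_bytes_per_sec, receiving_brokers=1):
    """Best-case duration, in hours, of a throttled reassignment.

    Assumes data is spread evenly over `receiving_brokers` and that every
    one of them sustains the full throttle rate from start to finish.
    """
    if throttle_bytes_per_sec <= 0 or receiving_brokers <= 0:
        raise ValueError("throttle and broker count must be positive")
    seconds = bytes_to_move / (throttle_bytes_per_sec * receiving_brokers)
    return seconds / 3600


two_tb = 2 * 10**12

# Filling one new broker with 2 TB at the 50 MB/s throttle used above.
print(f"{estimate_reassignment_hours(two_tb, 50_000_000):.1f} h")  # 11.1 h

# The same move at 5 MB/s takes ten times longer, well over four days.
print(f"{estimate_reassignment_hours(two_tb, 5_000_000) / 24:.1f} days")  # 4.6 days
```

Running the numbers before executing gives you a sanity check on the throttle: if the best case is already longer than you can tolerate, the throttle is too low for the amount of data in the plan.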

Pro Tip: A forgotten throttle is a classic cause of “why is replication so slow” mysteries weeks later. After any reassignment, confirm with kafka-configs --describe --entity-type brokers --entity-name 0 that no *.replication.throttled.rate config remains. If it is there and you are not actively reassigning, remove it.
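That check is easy to script. The sketch below scans the text output of kafka-configs --describe for the two throttle rate keys. It matches on the config names only, so it does not depend on the exact layout of the output; the sample text is hand-written to resemble that output and is an assumption, not a capture from a real broker.

```python
import re

# Matches the two broker-level throttle rate configs named in this article.
THROTTLE_KEY = re.compile(r"\b(?:leader|follower)\.replication\.throttled\.rate\b")


def find_throttle_configs(describe_output):
    """Return the sorted, de-duplicated throttle config names found in the text."""
    return sorted({m.group(0) for m in THROTTLE_KEY.finditer(describe_output)})


# Hand-written sample; the real output layout may differ between versions.
sample = (
    "Dynamic configs for broker 0 are:\n"
    "  leader.replication.throttled.rate=50000000 sensitive=false "
    "synonyms={DYNAMIC_BROKER_CONFIG:leader.replication.throttled.rate=50000000}\n"
    "  follower.replication.throttled.rate=50000000 sensitive=false "
    "synonyms={DYNAMIC_BROKER_CONFIG:follower.replication.throttled.rate=50000000}\n"
)
print(find_throttle_configs(sample))
# ['follower.replication.throttled.rate', 'leader.replication.throttled.rate']
```

Run per broker from a cron job or a post-reassignment checklist, a non-empty result when no reassignment is active is the signal to clean up.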

Cruise Control: automated, continuous balancing

Doing reassignments by hand works for occasional cluster changes. It does not scale to a busy cluster where load shifts constantly and brokers fail. LinkedIn’s Cruise Control is the standard answer: it continuously monitors broker load and generates optimized, goal-driven reassignment plans automatically.

Cruise Control models the cluster against a prioritized list of goals — for example, replica distribution, disk usage, network inbound and outbound rate, CPU, and leader distribution. It computes a proposal that satisfies the hard goals and optimizes the soft goals, then executes the moves with built-in throttling.
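To make "replica distribution" and "leader distribution" concrete, here is a toy measurement over the same JSON layout that kafka-reassign-partitions uses. This is not Cruise Control's algorithm or code, only a sketch of the kind of quantity such goals look at; the function names are invented, and it assumes the first replica in each list is the preferred leader.

```python
from collections import Counter


def broker_counts(assignment):
    """Count replicas and preferred leaders per broker.

    Assumes the reassignment JSON layout: a "partitions" list whose entries
    carry a "replicas" list, with the preferred leader listed first.
    """
    replicas, leaders = Counter(), Counter()
    for partition in assignment["partitions"]:
        replicas.update(partition["replicas"])
        if partition["replicas"]:
            leaders[partition["replicas"][0]] += 1
    return replicas, leaders


def spread(counts, brokers):
    """Difference between the most and least loaded broker (0 means even)."""
    values = [counts.get(broker, 0) for broker in brokers]
    return max(values) - min(values)


# Four partitions on brokers 0-2; broker 3 was just added and holds nothing.
plan = {"version": 1, "partitions": [
    {"topic": "orders", "partition": 0, "replicas": [0, 1]},
    {"topic": "orders", "partition": 1, "replicas": [1, 2]},
    {"topic": "orders", "partition": 2, "replicas": [2, 0]},
    {"topic": "orders", "partition": 3, "replicas": [0, 1]},
]}
replicas, leaders = broker_counts(plan)
print(spread(replicas, [0, 1, 2, 3]))  # 3
print(spread(leaders, [0, 1, 2, 3]))   # 2
```

Counting replicas is the crudest possible load model. The real tool weighs disk, network, and CPU per replica, which is why it needs its own metrics reporter.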

What it gives you over manual reassignment:

Self-healing. When a broker dies, Cruise Control can automatically reassign its partitions to restore replication, rather than waiting for a human.
Goal-based optimization. Instead of you hand-crafting JSON, you declare what “balanced” means and it figures out the moves.
Anomaly detection. It watches for broker failures, goal violations, slow brokers, and disk pressure, and can act on them.
Safe execution. It paces and throttles reassignments to limit production impact, the same concern you handle manually with --throttle.

The trade-off is operational complexity. Cruise Control is another service to run, with its own metrics reporter deployed on each broker, its own load model that needs time to warm up, and goal configuration that rewards careful thought. For a small cluster that changes rarely, manual reassignment with throttles is simpler. For a large, dynamic cluster, Cruise Control pays for itself quickly by removing a whole class of manual toil and reducing the time a failed broker leaves you under-replicated.

Pro Tip: Let Cruise Control run in dry-run mode first. It will propose plans without executing them, so you can see what it wants to do and confirm its goals match your intent before you let it move real data. The load model also needs a warm-up window to gather enough metrics — do not judge its proposals in the first hour.

Consumer group rebalancing: eager vs. cooperative

The other meaning of rebalancing happens entirely on the client side. When a consumer joins or leaves a group, or partitions are added to a subscribed topic, the group must redistribute partitions among its current members. How that redistribution happens has changed significantly, and the difference matters for every deploy.

The problem with eager rebalancing

The original protocol uses eager rebalancing, often called stop-the-world. When the group rebalances, every consumer revokes all of its partitions, then the group reassigns everything from scratch. During that window, no consumer in the group processes any records. For a routine deploy that rolls consumers one at a time, this means a processing stall on every single restart — and on a large group, those stalls add up to real downtime.

Cooperative rebalancing

Cooperative (incremental) rebalancing fixes this. Instead of revoking everything, it computes the difference between the current and desired assignment and only moves the partitions that actually need to move. A consumer keeps processing the partitions it retains throughout the rebalance. The interruption is limited to the handful of partitions changing owners, not the entire group.
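The difference is easiest to see as set arithmetic. The sketch below is a simplified model, not the actual consumer protocol: it takes a current and a target assignment as plain dictionaries and computes which partitions each member has to give up under each approach. The member names and partition numbers are invented.

```python
def revoked_eager(current):
    """Eager: every member gives up everything it owns before reassignment."""
    return {member: set(parts) for member, parts in current.items()}


def revoked_cooperative(current, target):
    """Cooperative: a member gives up only what it owns now but not in the target."""
    return {member: set(parts) - set(target.get(member, ()))
            for member, parts in current.items()}


# Two consumers share six partitions, then a third consumer (c3) joins.
current = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
target = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}

eager = revoked_eager(current)
coop = revoked_cooperative(current, target)

print(sum(len(p) for p in eager.values()))  # 6
print(sum(len(p) for p in coop.values()))   # 2
print(coop["c1"], coop["c2"])               # {2} {5}
```

In this example four of the six partitions never stop being processed under the cooperative approach, while the eager approach pauses all six.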

Enable it on the consumer with the cooperative assignor:

partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor

The CooperativeStickyAssignor does two good things at once: it rebalances incrementally, and it is sticky, meaning it tries to keep partitions with the same consumer across rebalances to preserve local state and warm caches. For Kafka Streams, cooperative rebalancing is the default behavior.

Static membership and reducing rebalances

The best rebalance is the one that never happens. Two settings reduce unnecessary rebalances:

Static membership via group.instance.id. Giving each consumer a stable identity means a quick restart (a rolling deploy, a brief crash) does not trigger a rebalance at all, as long as the consumer rejoins within session.timeout.ms. The group treats it as the same member returning rather than a new one.
Tuned timeouts. session.timeout.ms and heartbeat.interval.ms control how quickly the group decides a consumer is gone. Too tight and transient hiccups cause spurious rebalances; too loose and real failures take longer to detect.

Pro Tip: Combine CooperativeStickyAssignor with static membership (group.instance.id) for the smoothest deploys. Cooperative rebalancing minimizes the disruption when a rebalance does occur, and static membership prevents the rebalance entirely for short restarts. Together they turn a rolling consumer deploy from a series of processing stalls into a near-seamless operation.
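Put together, the consumer settings might be generated per instance like this. It is a minimal sketch: the helper names, the group name, and the instance id (imagined here as a stable pod name) are hypothetical, the defaults of 45000 and 3000 ms mirror the Kafka 3.x client defaults, and the check that the heartbeat interval is at most a third of the session timeout follows the guideline in the Kafka configuration docs rather than a hard rule.

```python
def consumer_properties(group_id, instance_id,
                        session_timeout_ms=45000, heartbeat_interval_ms=3000):
    """Build consumer properties for cooperative rebalancing with static membership."""
    if not instance_id:
        raise ValueError("static membership needs a stable, non-empty instance id")
    # Guideline, not a broker-enforced rule: heartbeat <= 1/3 of session timeout.
    if heartbeat_interval_ms * 3 > session_timeout_ms:
        raise ValueError("heartbeat.interval.ms should be at most 1/3 of session.timeout.ms")
    return {
        "group.id": group_id,
        "group.instance.id": instance_id,
        "partition.assignment.strategy":
            "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
        "session.timeout.ms": session_timeout_ms,
        "heartbeat.interval.ms": heartbeat_interval_ms,
    }


def render_properties(props):
    """Render the settings as key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Hypothetical group and instance names; the id must be unique within the group
# and must stay the same across restarts of that instance.
props = consumer_properties("orders-processor", "orders-processor-0")
print(render_properties(props))
```

The instance id has to survive a restart for static membership to help, so derive it from something stable such as a StatefulSet pod name or a host name, never from a random value generated at startup.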

Choosing the right strategy

The strategies map cleanly to situations:

Situation	Strategy
Added a broker, need to fill it	Manual reassignment with --throttle
Draining a broker for maintenance	Manual reassignment moving replicas off it
Large, dynamic cluster, frequent change	Cruise Control with goal-based balancing
Automatic recovery from broker failure	Cruise Control self-healing
Smooth consumer deploys	CooperativeStickyAssignor + static membership
Reducing spurious consumer rebalances	Static membership and tuned session timeouts

The unifying theme across both kinds of rebalancing is the same: movement is necessary but movement is disruptive, so you control the blast radius. On the broker side that control is the throttle and Cruise Control’s paced execution. On the consumer side it is cooperative, incremental reassignment and static membership. In both cases the failure mode is the same naive instinct — move everything as fast as possible — and the fix is the same discipline of moving only what is needed, only as fast as the cluster can absorb.

Get reassignment throttling right and you will never again take down production trying to balance it. Adopt cooperative rebalancing and static membership and your consumer deploys stop being a source of latency spikes. Those two habits cover the overwhelming majority of rebalancing pain, and they pair naturally with solid Kafka monitoring so you can watch under-replicated partitions and consumer lag while the moves happen.

— James