Migrating Kafka from ZooKeeper to KRaft

KRaft (Kafka Raft) is the metadata management mode that replaces ZooKeeper, and migrating to it is no longer optional for teams planning to stay current. ZooKeeper has been Kafka’s external metadata store since the beginning, holding broker registrations, topic configs, ACLs, and partition assignments. KRaft moves all of that into Kafka itself, managed by an internal Raft quorum of controllers, which removes the second distributed system you had to operate, secure, and reason about. ZooKeeper mode was deprecated in Kafka 3.5 and removed entirely in Kafka 4.0, so the question for most teams is not whether to migrate but when and how to do it safely. This guide covers why KRaft matters, the prerequisites that make migration safe, the migration steps, how to validate the result, and how to roll back if something goes wrong.

Why KRaft replaces ZooKeeper

The motivation is operational simplicity and scale. ZooKeeper was a separate cluster with its own configuration, its own failure modes, and its own security model. Every Kafka operator had to run and monitor two distributed systems that disagreed about the world often enough to cause real incidents. KRaft folds metadata management into Kafka’s own brokers and controllers, which delivers concrete benefits:

One system to operate. No separate ZooKeeper ensemble to provision, patch, secure, and monitor. Fewer moving parts means fewer failure modes.
Faster metadata operations and recovery. Metadata lives in an internal Kafka log replicated by a Raft quorum. Controller failover and metadata propagation are markedly faster than with the ZooKeeper-based controller.
Much higher partition scalability. The ZooKeeper-based design strained at very large partition counts because metadata had to be loaded and watched through ZooKeeper. KRaft handles far more partitions per cluster.
A single security model. No more separately securing ZooKeeper’s access alongside Kafka’s. The metadata path uses Kafka’s own mechanisms.

In KRaft, a subset of nodes act as controllers forming the Raft quorum that owns metadata, while brokers serve client traffic. Nodes can be dedicated controllers, dedicated brokers, or combined-mode for small clusters. The metadata itself lives in an internal topic, __cluster_metadata, replicated across the controller quorum.
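
To make the quorum sizing concrete, here is a minimal sketch that counts the voters in a quorum string and reports how many controller failures a Raft majority can absorb. It assumes the id@host:port format used by controller.quorum.voters; the function name quorum_tolerance is hypothetical.

```shell
#!/bin/sh
# Sketch: count the voters in a controller.quorum.voters-style string and
# report how many controller failures a Raft majority can absorb.
# Assumes comma-separated entries of the form id@host:port.
quorum_tolerance() {
  voters=$1
  # One entry per line, then count the lines that look like id@host:port.
  count=$(printf '%s\n' "$voters" | tr ',' '\n' | grep -c '@')
  # Raft needs a strict majority, so n voters tolerate (n - 1) / 2 failures.
  echo "voters=$count majority=$(( count / 2 + 1 )) tolerated_failures=$(( (count - 1) / 2 ))"
}

quorum_tolerance "3000@controller-0:9093,3001@controller-1:9093,3002@controller-2:9093"
# prints: voters=3 majority=2 tolerated_failures=1
```

A fourth voter would raise the majority to three without raising the tolerated failures, which is why odd quorum sizes are the usual choice.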

Prerequisites before you migrate

Migration is well-supported but unforgiving of skipped prerequisites. Confirm every one of these before touching production.

Run a Kafka 3.x release that supports ZooKeeper-to-KRaft migration. The migration tooling matured across the 3.x line and was declared production-ready in 3.6, and because Kafka 4.0 no longer runs in ZooKeeper mode at all, the migration has to happen on a 3.x release. Upgrade your ZooKeeper-mode cluster to that release first, rather than combining an upgrade with the migration.
Cluster ID in hand. You will need the existing cluster’s ID so the new KRaft controllers adopt the same cluster rather than forming a new one.
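All brokers on the same inter-broker protocol version. The inter.broker.protocol.version must be set to a value that supports migration on every broker before you begin. The metadata.version feature level takes over from this setting only once the cluster runs in KRaft mode.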
A tested backup of ZooKeeper data and broker configs. Migration has a rollback path, but only if you have not crossed the point of no return. Back up ZooKeeper state and snapshot your configs.
A staging rehearsal. Run the entire migration end-to-end in a non-production cluster that mirrors production topology first. The steps are mechanical but the failure modes are subtle.
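
Two of the prerequisites above, the cluster ID and the protocol version, can be read straight from files on a broker. The sketch below assumes the broker's log directory contains a meta.properties file with a cluster.id entry; read_prop is a hypothetical helper and the file paths in the comments are placeholders.

```shell
#!/bin/sh
# read_prop FILE KEY: print the value of KEY from a Java-style properties file.
read_prop() {
  awk -F= -v key="$2" '
    /^[ \t]*#/ { next }                                   # skip comment lines
    { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }            # trim the key
    k == key { sub(/^[^=]*=[ \t]*/, ""); print; exit }    # print text after the first "="
  ' "$1"
}

# Against a real broker (placeholder paths, adjust to your layout):
#   read_prop /var/lib/kafka/data/meta.properties cluster.id
#   read_prop /opt/kafka/config/server.properties inter.broker.protocol.version

# Self-contained demo using a throwaway file with made-up values.
demo=$(mktemp)
printf 'version=0\nbroker.id=1\ncluster.id=EXAMPLE-CLUSTER-ID\n' > "$demo"
read_prop "$demo" cluster.id
rm -f "$demo"
```

Running the same check on every broker and comparing the values is a cheap way to confirm the cluster is aligned before you start.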

Pro Tip: Capture the full pre-migration topic, ACL, and config inventory with kafka-topics.sh --describe, kafka-acls.sh --list, and kafka-configs.sh --describe (run once per entity type, such as topics and brokers) and store the output. After migration you will diff against it to prove nothing was lost in the metadata transfer.
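
A minimal sketch of that capture, assuming the Kafka CLI scripts are on PATH and the cluster needs no client authentication (a secured cluster would also need --command-config). The function name and output file names are hypothetical.

```shell
#!/bin/sh
# capture_inventory BOOTSTRAP OUTDIR: save topic, ACL, and config listings
# into OUTDIR. Stops at the first command that fails so a partial capture
# is never mistaken for a complete one.
capture_inventory() {
  bootstrap=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  kafka-topics.sh --bootstrap-server "$bootstrap" --describe \
    > "$outdir/topics.txt" || return 1
  kafka-topics.sh --bootstrap-server "$bootstrap" --list \
    > "$outdir/topic-names.txt" || return 1
  kafka-acls.sh --bootstrap-server "$bootstrap" --list \
    > "$outdir/acls.txt" || return 1
  kafka-configs.sh --bootstrap-server "$bootstrap" --entity-type topics --describe \
    > "$outdir/topic-configs.txt" || return 1
  kafka-configs.sh --bootstrap-server "$bootstrap" --entity-type brokers --describe \
    > "$outdir/broker-configs.txt" || return 1
}

# Example (hypothetical directory name):
#   capture_inventory kafka:9092 inventory-pre-migration
```

The plain topic-name list is captured separately because it is the easiest file to diff cleanly; the --describe output also records leaders and in-sync replicas, which change legitimately across broker restarts.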

How the migration works

The ZooKeeper-to-KRaft migration is a controlled, online process. The headline is that it does not require a big-bang cutover or extended downtime — brokers keep serving traffic while metadata moves. The mechanism is a dedicated KRaft controller quorum that runs in a special migration mode, ingests the existing metadata from ZooKeeper, and then takes ownership once the copy is complete.

The phases, at a high level:

Provision a KRaft controller quorum in migration mode. Stand up new controller nodes configured to migrate, pointing at the existing ZooKeeper so they can read current metadata.
Roll the brokers into migration mode. Restart each broker with config that points it at the KRaft controllers while it stays connected to ZooKeeper. Brokers continue serving clients throughout this rolling restart.
The controllers ingest ZooKeeper metadata. Once the last broker has registered with the new quorum, the migration controllers copy topics, configs, ACLs, and partition state into the KRaft metadata log. From this point on, in the dual-write phase, metadata changes are written to both stores so neither falls behind.
Restart the brokers as KRaft brokers. After the copy completes, roll each broker once more with a KRaft-mode config. The controllers remain in migration mode and keep dual-writing.
Finalize the migration. Once all brokers are in KRaft mode and metadata is fully synchronized, take the controllers out of migration mode by removing the migration and ZooKeeper settings from their configs and restarting them. ZooKeeper is now disconnected from the metadata path.
Decommission ZooKeeper. After a stabilization period in pure KRaft mode, shut down and remove the ZooKeeper ensemble.

Each new KRaft controller’s storage is formatted with the existing cluster ID using the storage tool:

# Format the controller's metadata log with the EXISTING cluster ID
kafka-storage.sh format \
  --cluster-id <existing-cluster-id> \
  --config /opt/kafka/config/kraft/controller.properties

A migration-mode controller’s config carries the flags that connect it to ZooKeeper and enable migration:

# controller.properties (migration mode)
process.roles=controller
# node.id must not collide with any existing ZooKeeper-mode broker.id
node.id=3000
controller.quorum.voters=3000@controller-0:9093,3001@controller-1:9093,3002@controller-2:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://:9093

# Enable ZooKeeper-to-KRaft migration
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-0:2181,zk-1:2181,zk-2:2181

And each broker, during its migration-mode restart, is told to migrate while still reachable by both worlds:

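# broker server.properties (migration mode)
# broker.id, listeners, and inter.broker.protocol.version stay as already configured.
# listener.security.protocol.map must also include an entry for the CONTROLLER listener.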
zookeeper.metadata.migration.enable=true
controller.quorum.voters=3000@controller-0:9093,3001@controller-1:9093,3002@controller-2:9093
controller.listener.names=CONTROLLER
zookeeper.connect=zk-0:2181,zk-1:2181,zk-2:2181
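
Once the metadata copy has completed, each broker is restarted again as a KRaft broker. The sketch below shows one way to derive that config from the migration-mode file. It assumes the changes the Kafka migration documentation describes: broker.id becomes node.id, process.roles=broker is added, and the ZooKeeper, migration, and inter.broker.protocol.version settings are dropped. The function name is hypothetical, and the output should be reviewed by hand before it goes anywhere near a broker.

```shell
#!/bin/sh
# to_kraft_broker FILE: print a KRaft-mode version of a migration-mode broker
# properties file on stdout. The input file is not modified.
to_kraft_broker() {
  echo "process.roles=broker"
  awk '
    # Drop every zookeeper.* key, including the migration flag.
    /^[ \t]*zookeeper\./ { next }
    # KRaft uses the metadata.version feature level instead of the IBP.
    /^[ \t]*inter\.broker\.protocol\.version[ \t]*=/ { next }
    # Avoid a duplicate process.roles line if one is already present.
    /^[ \t]*process\.roles[ \t]*=/ { next }
    # KRaft identifies nodes by node.id rather than broker.id.
    /^[ \t]*broker\.id[ \t]*=/ { sub(/broker\.id/, "node.id") }
    { print }
  ' "$1"
}

# Example (placeholder paths):
#   to_kraft_broker /opt/kafka/config/server.properties > /tmp/server.kraft.properties
```

Writing to a separate file and diffing it against the original keeps the change reviewable, and leaves the migration-mode config on disk in case you need to roll back.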

Pro Tip: Watch the active controller’s logs for the migration state transitions and confirm the initial metadata copy has completed before you move brokers to KRaft mode or finalize. Finalizing while metadata is still syncing is the mistake that causes lost configs. The migration is event-driven, not time-boxed: wait for the “migration complete” signal, and do not guess based on elapsed time.
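
A small sketch of waiting on that signal instead of a timer. The default pattern is an assumption based on the completion message described in the Kafka documentation, so confirm the exact wording in your version's controller log and pass your own pattern if it differs. Both function names are hypothetical.

```shell
#!/bin/sh
# migration_completed LOGFILE [PATTERN]: succeed if the controller log
# contains the completion message.
# ASSUMPTION: the default pattern below matches your Kafka version's log line.
migration_completed() {
  pattern=${2:-"Completed migration of metadata from Zoo[Kk]eeper to KRaft"}
  grep -q -- "$pattern" "$1" 2>/dev/null
}

# wait_for_migration LOGFILE [INTERVAL_SECONDS]: poll until the message
# appears. There is deliberately no timeout: the signal decides, not the clock.
wait_for_migration() {
  until migration_completed "$1"; do
    sleep "${2:-10}"
  done
  echo "migration complete signal found in $1"
}

# Example (placeholder path, run on the active controller):
#   wait_for_migration /var/log/kafka/controller.log 30
```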

Validating the migration

Do not declare success on the absence of errors. Prove the metadata transferred and the cluster behaves.

# Confirm metadata is being served from KRaft, not ZooKeeper
kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --status

# Re-capture the topic inventory in the same format as the pre-migration capture, then diff the two
kafka-topics.sh --bootstrap-server kafka:9092 --describe

# Verify ACLs survived the transfer
kafka-acls.sh --bootstrap-server kafka:9092 --list

# Confirm per-topic configs like min.insync.replicas are intact
kafka-configs.sh --bootstrap-server kafka:9092 \
  --describe --topic orders

The kafka-metadata-quorum.sh ... describe --status output shows the current leader, the quorum voters, and how caught-up each replica is — this is your proof the Raft quorum is healthy and serving metadata. Run a real produce-and-consume round trip against a test topic, force a controlled leader election, and confirm a broker restart rejoins cleanly. Diff every inventory against your pre-migration capture so a missing ACL or dropped config surfaces now, not in a future incident.
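
Both checks can be scripted. The sketch below assumes the describe --status output contains LeaderId: and MaxFollowerLag: lines, and that your captures are plain text files; the function names and file paths are hypothetical.

```shell
#!/bin/sh
# quorum_lag_ok [MAX_LAG]: read "describe --status" output on stdin and
# succeed only if a leader is reported and MaxFollowerLag <= MAX_LAG (default 0).
quorum_lag_ok() {
  awk -v max="${1:-0}" '
    $1 == "LeaderId:"       { leader = $2 }
    $1 == "MaxFollowerLag:" { lag = $2; seen = 1 }
    END { exit !(leader != "" && leader >= 0 && seen && lag <= max) }
  '
}

# diff_sorted PRE POST: compare two captures while ignoring line order.
# Returns the exit status of diff: 0 when identical, 1 when they differ.
diff_sorted() {
  a=$(mktemp) || return 2
  b=$(mktemp) || return 2
  sort "$1" > "$a"
  sort "$2" > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}

# Examples (placeholder host and paths):
#   kafka-metadata-quorum.sh --bootstrap-server kafka:9092 describe --status | quorum_lag_ok 5
#   diff_sorted inventory-pre/topic-names.txt inventory-post/topic-names.txt
```

Sorting discards the grouping in multi-line listings such as ACLs, so treat a clean sorted diff as necessary rather than sufficient and read the raw ACL and config captures side by side as well.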

Rollback: the window and the procedure

The migration is designed to be reversible, but only up to a specific point: rollback is possible as long as you have not finalized the migration and disabled ZooKeeper. During the migration and dual-write phase, ZooKeeper still holds current metadata because both stores are being written, so you can revert.

The rollback procedure, while still in migration mode:

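Shut down the migration-mode KRaft controllers. Stop and remove the controller quorum you provisioned for the migration so it no longer acts as the cluster’s controller.
Clear the controller claim in ZooKeeper. While a KRaft controller is active it holds the /controller znode; per the Kafka documentation, that znode has to be deleted with zookeeper-shell.sh so a ZooKeeper-mode broker can be elected controller again. Do this promptly, because the cluster has no controller in between.
Reconfigure and restart brokers in ZooKeeper mode. Remove the migration and KRaft controller settings (zookeeper.metadata.migration.enable, controller.quorum.voters, controller.listener.names) from their configs and perform a rolling restart. Because ZooKeeper was kept current through dual writes, it still holds accurate metadata and the brokers resume against it.
The exact steps depend on how far the migration has progressed, so check the revert guidance in the Kafka documentation for the phase you are in.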
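
The broker config change in the rollback procedure above can be sketched as a filter that strips the migration and KRaft controller settings. This assumes a broker that is still a ZooKeeper-mode broker in migration mode, so broker.id and zookeeper.connect are already present and are left alone. The function name is hypothetical.

```shell
#!/bin/sh
# to_zk_only FILE: print FILE without the migration flag and the KRaft
# controller settings. zookeeper.connect and broker.id are left in place.
to_zk_only() {
  awk '
    /^[ \t]*zookeeper\.metadata\.migration\.enable[ \t]*=/ { next }
    /^[ \t]*controller\.quorum\.voters[ \t]*=/ { next }
    /^[ \t]*controller\.listener\.names[ \t]*=/ { next }
    { print }
  ' "$1"
}

# Example (placeholder paths):
#   to_zk_only /opt/kafka/config/server.properties > /tmp/server.zk.properties
```

Any CONTROLLER entry left in listener.security.protocol.map is not removed by this sketch; clean it up by hand if you want the config to match its pre-migration state exactly.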

The hard rule is that once you finalize the migration and disconnect ZooKeeper, the dual-write safety net is gone and rollback is no longer a configuration change — your only recovery is restoring from backup. That is exactly why you keep ZooKeeper running and untouched through a stabilization period after the brokers are in KRaft mode, and why you do not decommission it the same day you finalize.

Pro Tip: Treat “finalize” as a one-way door and schedule it deliberately. Run in the dual-write phase long enough to be confident — through at least one normal traffic peak — before you finalize. The cost of staying in migration mode an extra day is trivial compared to the cost of a failed finalize with no rollback.

Key takeaways

| Point | Details |
| --- | --- |
| KRaft removes ZooKeeper | One system to operate, faster recovery, far higher partition scalability, unified security. |
| Prerequisites are mandatory | Recent supported 3.x, same cluster ID, aligned protocol version, tested backups, staging rehearsal. |
| Migration is online | Brokers keep serving traffic; a migration-mode controller quorum ingests metadata via dual writes. |
| Validate against a pre-capture | Diff topics, ACLs, and configs after migration; confirm the quorum with kafka-metadata-quorum.sh. |
| Finalize is a one-way door | Rollback works only before finalizing and disabling ZooKeeper; keep ZooKeeper running during stabilization. |

What I would not rush

Having run enough stateful migrations to be appropriately nervous about this one, the discipline I would not skip is the staging rehearsal. The ZooKeeper-to-KRaft path is genuinely well-built and the online migration works as advertised, but the failure modes are subtle and the documentation assumes you read the migration state transitions carefully. Rehearsing on a cluster that mirrors production topology turns surprises into known steps.

The single decision I would treat with the most respect is finalizing. Everything before it is reversible because ZooKeeper is kept current through dual writes; everything after it is a backup-restore exercise. I would deliberately sit in the dual-write phase through a full traffic peak, validate against my pre-migration inventory three times, and only then walk through the one-way door. There is no prize for finalizing fast.

My read: this migration is mature enough to do with confidence on production Kafka, but it rewards patience over speed. Back up, rehearse, validate against a captured inventory, and treat the finalize step as the irreversible commitment it is.

— James

FAQ

Why migrate Kafka from ZooKeeper to KRaft?

KRaft removes the separate ZooKeeper cluster, giving you one system to operate, faster metadata operations and controller failover, much higher partition scalability, and a single security model. ZooKeeper mode was deprecated in Kafka 3.5 and removed in Kafka 4.0.

Does the migration require downtime?

No. The migration is an online process: brokers keep serving client traffic while a migration-mode KRaft controller quorum ingests metadata from ZooKeeper through a dual-write phase and brokers are rolled into KRaft mode.

Can I roll back if the migration goes wrong?

Yes, but only before you finalize and disable ZooKeeper. During the dual-write phase ZooKeeper stays current, so you can revert brokers to ZooKeeper mode. After finalizing, recovery requires restoring from backup.

What should I validate after migrating?

Confirm the metadata quorum is healthy with kafka-metadata-quorum.sh, then diff topics, ACLs, and per-topic configs against a pre-migration capture to prove nothing was lost, and run a real produce-consume round trip.