Kafka without ZooKeeper: KRaft in production
Updated: 2026-05-03
Kafka 4.0 arrived in March 2025 and with it the ZooKeeper dependency disappeared for good. The promise had been a long time coming: KIP-500 was approved in 2019, KRaft mode was marked stable in Kafka 3.3 at the end of 2022, and 3.5 brought the migration path. What 4.0 does is close the door: you can no longer start a cluster with ZooKeeper, even if you want to. After months running KRaft clusters and completing forced migrations in a couple of real environments, it's time to review what really changes, what stays the same, and what to know before planning the jump.
For the data infrastructure context where Kafka operates, the Delta Lake vs. Iceberg comparison describes the storage layer that complements Kafka in modern architectures. The pattern of migrating incrementally away from an external dependency also applies to the Kubernetes 1.33 improvements, where platform upgrades follow a similar logic.
Key takeaways
- KRaft stores cluster metadata in an internal topic, __cluster_metadata, using Raft for consensus; no external system.
- A KRaft cluster starts in seconds versus tens of seconds or more than a minute with ZooKeeper.
- ZooKeeper metrics disappear; new controller-quorum metrics appear, and inherited dashboards need updating.
- Migration from Kafka 3.7/3.8 is done in dual mode without service interruption; Kafka 4.0 no longer supports ZooKeeper.
- If the cluster is on 2.x, the recommended path is going first to 3.5/3.6 with ZooKeeper before planning KRaft.
Why ZooKeeper was a problem
For more than a decade Kafka depended on ZooKeeper to keep cluster metadata: who the controller is, which brokers are alive, which topics exist, which replicas each partition lives on. ZooKeeper is solid but wasn't designed for the volume of metadata a large Kafka cluster generates. There was a practical ceiling: clusters with hundreds of thousands of partitions found ZooKeeper to be the bottleneck.
The second problem was operational: maintaining two different distributed systems with two different consensus models. A team had to know Kafka and also ZooKeeper, with its configuration, its commands, its failure modes. Every cluster expansion meant thinking about both sides.
KRaft is the answer: Kafka itself maintains its metadata using Raft as consensus algorithm, within a quorum of special brokers called controllers. No external system. Metadata is stored in an internal topic called __cluster_metadata and replicated following the same rules as any other topic.
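A quick way to see that quorum from the outside is the kafka-metadata-quorum tool that ships with KRaft-era Kafka. A minimal sketch, assuming a broker reachable at broker1:9092 (hostname illustrative):
# Show the Raft state backing __cluster_metadata: leader, epoch,
# high watermark, and the current voter set
bin/kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --status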
What clearly improves
The first thing you notice is startup. A KRaft cluster starts in seconds. The same cluster with ZooKeeper took dozens of seconds, sometimes more than a minute. For frequent deployments this changes the rhythm; for recoveries after a failure it changes even more.
The second improvement is the partition ceiling. The Apache Kafka team has published tests with KRaft clusters supporting millions of partitions per cluster, versus several hundred thousand where ZooKeeper started to suffer. Knowing the ceiling is high changes how you design your topics: you can be more generous with partitions without fearing metadata becomes the bottleneck.
The third improvement is operational. A single technology, a single configuration, a single set of commands. Admin commands like kafka-topics and kafka-configs keep working the same, but behind them there’s no longer an additional system that can fail on its own.
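In practice that means the same commands, always pointed at the brokers; the old --zookeeper flag is gone. A couple of everyday examples, with the hostname and topic name as illustrative assumptions:
# Create and inspect a topic against the brokers, never against ZooKeeper
bin/kafka-topics.sh --bootstrap-server broker1:9092 --create --topic orders --partitions 12 --replication-factor 3
bin/kafka-configs.sh --bootstrap-server broker1:9092 --describe --entity-type topics --entity-name orders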
What changes in operations
The deployment model changes in two points. First, brokers can take on the controller role in addition to the broker role. In a small cluster it’s common to have three nodes doing both. In a large cluster they’re separated: three or five nodes dedicated as controllers, the rest as pure brokers.
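For the small-cluster case, a minimal combined-mode sketch of server.properties, with node IDs and hostnames as illustrative assumptions:
# One node playing both roles; two more like it complete the quorum
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@node1:9093,2@node2:9093,3@node3:9093
listeners=PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT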
Second, the cluster ID concept appears from the start. In KRaft you have to generate this identifier with the kafka-storage tool during initialization. It’s not complicated but it’s a new step worth automating in the deployment playbook.
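The two commands worth baking into that playbook, assuming the stock config path:
# Generate the cluster ID once, then format every node's storage with it
KAFKA_CLUSTER_ID=$(bin/kafka-storage.sh random-uuid)
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/server.properties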
Monitoring changes too. ZooKeeper metrics disappear, and new metrics about the controller quorum appear: how far the followers lag behind the active controller, how long metadata updates take, how many events per second the active controller processes. Inherited Grafana dashboards need adjustment.
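Beyond JMX, the same quorum tool gives a quick lag check from the command line; a sketch, again with an illustrative hostname:
# Per-replica log end offset and lag within the controller quorum
bin/kafka-metadata-quorum.sh --bootstrap-server broker1:9092 describe --replication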
Migration without pain
Migration from a ZooKeeper cluster to KRaft is done in dual mode. The cluster runs with ZooKeeper and a KRaft quorum in parallel, metadata syncs, and once everything is verified the ZooKeeper dependency is cut. Kafka 3.7 and 3.8 support this mode. Kafka 4.0 no longer does: if you reach 4.0 you must have finished migrating.
The pattern in the environments where the migration has been done was similar (a configuration sketch for the dual-mode step follows the list):
- Update to 3.7 or 3.8 in ZooKeeper mode, as usual.
- Add KRaft controller nodes and activate dual mode; the cluster keeps working, no service interruption.
- Wait for metadata to sync, verify with admin commands that both sides agree.
- Switch broker roles from ZooKeeper to KRaft one by one.
- Once all brokers are in KRaft, decommission the ZooKeeper nodes.
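As a reference for step two, a minimal sketch of the controller-side settings that enable dual mode in 3.7/3.8. The hostnames and node ID are illustrative, and the official migration guide covers the broker side and the remaining steps:
# server.properties on a new KRaft controller during the migration window
process.roles=controller
node.id=3000
controller.quorum.voters=3000@controller1:9093
listeners=CONTROLLER://0.0.0.0:9093
controller.listener.names=CONTROLLER
# Migration switch: the controller talks to the existing ZooKeeper ensemble
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181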
Real time depends on cluster size. In a small cluster with three brokers and fifty topics, the full migration was done in one afternoon without downtime. In a medium cluster with fifteen brokers and several thousand topics, it took two days of preparation and a four-hour change window. In neither case was there message loss or service interruption.
Where it can hurt
There are points where migration hurts:
- External tools that talk directly to ZooKeeper: old management tools, internal scripts reading ZooKeeper nodes for inventory, third-party integrations that assume the previous model. All that stops working and must be cataloged before migrating and rewritten against the AdminClient.
- Very old client versions: Java clients from 2.x onward and librdkafka-based clients work unchanged, but very old custom clients can stumble.
- ACL and quota configuration: with ZooKeeper, many teams wrote these directly into ZooKeeper nodes; with KRaft they are managed through kafka-configs and kafka-acls. The data migration itself is automatic, but any deployment automation that touched ZooKeeper nodes must be rewritten (see the sketch after this list).
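The sketch mentioned above: the same quota and ACL operations expressed through the Admin API tooling instead of ZooKeeper writes, with the client name and topic as illustrative assumptions (and an authorizer already configured on the cluster):
# Set a producer quota for a client, then grant it read access to a topic
bin/kafka-configs.sh --bootstrap-server broker1:9092 --alter \
  --entity-type clients --entity-name reporting-app \
  --add-config 'producer_byte_rate=1048576'
bin/kafka-acls.sh --bootstrap-server broker1:9092 --add \
  --allow-principal User:reporting-app --operation Read --topic orders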
An example configuration in a KRaft controller's server.properties (node.id and controller.listener.names are also required in KRaft):
process.roles=controller
node.id=1
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://0.0.0.0:9093
When to wait before migrating
There are teams that should wait:
- Cluster on Kafka 2.x: migrating to KRaft involves skipping several major versions. The recommended path is going first to 3.5 or 3.6 with ZooKeeper, stabilizing, then planning the step to KRaft.
- Integrations with proprietary software that speaks ZooKeeper: before migrating it’s worth explicitly verifying with each critical vendor. By 2025 most have KRaft support, but there are exceptions.
- In the middle of another large migration: if the team is migrating to Kubernetes or changing cloud providers at the same time, adding the KRaft change multiplies risk. It’s better to sequence.
My read
KRaft is a clear improvement and the jump to Kafka 4.0 is, for most clusters, a sensible migration when done with enough lead time. The operational savings from removing ZooKeeper are real, the performance improvements are measurable, and the startup time improves the deployment experience. But it's not a migration to do on a Friday afternoon: it requires planning, an inventory of external dependencies, and a thoughtful change window.
The practical recommendation is to tackle 4.0 with time: start by reading the 3.7 or 3.8 migration documentation, test dual mode in a staging environment, catalog every external integration that touches ZooKeeper, and prepare the deployment playbooks ahead of time. Done carefully, the process takes a few weeks of distributed work and can finish without incidents.
What Kafka 4.0 consolidates is a simple idea: one technology, one consensus model, one thing to learn. For whoever operates Kafka daily this is the most significant change in years. For whoever just publishes and consumes messages, it’s invisible. Both are good signs.