Kafka Error Guide: 'OutOfOrderSequenceException' Out of Order Sequence Number
Fix Kafka OutOfOrderSequenceException: diagnose idempotent producer sequence gaps from dropped batches, message loss via unclean leader election, and PID resets; why it is non-recoverable and how to prevent it.
- #kafka
- #troubleshooting
- #errors
- #producer
Exact Error Message
org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number.
In a producer’s logs it appears as a fatal sender-thread error that abandons in-flight batches:
[2026-06-29 09:18:44,201] ERROR [Producer clientId=payments-writer-2, transactionalId=payments-eos-2] Aborting producer batches due to fatal error (org.apache.kafka.clients.producer.internals.Sender)
org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number for partition payments-9: expected sequence 40517 but received 40519.
at org.apache.kafka.clients.producer.internals.TransactionManager.handleSequenceOverflow(TransactionManager.java:1041)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:740)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:612)
What the Error Means
With idempotence or exactly-once semantics (EOS) enabled, every batch a producer sends to a partition carries a monotonically increasing sequence number, scoped to the (producerId, epoch, partition) triple. The broker remembers the last sequence it accepted for that triple and requires the next batch to be exactly one greater. This is what makes idempotence work: a duplicate (a retry of an already-accepted batch) is recognized and silently discarded, and a genuine gap is detected.
OutOfOrderSequenceException is raised when the broker receives a sequence number that is higher than expected — there is a hole. In the log line above, the broker had accepted up through 40516 and expected 40517 next, but the batch carried 40519. Sequences 40517 and 40518 are missing. Since the broker cannot accept 40519 without breaking the contiguous ordering guarantee, and it cannot manufacture the missing batches, it rejects the write. The gap means either a batch that the producer believes it sent was never durably persisted, or records the broker previously accepted have disappeared from the log. In both cases the producer’s notion of “what has been written” no longer matches the broker’s, and the idempotence guarantee is broken for that session. This is why Kafka treats it as fatal and non-recoverable: continuing would risk silent reordering or duplication.
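The accept / duplicate / gap decision described above can be sketched as a toy model. This is not Kafka's broker code: the class `SequenceCheckSketch`, its `Verdict` enum, and the one-sequence-per-batch simplification are assumptions made for illustration, and the real broker tracks additional state (for example, per-record sequence ranges and a bounded window of recent batches).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a per-(producerId, epoch, partition) sequence check; hypothetical, not Kafka's implementation. */
class SequenceCheckSketch {
    enum Verdict { ACCEPTED, DUPLICATE, OUT_OF_ORDER }

    // Key "producerId:epoch:partition" -> last sequence accepted for that key.
    private final Map<String, Integer> lastAccepted = new HashMap<>();

    Verdict offer(long producerId, int epoch, String partition, int sequence) {
        String key = producerId + ":" + epoch + ":" + partition;
        int last = lastAccepted.getOrDefault(key, -1); // -1: nothing accepted yet
        if (sequence <= last) {
            return Verdict.DUPLICATE;      // retry of a batch that is already in the log
        }
        if (sequence != last + 1) {
            return Verdict.OUT_OF_ORDER;   // a hole: at least one earlier sequence is missing
        }
        lastAccepted.put(key, sequence);   // contiguous: accept and advance
        return Verdict.ACCEPTED;
    }
}
```

Replaying the log line from this guide against the sketch: once sequences up through 40516 have been accepted for `payments-9`, offering 40519 yields `OUT_OF_ORDER`, re-offering 40516 yields `DUPLICATE`, and offering 40517 yields `ACCEPTED`.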
Common Causes
- max.in.flight.requests.per.connection > 5 with idempotence: idempotence guarantees ordering only up to 5 in-flight requests. With more, an earlier batch can fail and be retried after a later batch already landed, leaving a permanent sequence gap. Kafka enforces a ceiling of 5 for idempotent producers precisely to avoid this; overriding it (or older clients that allowed it) reintroduces the bug.
- A dropped or expired batch: if a batch hits delivery.timeout.ms and is dropped while subsequent batches succeed, the accepted stream now has a hole. The next send carries a sequence beyond the gap.
- Message loss on the broker via unclean leader election: if unclean.leader.election.enable=true, a non-in-sync replica can become leader and discard committed-but-not-yet-replicated records. The producer counted those records as written; the new leader never saw them, so the producer’s next sequence is ahead of the leader’s expectation.
- Producer ID expiry / reset: when a PID is expired and reset, the sequence counter resets too. A stale in-flight batch from the old PID context can then look out of order relative to the new state.
- Multiple producers sharing a transactional.id incorrectly, causing interleaved sequences under the same PID.
How to Reproduce the Error
The most direct reproduction uses an idempotent producer configured to allow more than five in-flight requests (on a client/version that does not hard-cap it) against a partition where an early batch is forced to fail and retry behind later ones; the configuration below sketches that setup. An alternative, conceptually cleaner reproduction is to enable unclean leader election and induce a leader change while a producer streams faster than replication.
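import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");
// Pushing in-flight beyond the idempotence-safe ceiling of 5 invites gaps.
// Clients that enforce the ceiling reject this combination with a ConfigException
// when the producer is constructed, so this only takes effect where it is not hard-capped.
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 10);
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);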
Then drive sustained load while injecting transient failures (e.g., briefly partitioning the leader) so an early batch retries after later ones commit. The next send surfaces the out-of-order sequence. Reproduce only on a disposable cluster, since unclean leader election destroys data.
Diagnostic Commands
Check whether unclean leader election is enabled at the broker or topic level, the most damaging cause:
kafka-configs.sh --bootstrap-server broker:9092 --describe \
--entity-type brokers --entity-name 1 --all | grep -i unclean
kafka-configs.sh --bootstrap-server broker:9092 --describe \
--entity-type topics --entity-name payments | grep -i unclean
Inspect the partition’s replica and ISR state to spot recent leader changes or shrunken ISR:
kafka-topics.sh --bootstrap-server broker:9092 --describe \
--topic payments --under-replicated-partitions
kafka-topics.sh --bootstrap-server broker:9092 --describe --topic payments
Confirm client/broker capabilities and surface the error context from logs:
kafka-broker-api-versions.sh --bootstrap-server broker:9092 | head -20
grep -i "OutOfOrderSequence\|out of order sequence\|unclean leader" \
/var/log/kafka/server.log
journalctl -u kafka --since "1 hour ago" | grep -iE "leader election|isr"
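If you capture the `kafka-topics.sh --describe` output during an incident, a small parser can flag partitions whose ISR is smaller than the replica set. The sketch below is one way to do that in plain Java. It assumes the per-partition lines contain `Topic:`, `Partition:`, `Leader:`, `Replicas:` and `Isr:` fields in that order; the exact layout can differ between Kafka versions, and the class and method names (`IsrShrinkSketch`, `shrunkenPartitions`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Flags partitions with a shrunken ISR from captured `kafka-topics.sh --describe` text (assumed format). */
class IsrShrinkSketch {
    // Expected per-partition line, e.g.:
    //   Topic: payments  Partition: 9  Leader: 2  Replicas: 2,3,1  Isr: 2,3
    private static final Pattern PARTITION_LINE = Pattern.compile(
        "Topic:\\s*(\\S+)\\s+Partition:\\s*(\\d+)\\s+Leader:\\s*(\\S+)"
        + "\\s+Replicas:\\s*([\\d,]+)\\s+Isr:\\s*([\\d,]*)");

    static List<String> shrunkenPartitions(String describeOutput) {
        List<String> flagged = new ArrayList<>();
        for (String line : describeOutput.split("\\R")) {
            Matcher m = PARTITION_LINE.matcher(line);
            if (!m.find()) {
                continue; // topic summary line or anything unrecognized
            }
            int replicas = m.group(4).split(",").length;
            String isr = m.group(5);
            int inSync = isr.isEmpty() ? 0 : isr.split(",").length;
            if (inSync < replicas) {
                flagged.add(m.group(1) + "-" + m.group(2) + " isr=" + inSync + "/" + replicas);
            }
        }
        return flagged;
    }
}
```

A partition flagged here around the time of the exception is a hint, not proof, that a leader change with a reduced ISR was involved; confirm against the broker logs.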
Step-by-Step Resolution
1. Stop overriding in-flight requests. Ensure max.in.flight.requests.per.connection is 5 or fewer for any idempotent producer. This is the single most common application-side cause and the easiest to fix:
enable.idempotence=true
acks=all
max.in.flight.requests.per.connection=5
retries=2147483647
2. Disable unclean leader election everywhere durability matters. With it off, an out-of-sync replica can never become leader, so committed records are not silently dropped:
unclean.leader.election.enable=false
min.insync.replicas=2
Combined with acks=all and replication.factor=3, a producer’s acknowledged writes survive a leader change as long as at least one in-sync replica remains available to take over.
3. Restart the producer to obtain a fresh PID. Because the exception is fatal, the existing producer instance is dead. The application must close it and create a new one. There is no in-place recovery; design your code to recreate the producer on this exception.
4. Investigate the gap’s origin using the ISR/leader-election diagnostics. If a leader change coincided with the error, message loss from unclean election is the likely root cause and you should audit data integrity downstream — records may genuinely be lost, not merely re-sent.
5. Check for PID expiry if the producer had been idle; an expired PID that was reset can produce a transient out-of-order condition. Keep the topic’s retention.ms and the broker’s producer.id.expiration.ms longer than the longest idle period you expect between sends.
6. For transactional producers, call initTransactions() on the new producer instance (using the same transactional.id), which fences the old instance and aborts any transaction it left open, then re-process from the last committed offset so EOS is preserved across the restart.
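The "recreate, don't retry" pattern from step 3 can be sketched independently of the Kafka client. Everything named here is a stand-in chosen so the example compiles with the JDK alone: `FatalSequenceError` plays the role of `OutOfOrderSequenceException`, `RecordSender` is a hypothetical minimal view of a producer, and `RecreatingSender` is the wrapper. With the real client, the failure usually arrives through the send callback or the returned future rather than as a synchronous throw, so treat this as an outline of the control flow only.

```java
import java.util.function.Supplier;

/** Stand-in for OutOfOrderSequenceException so the sketch needs no Kafka dependency. */
class FatalSequenceError extends RuntimeException {
    FatalSequenceError(String message) { super(message); }
}

/** Hypothetical minimal producer view: send one record, release resources on close. */
interface RecordSender {
    void send(String record);
    void close();
}

/** Discards a poisoned sender and builds a fresh one instead of retrying on the same instance. */
class RecreatingSender {
    private final Supplier<RecordSender> factory;
    private RecordSender current;
    private int rebuilds = 0;

    RecreatingSender(Supplier<RecordSender> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    /**
     * Returns true if the record was handed to the sender. Returns false if the
     * fatal error occurred; the caller should then replay from its last committed position.
     */
    boolean send(String record) {
        try {
            current.send(record);
            return true;
        } catch (FatalSequenceError e) {
            current.close();          // the old instance cannot be reused
            current = factory.get();  // a new instance obtains a fresh producer identity
            rebuilds++;
            return false;
        }
    }

    int rebuilds() { return rebuilds; }
}
```

Returning `false` rather than re-sending automatically is a deliberate choice in this sketch: after a sequence gap the application, not the wrapper, knows which offset is safe to resume from.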
Prevention and Best Practices
The durable preventions are structural: keep enable.idempotence=true with max.in.flight.requests.per.connection at or below 5, run acks=all with replication.factor=3 and min.insync.replicas=2, and set unclean.leader.election.enable=false cluster-wide. Together these ensure acknowledged batches are durably replicated and that leadership only moves to replicas that have them, which removes the two dominant causes of sequence gaps. Use long-lived producers so PIDs are not churned, and keep retention and PID-expiration windows comfortably above your producers’ idle gaps. Treat OutOfOrderSequenceException in code as a signal to recreate the producer rather than to retry. Monitor under-replicated partitions and ISR shrink events, since they precede the message-loss variant of this error.
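The producer-side half of these settings can be checked before a producer is ever built. The helper below is a minimal sketch, not part of the Kafka client: `IdempotenceConfigCheck` and `problems` are hypothetical names, and the rule that enable.idempotence and acks must be set explicitly is a choice made here because client defaults have changed between versions. Broker-side settings such as unclean.leader.election.enable are out of its scope.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Pre-flight check of producer properties against the idempotence-safe settings (hypothetical helper). */
class IdempotenceConfigCheck {
    // Reads raw values so that non-String entries (e.g. Boolean true, Integer 5) are handled too.
    private static String value(Properties props, String key) {
        Object v = props.get(key);
        return v == null ? null : v.toString().trim();
    }

    static List<String> problems(Properties props) {
        List<String> out = new ArrayList<>();
        if (!"true".equalsIgnoreCase(value(props, "enable.idempotence"))) {
            out.add("enable.idempotence should be set to true explicitly");
        }
        String acks = value(props, "acks");
        if (!"all".equals(acks) && !"-1".equals(acks)) {
            out.add("acks should be set to all (or -1)");
        }
        String inFlight = value(props, "max.in.flight.requests.per.connection");
        if (inFlight != null) { // unset: the client default applies
            try {
                int n = Integer.parseInt(inFlight);
                if (n < 1 || n > 5) {
                    out.add("max.in.flight.requests.per.connection must be between 1 and 5, was " + n);
                }
            } catch (NumberFormatException e) {
                out.add("max.in.flight.requests.per.connection is not an integer: " + inFlight);
            }
        }
        return out;
    }
}
```

Running such a check at application startup, and failing fast on a non-empty result, keeps an accidental override from reaching production.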
Related Errors
OutOfOrderSequenceException shares the broker’s per-PID sequence machinery with UnknownProducerIdException; the difference is that the PID here is known but a batch has the wrong sequence, whereas there the PID itself is missing, and PID expiry can lead to either. A TimeoutException (delivery timeout) that drops a batch is a frequent upstream trigger for the gap. If the gap is caused by unclean leader election dropping records, the underlying durability failure may also coincide with NotEnoughReplicasException when the ISR has shrunk below min.insync.replicas.
Frequently Asked Questions
Why is OutOfOrderSequenceException non-recoverable?
Because a detected gap means the producer’s and broker’s views of what has been written have permanently diverged. Retrying cannot fill the missing sequences, and accepting the out-of-order batch would break the ordering/dedup guarantee. Kafka marks the producer fatal so you must recreate it with a fresh PID.
Does setting max.in.flight to 5 fully prevent this?
It prevents the most common application-side cause — reordering among concurrent in-flight requests — but not gaps caused by broker-side message loss. You also need unclean.leader.election.enable=false with acks=all and proper replication to close the durability path.
Can this error indicate actual data loss?
Yes. When the cause is unclean leader election, records the producer considered committed were discarded by a new leader. The sequence gap is the symptom of genuine loss, so audit downstream data when a leader change coincides with the error.
How is this different from UnknownProducerIdException?
Both involve PID sequence state. UnknownProducerIdException means the broker has no metadata for the PID at all. OutOfOrderSequenceException means the broker knows the PID but received a sequence number that skips ahead of what it expected.
Should my application retry or restart the producer?
Restart. The producer instance is fatally poisoned after this exception. Catch it, close the producer, recreate it (re-running transaction recovery for EOS workloads), and resume from the last committed offset. Plain retries on the same instance will not work.