Kafka Error Guide: 'TimeoutException: Expiring N record(s)' Producer Send Timeout
Fix Kafka producer 'TimeoutException: Expiring 5 record(s) ... ms has passed since batch creation': tune delivery.timeout.ms, request.timeout.ms, linger.ms, batch.size and buffer.memory.
- #kafka
- #troubleshooting
- #errors
- #producer
Exact Error Message
This exception is delivered to your producer’s send() callback (or thrown from a blocking future.get()) when a batch cannot be acknowledged inside its allotted time budget:
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for orders-0:30000 ms has passed since batch creation
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeExceptionally(ProducerBatch.java:...)
at org.apache.kafka.clients.producer.internals.RecordAccumulator.expireBatches(RecordAccumulator.java:...)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:...)
at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:...)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:...)
at java.base/java.lang.Thread.run(Thread.java:840)
In application logs it usually surfaces through the asynchronous callback rather than on the calling thread:
[2026-06-29 14:02:11,883] ERROR Failed to deliver record to orders (com.acme.OrderProducer)
org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for orders-0:30000 ms has passed since batch creation
A closely related variant originates on the broker rather than the client: a produce request that times out on the broker side returns the REQUEST_TIMED_OUT error code, which the producer treats as a retriable TimeoutException for that specific in-flight request. In the message above, the 30000 ms figure is the time elapsed since the batch was created, measured against delivery.timeout.ms. orders-0 identifies the topic-partition whose batch expired.
What the Error Means
Every record you call send() on is appended to an in-memory RecordAccumulator batch keyed by topic-partition. From the moment that batch is created, a clock starts ticking against delivery.timeout.ms (default 120000 ms). If the batch has not been acknowledged by all required replicas before that deadline, the producer gives up, removes the batch from the accumulator, and completes its callback exceptionally with TimeoutException.
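The expiry rule can be sketched in a few lines. This is a simplified model, not the producer's actual code. The class and method names are hypothetical. It assumes a batch expires once the time since its creation reaches delivery.timeout.ms, and that the number in the message is that elapsed time.

```java
import java.util.Optional;

/** Hypothetical model of the producer's batch-expiry rule; not the real RecordAccumulator code. */
final class BatchExpiryModel {

    /** A batch expires once the time since its creation reaches delivery.timeout.ms. */
    static boolean hasExpired(long createdMs, long nowMs, long deliveryTimeoutMs) {
        return nowMs - createdMs >= deliveryTimeoutMs;
    }

    /** Formats a message shaped like the client's; the figure is elapsed time, not the config value. */
    static Optional<String> expiryMessage(int recordCount, String topicPartition,
                                          long createdMs, long nowMs, long deliveryTimeoutMs) {
        if (!hasExpired(createdMs, nowMs, deliveryTimeoutMs)) {
            return Optional.empty();
        }
        return Optional.of(String.format(
                "Expiring %d record(s) for %s:%d ms has passed since batch creation",
                recordCount, topicPartition, nowMs - createdMs));
    }

    public static void main(String[] args) {
        // A batch of 5 records for orders-0, created at t=0 with a 30000 ms budget.
        System.out.println(expiryMessage(5, "orders-0", 0, 29_999, 30_000)); // prints Optional.empty
        System.out.println(expiryMessage(5, "orders-0", 0, 30_012, 30_000).orElseThrow());
    }
}
```

The second call shows why a real log line can report a figure slightly above the configured value. The check only runs when the sender thread wakes up, so the measured elapsed time can overshoot the budget by a few milliseconds.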
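The message wording is precise. “Expiring 5 record(s)” means five records were sitting in that one batch. “30000 ms has passed since batch creation” reports how long the batch had existed when it was expired. A batch expires as soon as that elapsed time reaches delivery.timeout.ms, so a figure of 30000 (or a few milliseconds more) indicates delivery.timeout.ms was set to 30000.

This is an end-to-end deadline that spans two sub-phases:
- Time waiting in the accumulator, governed by linger.ms and batch.size, plus any wait for a reachable leader.
- Time spent in flight waiting for a broker response, governed by request.timeout.ms plus retries/retry.backoff.ms.

Time that send() spends blocked because buffer.memory is exhausted is bounded separately by max.block.ms and occurs before a new batch is created. Even so, a full buffer usually means the batches already queued are aging toward their deadline. Any phase that stalls can blow the overall budget.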
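To see how quickly a tight budget is consumed, the arithmetic below estimates how many fully timed-out produce attempts fit inside delivery.timeout.ms. It is a rough worst-case sketch that rests on three simplifications:
- The batch waits the full linger.ms once.
- Every attempt runs all the way to request.timeout.ms.
- retry.backoff.ms is a fixed pause, assumed here to be 100 ms. Newer clients back off exponentially up to retry.backoff.max.ms, so the real count can be lower.

The class name is hypothetical.

```java
/** Hypothetical worst-case estimate of how many timed-out attempts fit in the delivery budget. */
final class DeliveryBudget {

    static long worstCaseAttempts(long deliveryTimeoutMs, long lingerMs,
                                  long requestTimeoutMs, long retryBackoffMs) {
        long afterLinger = deliveryTimeoutMs - lingerMs; // budget left once the batch is ready to send
        if (afterLinger < requestTimeoutMs) {
            return 0;                                     // not even one full attempt fits
        }
        // The first attempt has no backoff before it; each retry costs backoff + timeout.
        return 1 + (afterLinger - requestTimeoutMs) / (retryBackoffMs + requestTimeoutMs);
    }

    public static void main(String[] args) {
        // Reproduction config from this guide: 30000 / 5 / 10000, with an assumed 100 ms backoff.
        System.out.println(worstCaseAttempts(30_000, 5, 10_000, 100));   // prints 2
        // Suggested starting point from this guide: 120000 / 20 / 30000, same backoff.
        System.out.println(worstCaseAttempts(120_000, 20, 30_000, 100)); // prints 3
    }
}
```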
Crucially, expiration can happen before the record ever leaves the JVM. The batch is never sent and times out purely on the client side if any of the following holds:
- The leader for orders-0 is unavailable.
- The producer's metadata is stale.
- Outstanding requests to a slow broker leave no room to send more.
Common Causes
- delivery.timeout.ms set too low. Lowering it to 30000 (as in the message) leaves little headroom once you account for request.timeout.ms plus retries. The constraint delivery.timeout.ms >= linger.ms + request.timeout.ms must hold. If you set delivery.timeout.ms explicitly and it does not, the producer refuses to start.
- Broker unreachable or leader unavailable. A partition whose leader is down (or mid-election) has no destination, so batches accumulate and expire. Look for NOT_LEADER_OR_FOLLOWER errors and metadata refreshes preceding the timeout.
- Slow broker or overloaded cluster. High request-handler queue depth, slow disks, or acks=all with a lagging follower can make each produce request slow enough that retries exhaust the budget.
- buffer.memory exhausted. When the 32 MB default buffer fills (downstream is slower than your produce rate), send() blocks for up to max.block.ms. That wait is bounded separately and happens before a new batch is created, but it signals a backlog: batches already in the accumulator keep aging toward delivery.timeout.ms while the buffer drains.
- Network saturation or throttling. Quota throttling (producer_byte_rate) or a saturated NIC delays responses, tripping request.timeout.ms repeatedly.
- Large batches with aggressive linger. With a large batch.size, a high linger.ms and low throughput, batches rarely fill. Each one waits out the full linger.ms, consuming part of a tight delivery budget before it is even sent.
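In the Java client, the producer checks delivery.timeout.ms >= linger.ms + request.timeout.ms when it is constructed. An explicitly configured delivery.timeout.ms that violates the check is rejected with a ConfigException. An unset delivery.timeout.ms is instead raised to linger.ms + request.timeout.ms, with a warning.

The sketch below mirrors that check so you can validate settings before deploying. It also reports the headroom left for queueing delays and retries. The class and method names are hypothetical.

```java
/** Hypothetical pre-deploy check mirroring the producer's delivery-timeout invariant. */
final class TimeoutInvariant {

    /** Returns the headroom in ms, or throws if delivery.timeout.ms is too small. */
    static long checkHeadroom(long deliveryTimeoutMs, long lingerMs, long requestTimeoutMs) {
        long minimum = lingerMs + requestTimeoutMs;
        if (deliveryTimeoutMs < minimum) {
            throw new IllegalArgumentException("delivery.timeout.ms (" + deliveryTimeoutMs
                    + ") must be >= linger.ms + request.timeout.ms (" + minimum + ")");
        }
        return deliveryTimeoutMs - minimum; // time left for queueing delays and retries
    }

    public static void main(String[] args) {
        // Reproduction config: 30000 / 5 / 10000 leaves 19995 ms of headroom.
        System.out.println(checkHeadroom(30_000, 5, 10_000));
        try {
            checkHeadroom(20_000, 20, 30_000); // invalid: linger + request timeout exceed the budget
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```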
How to Reproduce the Error
Point a producer at a partition whose leader you can make unreachable, and set a deliberately tight delivery budget:
bootstrap.servers=kafka-1.internal:9092
acks=all
delivery.timeout.ms=30000
request.timeout.ms=10000
linger.ms=5
batch.size=16384
buffer.memory=33554432
max.block.ms=60000
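import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
// loadProducerProps() stands in for however you load the properties above.
Properties props = loadProducerProps();
// The properties above define no serializers, so pass them to the constructor explicitly.
try (KafkaProducer<String, String> producer =
        new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
    for (int i = 0; i < 5; i++) {
        producer.send(new ProducerRecord<>("orders", "key-" + i, "payload-" + i),
                (md, ex) -> { if (ex != null) ex.printStackTrace(); });
    }
    producer.flush(); // blocks until every callback fires; here with TimeoutException after ~30s
}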
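Before running the code, make the orders-0 leader unreachable in one of two ways:
- Stop its broker. Do this on a single-replica topic, so no other replica can take over.
- Block client traffic to its port 9092 with a firewall rule.

Keep bootstrap.servers pointed at a different, healthy broker so the producer can still load metadata. With no reachable leader to accept the batch, the five buffered records expire once 30 seconds have elapsed.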
Diagnostic Commands
Confirm the partition has a live leader and healthy ISR:
kafka-topics.sh --bootstrap-server kafka-1.internal:9092 --describe --topic orders
Inspect the effective producer-relevant broker config (request handler threads, message size limits). Adding --all (Kafka 2.5 or later) lists defaults as well as dynamic overrides:
kafka-configs.sh --bootstrap-server kafka-1.internal:9092 --describe --entity-type brokers --entity-name 1 --all
Verify broker API versions are reachable (a clean response proves basic connectivity):
kafka-broker-api-versions.sh --bootstrap-server kafka-1.internal:9092
Scan client logs for the sequence of metadata refreshes and disconnects that precede expiration:
grep -E "Expiring|NOT_LEADER|disconnected|metadata" /var/log/orders-producer/app.log
Check the broker’s own log for slow-request or replication warnings around the timestamp:
journalctl -u kafka --since "2026-06-29 14:00:00" --until "2026-06-29 14:05:00" | grep -iE "slow|timed out|isr"
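If you would rather extract the numbers programmatically than read grep output, a small parser can pull three fields from each expiration line: the record count, the topic-partition and the elapsed time. This is a sketch under assumptions: the regular expression targets the message format quoted at the top of this guide, and the class name is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for producer expiry lines such as the one quoted in this guide. */
final class ExpiryLineParser {

    /** Parsed fields of one "Expiring N record(s)" message. */
    static final class Expiry {
        final int records;
        final String topicPartition;
        final long elapsedMs;

        Expiry(int records, String topicPartition, long elapsedMs) {
            this.records = records;
            this.topicPartition = topicPartition;
            this.elapsedMs = elapsedMs;
        }
    }

    // Topic names use letters, digits, '.', '_' and '-'; the partition number follows the last '-'.
    private static final Pattern EXPIRY = Pattern.compile(
            "Expiring (\\d+) record\\(s\\) for ([A-Za-z0-9._-]+-\\d+):(\\d+) ms has passed since batch creation");

    static Optional<Expiry> parse(String line) {
        Matcher m = EXPIRY.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Expiry(Integer.parseInt(m.group(1)), m.group(2), Long.parseLong(m.group(3))));
    }

    public static void main(String[] args) {
        String line = "org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) "
                + "for orders-0:30000 ms has passed since batch creation";
        // prints "orders-0 expired 5 records after 30000 ms"
        parse(line).ifPresent(e -> System.out.println(
                e.topicPartition + " expired " + e.records + " records after " + e.elapsedMs + " ms"));
    }
}
```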
Step-by-Step Resolution
- Re-establish a healthy leader first. If --describe shows Leader: -1 or Leader: none (depending on the version), or a shrunken ISR, the timeout is a symptom of an unavailable partition. Restore the broker or wait out the election before tuning anything.
- Give the delivery budget realistic headroom. Raise delivery.timeout.ms back toward the default and keep the invariant delivery.timeout.ms >= linger.ms + request.timeout.ms. A sane starting point:
  delivery.timeout.ms=120000
  request.timeout.ms=30000
  linger.ms=20
- Right-size batching. Increase batch.size (e.g. 65536) and use a modest linger.ms so batches fill efficiently rather than dribbling out. This reduces request count and per-record overhead under load.
- Relieve buffer pressure. If logs show send() blocking, raise buffer.memory or reduce the produce rate. Watch the buffer-available-bytes and bufferpool-wait-ratio JMX metrics.
- Tune for slow brokers, not against them. Keep retries high (the default is effectively Integer.MAX_VALUE) and let delivery.timeout.ms be the real cap. With enable.idempotence=true, retries stay safe and ordered.
- Reconsider the cost of acks=all. If min.insync.replicas plus a lagging follower is the bottleneck, fix replication health rather than weakening durability.
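The settings from the steps above can be collected in one place. This is a minimal sketch, assuming you build producer configuration in code with java.util.Properties:
- The keys are the standard producer property names.
- The values are the starting points suggested in this guide.
- retries is deliberately left at its default so that delivery.timeout.ms remains the real cap.

The method name is hypothetical. Pass the result to KafkaProducer along with your serializers.

```java
import java.util.Properties;

/** Hypothetical helper collecting the starting-point producer settings from this guide. */
final class TunedProducerConfig {

    static Properties tunedProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "all");                   // keep durability; fix replication health instead
        props.put("enable.idempotence", "true");    // retries stay ordered and duplicate-free
        props.put("delivery.timeout.ms", "120000"); // the end-to-end cap
        props.put("request.timeout.ms", "30000");   // per-request cap, well inside the budget
        props.put("linger.ms", "20");               // modest wait so batches can fill
        props.put("batch.size", "65536");           // larger batches, fewer requests under load
        // retries is left at its default so delivery.timeout.ms bounds the retry loop.
        return props;
    }

    public static void main(String[] args) {
        tunedProducerProps("kafka-1.internal:9092")
                .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```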
Prevention and Best Practices
- Treat delivery.timeout.ms as your single end-to-end SLA knob and derive the others from it. Never set it below linger.ms + request.timeout.ms.
- Alert on the producer JMX metrics record-error-rate, request-latency-avg, and bufferpool-wait-ratio so you see pressure before batches expire.
- Keep enable.idempotence=true so aggressive retries do not introduce duplicates or reordering.
- Capacity-plan brokers for acks=all: a slow follower silently inflates produce latency.
- When timeouts spike during an incident, capture the stack trace together with broker state (leader, ISR, request latency) so the root-cause review starts from evidence.
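The producer metrics named above are also exposed over JMX, so an in-process health check can read them directly. This is a sketch under assumptions:
- It uses only javax.management.
- It expects the producer to run in the same JVM and to register MBeans as kafka.producer:type=producer-metrics,client-id=<id>.
- The class name is hypothetical.

With no producer running, the query simply finds nothing.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical in-process reader for a few Kafka producer JMX metrics. */
final class ProducerMetricsProbe {

    static final String[] ATTRIBUTES = {"record-error-rate", "request-latency-avg", "bufferpool-wait-ratio"};

    /** Returns "client-id/attribute" -> value for every producer registered in this JVM. */
    static Map<String, Object> snapshot() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new LinkedHashMap<>();
        try {
            // client-id=* is a property-value pattern matching every producer instance.
            ObjectName pattern = new ObjectName("kafka.producer:type=producer-metrics,client-id=*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                for (String attribute : ATTRIBUTES) {
                    values.put(name.getKeyProperty("client-id") + "/" + attribute,
                            server.getAttribute(name, attribute));
                }
            }
        } catch (JMException e) {
            throw new IllegalStateException("Could not read producer metrics", e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Feed these values into your alerting; the map is empty when no producer runs in this JVM.
        snapshot().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```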
Related Errors
- NetworkException: transient connection drops to the leader often show up as repeated retries that ultimately roll up into this delivery TimeoutException.
- NotEnoughReplicasException: when acks=all and the ISR falls below min.insync.replicas, produce requests are rejected and retried until the delivery budget runs out, so you may see both errors together.
- RecordTooLargeException: a different failure mode (rejected immediately, not on a timer), but worth ruling out when only some records fail.
Frequently Asked Questions
Does increasing request.timeout.ms fix this?
Only partially. request.timeout.ms caps a single in-flight produce request; delivery.timeout.ms caps the whole lifecycle including queue time and retries. If batches expire while waiting for a leader, raising request.timeout.ms alone changes nothing — raise the delivery budget instead.
Why do I see “since batch creation” when the record never left my app?
The clock starts when the batch is created in the accumulator, not when it is sent. A record can expire purely client-side if there is no available leader to send it to, which is exactly why this error often points to broker or metadata problems.
Can linger.ms cause timeouts?
Indirectly. A large linger.ms makes batches wait longer before sending, consuming part of the delivery budget. Under low throughput with a tight delivery.timeout.ms, that wait can be the difference between success and expiration.
Will retries make this worse?
No — retries are bounded by delivery.timeout.ms, not by the retries count. With idempotence enabled, generous retries are safe and are usually the right way to survive a slow or briefly unavailable broker.
Is this error retriable from my code?
This specific expiration is terminal for those records: they have been dropped from the buffer and the client will not retry them. If they matter, you must call send() for them again yourself. Design your callback to re-enqueue or dead-letter expired records rather than assuming the client will retry them.
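One way to keep that callback logic testable is to separate the routing decision from the Kafka types. This is a sketch under assumptions:
- ExpiredRecordRouter, its resend limit and the in-memory queues are all hypothetical.
- In a real producer you would pass ex -> ex instanceof org.apache.kafka.common.errors.TimeoutException as the predicate.
- The Java client runs callbacks on its background I/O thread, so the sketch only records the decision. The actual re-send is left to a thread you control.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

/** Hypothetical router: decides whether a failed record is re-sent later or dead-lettered. */
final class ExpiredRecordRouter<R> {

    /** A record plus the number of times it has already been re-sent. */
    static final class Pending<T> {
        final T record;
        final int resends;

        Pending(T record, int resends) {
            this.record = record;
            this.resends = resends;
        }
    }

    private final Predicate<Exception> isExpiry; // e.g. ex -> ex instanceof org.apache.kafka.common.errors.TimeoutException
    private final int maxResends;
    final BlockingQueue<Pending<R>> retryQueue = new LinkedBlockingQueue<>(); // drained by your own thread
    final List<R> deadLetters = Collections.synchronizedList(new ArrayList<>());

    ExpiredRecordRouter(Predicate<Exception> isExpiry, int maxResends) {
        this.isExpiry = isExpiry;
        this.maxResends = maxResends;
    }

    /** Call from the send() callback; a null exception means the record was acknowledged. */
    void onCompletion(Pending<R> pending, Exception ex) {
        if (ex == null) {
            return;
        }
        if (isExpiry.test(ex) && pending.resends < maxResends) {
            retryQueue.add(new Pending<>(pending.record, pending.resends + 1)); // unbounded, never blocks
        } else {
            deadLetters.add(pending.record); // other error, or resend limit reached
        }
    }
}
```

Wiring it up could follow this pattern:
- On each send, wrap the record and call producer.send(record, (md, ex) -> router.onCompletion(pending, ex)).
- Run a separate loop that takes entries from retryQueue and calls send() for them again.
- Forward anything in deadLetters to your dead-letter destination.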