Kafka Troubleshooting Toolkit
Diagnose broker and controller issues, under-replicated partitions, consumer lag, rebalances, and retention with prompts and Kafka runbooks.
Top Kafka errors
Start with the most common production issues and troubleshooting paths.
AI-Assisted Kafka Troubleshooting Explained
How AI-assisted Kafka troubleshooting works — diagnosing broker faults, consumer lag, rebalance storms, and ISR shrink faster…
Debugging Kafka Consumer Lag with AI
Measure Kafka consumer lag correctly, find the real root cause with AI-assisted analysis, and apply durable fixes — from poison…
AuthorizationException
Fix Kafka AuthorizationException: diagnose missing ACLs, wrong principal mapping, allow.everyone.if.no.acl.found, super.users…
java.io.IOException: Broken pipe
Fix Kafka 'Broken pipe' — diagnose writes to closed sockets, idle-timeout disconnects, oversized requests, and broker-side conn…
Broker may not be available
Fix Kafka 'Connection to node 1 could not be established. Broker may not be available': diagnose down brokers, wrong bootstrap…
BrokerEndPointNotAvailableException
Fix Kafka BrokerEndPointNotAvailableException: a listener or security protocol has no advertised endpoint. Fix listeners, adver…
CertificateExpiredException
Fix Kafka CertificateExpiredException: diagnose expired broker or client certs, expired CA roots, clock skew, and short-lived c…
ClusterAuthorizationException
Fix Kafka ClusterAuthorizationException: diagnose missing CLUSTER ACLs, idempotent producer IdempotentWrite, transactional IDs…
Best Kafka prompts
Use these prompts to turn symptoms, logs, and config into a structured troubleshooting plan.
Kafka Cluster Sizing & Capacity Planning
Size a Kafka cluster end to end — broker count, partition counts, retention, disk, memory, and network — for a target throughput, with headroom for spikes and broker failure.
Kafka Consumer Lag Investigation
Investigate and reduce growing consumer lag by isolating the root cause — slow processing, partition skew, GC pauses, or broker-side bottlenecks — then prescribe targeted fixes.
Kafka Consumer Rebalance Storm Triage
Diagnose frequent or looping consumer-group rebalances by working through session, heartbeat, and poll timeouts, static membership, and the rebalance protocol in use.
Kafka Exactly-Once Semantics Design
Design exactly-once processing across a produce-process-consume pipeline using the idempotent producer and transactions, with honest guidance on where EOS holds and where it does not.
Free Kafka tools
Validate, troubleshoot, or analyze your configuration before production changes.
AI Incident Response Assistant
Paste broker logs and consumer-group state, get a triage plan.
YAML validator
Validate Connect configs and Kubernetes/Strimzi manifests before applying.
Kafka runbook
Use a repeatable checklist for production troubleshooting.
A checklist for brokers, partitions, and consumers that fall behind or drop out.
1. Check broker and controller status
2. Inspect partitions and ISR (under-replicated / offline)
3. Review consumer-group lag and recent rebalances
4. Check KRaft / ZooKeeper quorum and connectivity
5. Validate retention, throughput, and disk headroom
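The partition and lag steps of the checklist can be sketched in a few lines of Python. This is a minimal illustration, not part of the toolkit itself: the `PartitionState` record, the function names, and the convention that a leader id of -1 means "no leader" are all assumptions made for the example. It assumes partition metadata and offsets have already been collected (for instance from describe output) into plain Python structures.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition number)


class PartitionState(NamedTuple):
    """Hypothetical snapshot of one partition's replication state."""
    topic: str
    partition: int
    replicas: List[int]  # broker ids assigned to this partition
    isr: List[int]       # broker ids currently in the in-sync replica set
    leader: int          # assumed convention: -1 means no leader


def under_replicated(parts: List[PartitionState]) -> List[PartitionState]:
    # A partition is under-replicated when its ISR is smaller than
    # its assigned replica set.
    return [p for p in parts if len(p.isr) < len(p.replicas)]


def offline(parts: List[PartitionState]) -> List[PartitionState]:
    # A partition with no leader cannot serve reads or writes.
    return [p for p in parts if p.leader < 0]


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, Optional[int]]:
    # Lag per partition = log end offset - committed offset.
    # A partition with no committed offset is reported as None
    # (unknown) rather than guessed.
    lag: Dict[TopicPartition, Optional[int]] = {}
    for tp, end in end_offsets.items():
        offset = committed.get(tp)
        lag[tp] = None if offset is None else max(end - offset, 0)
    return lag


if __name__ == "__main__":
    parts = [
        PartitionState("orders", 0, [1, 2, 3], [1, 2, 3], 1),
        PartitionState("orders", 1, [1, 2, 3], [2], 2),
    ]
    for p in under_replicated(parts):
        print(f"under-replicated: {p.topic}-{p.partition} isr={p.isr}")
    print(consumer_lag({("orders", 0): 100}, {("orders", 0): 90}))
```

Reporting unknown lag as `None` instead of zero keeps a group that has never committed from looking healthy. Comparing lag across several samples, rather than reading one snapshot, is what shows whether consumers are actually falling behind.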