Kafka Error Guide: 'Controller mutation rate quota exceeded' Throttled Topic Ops
Fix Kafka controller mutation rate quota errors: understand CONTROLLER_MUTATION quotas, throttled topic create/delete/partition ops, and how to size the limit safely.
- #kafka
- #troubleshooting
- #errors
- #controller
Exact Error Message
A client that creates topics or partitions in bulk gets throttled, then rejected, by the controller mutation quota. Client side:
org.apache.kafka.common.errors.ThrottlingQuotaExceededException: The controller mutation rate
quota has been exceeded. Retry after 4213 ms (CONTROLLER_MUTATION).
On the controller in server.log:
[2026-06-28 16:44:09,772] INFO [Controller id=3] Throttling CreateTopics request from clientId=bulk-loader,
principal=User:etl: controller mutation rate quota exceeded (accumulated 612 partition mutations,
limit 50/s), throttle time 4213 ms (kafka.server.ControllerMutationQuotaManager)
[2026-06-28 16:44:14,005] WARN [Controller id=3] Controller mutation failed for CreateTopics request:
quota violation, request rejected after retries (kafka.controller.KafkaController)
What the Error Means
Kafka 2.6+ added a controller mutation quota (CONTROLLER_MUTATION) to protect the controller from being overwhelmed by metadata-changing operations. A “mutation” is counted in partitions touched by topic create, topic delete, and create-partitions requests — not the number of API calls. The quota is a rate, measured in accepted partition mutations per second, applied per user/client-id principal.
When a client exceeds its configured rate, the controller does not immediately fail. It first throttles: it accepts what fits within the rate and returns a throttle_time_ms telling the client to back off. Well-behaved clients honor it and the operation eventually succeeds. The error surfaces as ThrottlingQuotaExceededException / “controller mutation rate quota exceeded” when the request would exceed the quota and the client isn’t waiting long enough (or has retries disabled), so the controller rejects it. “Controller mutation failed” is the controller-side log of that rejection. Crucially this is a back-pressure / capacity-protection mechanism, not a controller fault or outage. The controller is healthy; it is deliberately rate-limiting an aggressive metadata workload — typically a bulk topic loader, an automated provisioning job, or a misbehaving CI pipeline creating thousands of partitions at once.
Common Causes
- Bulk topic/partition creation by an ETL job, test harness, or provisioning script that creates many topics (or high-partition topics) in a tight loop.
- A low
controller_mutation_ratequota configured (cluster default or per-principal) relative to legitimate workload. - Clients with retries disabled or short timeouts that don’t honor
throttle_time_ms, turning a throttle into a hard failure. - Repeated create-partitions calls that each touch many partitions, accumulating mutations quickly.
- A runaway/looping client (bug or misconfiguration) hammering CreateTopics.
- Mass topic deletion (e.g., cleanup jobs) which also counts as mutations.
How to Reproduce the Error
On a test cluster:
- Set a low quota for a test principal so it’s easy to trip:
kafka-configs.sh --bootstrap-server localhost:9092 --alter --add-config 'controller_mutation_rate=10' --entity-type users --entity-name etl(this is a config change — run only in a throwaway cluster). - As that principal, rapidly create many topics: a loop calling
kafka-topics.sh --create --topic t-$i --partitions 20 --replication-factor 1. - The client logs
ThrottlingQuotaExceededException ... CONTROLLER_MUTATION, and the controller logs throttling and, if retries are off, “controller mutation failed”.
Throttling is expected behavior; the point is to observe the quota engaging.
Diagnostic Commands
All read-only.
Confirm the controller is healthy (this is back-pressure, not a fault):
kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status
View the configured controller mutation quotas (read-only --describe):
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type users --entity-default
kafka-configs.sh --bootstrap-server localhost:9092 --describe --entity-type users --entity-name etl
Quota configs for user-principal 'etl' are controller_mutation_rate=50.0
Check broker reachability/version (the quota exists in 2.6+):
kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | head -3
Identify who is being throttled and how often:
grep -E "controller mutation rate quota exceeded|Throttling .*Topics|controller mutation failed" \
/var/log/kafka/server.log | tail -40
journalctl -u kafka --since "1 hour ago" | grep -iE "mutation|throttl|quota"
This reveals the offending clientId/principal and the accumulated mutation count vs the limit.
Step-by-Step Resolution
- Confirm it’s a quota, not a controller failure.
describe --statusshows a healthy leader; the logs say “mutation rate quota exceeded”. The control plane is fine. - Identify the offender from the throttling log lines — the
clientIdandprincipalcreating the mutation storm. - Fix the client first (preferred). Have the bulk job: create topics in batches, sleep between calls, and enable retries so it honors
throttle_time_ms. A client that respects throttling completes successfully without raising the error. - Right-size the quota if the workload is legitimate. Inspect the current value with
kafka-configs.sh --describe, then raisecontroller_mutation_ratefor that principal (or the default) to a level the controller can sustain. Do this deliberately — the quota exists to protect the controller. - For a runaway client, throttle or pause it; the quota is doing its job by shielding other tenants.
- Validate: rerun the workload with batching/retries (and the adjusted quota if changed) and confirm
ThrottlingQuotaExceededExceptionno longer escalates to a hard failure.
Prevention and Best Practices
- Make bulk topic/partition operations idempotent and retry-aware so they honor
throttle_time_msinstead of failing — disabling AdminClient retries is the most common cause of the hard error. - Provision topics in batches with pacing rather than thousands at once; spread large migrations over time.
- Set a sane cluster-default
controller_mutation_rateand grant higher per-principal quotas only to trusted provisioning identities. - Use distinct
client.id/principals for bulk tools so their quota (and throttling) is isolated from application traffic. - Monitor controller mutation throttling metrics and alert on sustained throttling, which signals either an undersized quota or a misbehaving client.
- Prefer fewer, higher-partition topics over churning many topics, and avoid create/delete loops in tests. See the Kafka guides for related capacity topics.
Related Errors
Controller not available— a genuine control-plane outage, unlike this deliberate throttle.org.apache.kafka.common.errors.ThrottlingQuotaExceededException— the umbrella exception; theCONTROLLER_MUTATIONvariant is this guide.- Producer/consumer byte-rate quota throttling (
QuotaViolationException) — a different quota dimension (data rate, not mutations). TopicExistsException— bulk creators that don’t check existence can compound mutation pressure.
Frequently Asked Questions
Is my controller broken?
No. The controller mutation quota is intentional back-pressure to protect the controller from metadata storms. describe --status will show a healthy leader. The fix is on the client or the quota size, not the controller.
What exactly counts as a “mutation”? Partitions touched by create-topic, delete-topic, and create-partitions requests. A single CreateTopics call for one 50-partition topic counts as 50 mutations, not 1.
Why does it sometimes succeed and sometimes fail?
If the client honors the returned throttle_time_ms and retries, the throttled operation eventually completes. If retries are disabled or time out, the throttle escalates to ThrottlingQuotaExceededException.
How do I see the current limit?
kafka-configs.sh --describe --entity-type users --entity-name <principal> (and --entity-default) shows controller_mutation_rate. The log lines also print the effective limit and accumulated mutations.
Should I just raise the quota to a huge number? Only after confirming the controller can sustain it and the workload is legitimate. The quota protects all tenants; an unbounded value reopens the controller to overload. Fix client pacing first, then size the quota to real need.
Download the Free 500-Prompt DevOps AI Toolkit
500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.
- 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
- Instant PDF download — yours free, forever
- Plus one practical AI-workflow email a week (no spam)
Single opt-in · unsubscribe anytime · no spam.