Monitoring Kafka with Prometheus and AI

Kafka exposes hundreds of JMX metrics, and most monitoring setups either scrape almost none of them or scrape all of them and drown. Effective Kafka monitoring is about picking the handful of signals that actually predict trouble — under-replicated partitions, consumer lag, request latency — wiring them into Prometheus through the JMX exporter, and writing alert rules that fire on real degradation rather than noise. Once that foundation is solid, AI-assisted triage becomes genuinely useful: instead of an engineer pattern-matching across twelve Grafana panels at 2 AM, an AI layer can correlate the alert with the surrounding metrics and propose a probable cause. This guide covers the full path, from exporter config to AI triage, on Kafka 3.x.

Exposing Kafka metrics with the JMX exporter

Kafka is a JVM application and publishes its internal state over JMX. Prometheus does not speak JMX, so you bridge the two with the Prometheus JMX Exporter, run as a Java agent inside the broker process. Running it as an agent rather than a standalone process is the recommended approach because it avoids the overhead and latency of remote JMX.

Attach the agent through KAFKA_OPTS before starting the broker:

export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka.yml"

The exporter needs a config that maps the verbose JMX MBean names into clean Prometheus metric names. A minimal but useful kafka.yml:

lowercaseOutputName: true
rules:
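  # Under-replicated partitions (reported by every broker)
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: "kafka_server_replicamanager_underreplicatedpartitions"
    type: GAUGE
  # Offline partitions: this MBean lives under kafka.controller, not ReplicaManager.
  # The metric name is kept as-is so it matches the alert rules later in this guide.
  - pattern: "kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value"
    name: "kafka_server_replicamanager_offlinepartitionscount"
    type: GAUGE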
  # Request latency by request type
  - pattern: "kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\\w+)><>(\\d+)thPercentile"
    name: "kafka_network_request_total_time_ms"
    labels:
      request: "$1"
      quantile: "0.$2"
    type: GAUGE
  # Broker throughput
  - pattern: "kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>OneMinuteRate"
    name: "kafka_server_brokertopicmetrics_$1"
    type: GAUGE
  # Active controller (should sum to exactly 1 across the cluster)
  - pattern: "kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value"
    name: "kafka_controller_active_count"
    type: GAUGE
  # Request handler idle ratio, needed by the saturation alert (a meter; OneMinuteRate is the smoothed value)
  - pattern: "kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate"
    name: "kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent"
    type: GAUGE

Then point Prometheus at the exporter endpoint on each broker:

scrape_configs:
  - job_name: "kafka"
    static_configs:
      - targets:
          - "broker0.kafka.internal:7071"
          - "broker1.kafka.internal:7071"
          - "broker2.kafka.internal:7071"
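
Before writing alerts, it is worth confirming that each broker endpoint actually exposes the series you expect. The following is a minimal Python sketch, assuming the exporter serves the standard Prometheus text format over HTTP on the port configured above. The function names and the EXPECTED set are illustrative and not part of any tool.

```python
import re
from urllib.request import urlopen

# Metric names the exporter rules in this guide should produce (lowercased).
EXPECTED = {
    "kafka_server_replicamanager_underreplicatedpartitions",
    "kafka_controller_active_count",
    "kafka_server_brokertopicmetrics_bytesinpersec",
}

# A sample line looks like `name{label="x"} 1.0` or `name 1.0`.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+\S+")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = _SAMPLE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Expected metric names that the endpoint did not expose."""
    return set(expected) - metric_names(exposition_text)


def check_broker(url, timeout=5.0):
    """Fetch one broker's metrics page and report which expected names are absent."""
    with urlopen(url, timeout=timeout) as response:
        return missing_metrics(response.read().decode("utf-8"))
```

Calling check_broker with a URL such as http://broker0.kafka.internal:7071/metrics (the path is an assumption) should return an empty set once the rules match. A non-empty result usually points at a pattern that does not match the MBean name.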

Pro Tip: Keep your JMX exporter rules list short and explicit. A catch-all pattern: ".*" exports every MBean attribute, including per-topic and per-partition series. That can mean thousands of time series per broker, real CPU load at scrape time, and a cardinality explosion in Prometheus. Whitelist the metrics you alert on, not everything the JVM emits.

The metrics that actually matter

You do not need hundreds of metrics. You need a focused set covering broker health, replication, throughput, and the client side. Here are the ones that earn their place on a dashboard.

| Metric | What it tells you | Healthy value |
| --- | --- | --- |
| `UnderReplicatedPartitions` | Partitions missing in-sync replicas | 0 |
| `OfflinePartitionsCount` | Partitions with no leader (data unavailable) | 0 |
| `ActiveControllerCount` (cluster sum) | Exactly one broker is controller | 1 |
| `RequestHandlerAvgIdlePercent` | Spare capacity in request handler threads | > 0.3 |
| `TotalTimeMs` (Produce/Fetch p99) | End-to-end request latency | Stable, no spikes |
| `BytesInPerSec` / `BytesOutPerSec` | Throughput, for capacity trends | Within plan |
| Consumer lag (per group) | Records behind the log end | Bounded, not growing |

A few of these deserve emphasis:

Under-replicated partitions is the single most important broker health signal. A sustained non-zero value means a replica has fallen out of the ISR, and you are one broker failure away from data loss or an offline partition.
Offline partitions is an active incident. Those partitions are unavailable to producers and consumers right now.
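Active controller count should sum to exactly 1 across the cluster. Zero means no controller, which is a serious problem. More than one means a split brain, which should not happen but is worth alerting on. On KRaft clusters with dedicated controller nodes, this metric is reported by the controllers, so attach the exporter to them as well as to the brokers.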
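Consumer lag is the metric your application owners care about most, and it does not come from broker JMX. You expose it with a dedicated exporter such as Kafka Lag Exporter or kafka_exporter, which reports the difference between the log end offset and the committed offset per partition. The kafka-consumer-groups CLI shows the same numbers for ad hoc checks but does not export them to Prometheus. Metric and label names differ between exporters: the kafka_consumergroup_lag metric and consumergroup label used in this guide's alert rules follow kafka_exporter's naming, so adjust them if you run a different tool.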
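
The lag definition above (log end offset minus committed offset, per partition) is simple enough to sketch directly. This is an illustration of the arithmetic only, not how any particular exporter is implemented. The function names and the dict-based inputs are assumptions for the example.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset.

    Both arguments map partition id -> offset. A partition with no
    committed offset gets None, because its lag is unknown rather than zero.
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            lag[partition] = None
        else:
            # Clamp at zero: offsets are read at slightly different instants,
            # so a committed offset can briefly appear ahead of the log end.
            lag[partition] = max(end_offset - committed, 0)
    return lag


def group_lag(lag_by_partition):
    """Total lag for a consumer group, ignoring partitions with unknown lag."""
    return sum(v for v in lag_by_partition.values() if v is not None)
```

For example, end offsets of {0: 100, 1: 50} against committed offsets of {0: 90, 1: 50} give a lag of 10 on partition 0, 0 on partition 1, and a group total of 10.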

Pro Tip: Request handler idle percent is an underrated early warning. When RequestHandlerAvgIdlePercent trends from 0.8 toward 0.1, the broker is running out of threads to service requests, and latency spikes are about to follow. It often moves before user-facing latency does, giving you lead time to scale or rebalance.

Writing alert rules that catch real problems

Good Kafka alerts are specific, use sensible `for` durations to suppress flapping, and map to a clear human action. Here is a working set of Prometheus alerting rules.

groups:
  - name: kafka.rules
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"
          description: "{{ $value }} partitions under-replicated for 5m."

      - alert: KafkaOfflinePartitions
        expr: kafka_server_replicamanager_offlinepartitionscount > 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Offline partitions in the cluster"

      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_active_count) != 1
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Cluster controller count is {{ $value }}, expected 1"

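      # Threshold on total lag per group and topic; the raw series is per partition.
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 100000
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lag is {{ $value }}"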

      - alert: KafkaRequestHandlerSaturation
        expr: kafka_server_kafkarequesthandlerpool_requesthandleravgidlepercent < 0.2
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Request handler idle below 20% on {{ $labels.instance }}"

Some principles behind these:

Match the `for` duration to urgency. Offline partitions get a 1-minute `for` because data is unavailable now. Under-replicated partitions get 5 minutes because a brief blip during a rolling restart is normal and you do not want to page on it.
Alert on absolute consumer lag with care. A fixed threshold is a starting point, but a better long-term signal is lag rate of change or estimated time-to-drain, since 100k lag on a high-throughput topic may be seconds of delay while on a slow topic it is hours.
Avoid alerting on raw throughput. BytesInPerSec dropping to zero might be an incident or might be a quiet Sunday. Alert on the symptoms (lag, latency, replication), not the volume.
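
The point about absolute lag can be made concrete with a little arithmetic. The sketch below converts lag into an approximate delay and estimates time-to-drain from a short series of lag samples. The function names, the sample format of (unix seconds, lag) pairs, and the example rates are all illustrative assumptions.

```python
import math


def lag_delay_seconds(lag, consume_rate):
    """Approximate delay represented by `lag` records at `consume_rate` records/s."""
    if consume_rate <= 0:
        return math.inf
    return lag / consume_rate


def lag_slope(samples):
    """Least-squares slope of lag over time, in records per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    return sum((t - mean_t) * (lag - mean_lag) for t, lag in samples) / denom


def time_to_drain_seconds(samples):
    """Estimated seconds until lag reaches zero; inf if lag is flat or growing."""
    current = samples[-1][1]
    if current <= 0:
        return 0.0
    slope = lag_slope(samples)
    if slope >= 0:
        return math.inf
    return current / -slope
```

With these made-up rates, 100,000 records of lag is 2 seconds of delay at 50,000 records/s but 10,000 seconds (almost three hours) at 10 records/s, which is why a single fixed threshold fits neither topic well.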

Adding AI-assisted triage

Once the metrics and alerts exist, the bottleneck shifts from detection to diagnosis. An alert tells you what fired; it rarely tells you why. This is where an AI layer adds value, and it works best with the same discipline that makes AI useful elsewhere: feed it focused, structured evidence rather than a raw metric firehose.

A practical pattern looks like this:

1. Alert fires in Alertmanager and triggers a webhook to a triage service.
2. The service gathers context deterministically: it queries Prometheus for the firing metric plus a curated set of correlated metrics over the last 30 minutes, pulls recent broker log lines, and notes any in-flight operations like a partition reassignment.
3. It sends a compact, structured summary to the AI model: the alert, the metric trends as small numeric series, and the log excerpts. It does not paste raw scrape dumps.
4. The AI returns a ranked hypothesis, for example “Under-replicated partitions rose at 02:14 on broker2 coincident with a disk I/O latency spike; probable cause is a slow disk on broker2, not a network partition”, plus the specific commands to confirm it.
5. A human validates before any remediation. The AI narrows the search space; it does not own the keyboard.
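
The context-gathering and summarizing steps could look roughly like the sketch below. It is one possible shape, not a reference implementation: query_range is a hypothetical callable you would back with the Prometheus HTTP API, and the alert dict layout and function names are assumptions for the example.

```python
import json


def downsample(series, max_points):
    """Keep at most max_points evenly spaced samples, including first and last."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    if len(series) <= max_points:
        return list(series)
    step = (len(series) - 1) / (max_points - 1)
    return [series[round(i * step)] for i in range(max_points)]


def build_triage_summary(alert, query_range, correlated_exprs, log_lines,
                         max_points=10, max_log_lines=20):
    """Build a compact, JSON-serializable summary for the model.

    alert            -- dict with "name", "expr" and optionally "instance"
    query_range      -- hypothetical callable: expr -> [(unix_ts, value), ...]
                        covering the lookback window (for example 30 minutes)
    correlated_exprs -- curated expressions to include next to the firing one
    log_lines        -- recent broker log lines, oldest first
    """
    metrics = {}
    for expr in [alert["expr"], *correlated_exprs]:
        series = downsample(query_range(expr), max_points)
        # Small numeric series, rounded, instead of a raw scrape dump.
        metrics[expr] = [[ts, round(value, 3)] for ts, value in series]
    return {
        "alert": {"name": alert["name"], "instance": alert.get("instance")},
        "metrics": metrics,
        "logs": list(log_lines)[-max_log_lines:],  # keep only the newest lines
    }


def to_prompt_payload(summary):
    """Serialize the summary compactly for inclusion in a model request."""
    return json.dumps(summary, separators=(",", ":"))
```

Injecting query_range keeps the collection step deterministic and easy to test with canned data, and the caps on points and log lines bound what is sent to the model.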

The deterministic-collection-then-AI-interpretation split is the same architecture that makes AI-assisted troubleshooting reliable in Kubernetes. The AI is an interpreter of evidence, not a replacement for it.

Pro Tip: The highest-leverage thing you can give an AI triage layer is correlation, not more metrics. When under-replicated partitions spike, the useful context is what else moved at the same timestamp — disk latency, network errors, GC pauses, a reassignment that just started. Pre-compute those correlations in your triage service so the AI reasons over a tight, relevant slice instead of guessing.
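
One simple way to pre-compute "what else moved" is to score each candidate metric by how far its recent values shifted from its own baseline, then pass only the top movers to the model. The sketch below uses a basic mean-shift heuristic as an assumption for illustration; the function names, the window size, and the scoring formula are not from any library and a production service might prefer something more robust.

```python
from statistics import mean, pstdev


def shift_score(series, alert_ts, window=300.0, min_spread=1e-6):
    """How far a metric moved just before the alert, in baseline standard deviations.

    series   -- [(unix_ts, value), ...]
    alert_ts -- time the alert fired
    window   -- seconds before alert_ts treated as "recent"; older samples are baseline
    """
    baseline = [v for ts, v in series if ts < alert_ts - window]
    recent = [v for ts, v in series if alert_ts - window <= ts <= alert_ts]
    if not baseline or not recent:
        return 0.0
    # Floor the spread so a perfectly flat baseline does not divide by zero.
    spread = max(pstdev(baseline), min_spread)
    return abs(mean(recent) - mean(baseline)) / spread


def rank_movers(series_by_name, alert_ts, window=300.0, top_n=3):
    """Return the top_n (name, score) pairs, largest shift first."""
    scores = {
        name: shift_score(series, alert_ts, window)
        for name, series in series_by_name.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Feeding the model only the top few movers, with their scores, gives it the tight slice described above rather than every series the triage service collected.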

Keeping AI triage honest

Three guardrails keep an AI triage layer trustworthy:

Fail gracefully. If the AI backend is unreachable, the alert and the raw metric context must still reach the on-call engineer. AI unavailability can never block detection.
Constrain the context. Mask anything sensitive in log lines before they leave your environment, and cap the amount of data sent to control both cost and hallucination risk.
Keep humans on remediation. Use AI to diagnose and suggest. Anything that restarts a broker, triggers a reassignment, or changes config goes through a human, ideally with a confidence threshold below which the system simply pages rather than suggesting.
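
The three guardrails fit in a small wrapper around the model call. The sketch below is a minimal illustration under stated assumptions: ask_model and notify are hypothetical callables you supply, the model is assumed to return a dict with "hypothesis" and "confidence" keys, and the masking patterns are examples rather than a complete redaction policy.

```python
import re

# Illustrative patterns only; extend them to match your own log formats.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "<email>"),
    (re.compile(r"(?i)\b(password|secret|token|api[_-]?key)\s*[=:]\s*\S+"),
     r"\1=<redacted>"),
]


def mask_sensitive(line):
    """Replace IPs, email addresses, and key=value secrets with placeholders."""
    for pattern, replacement in _MASKS:
        line = pattern.sub(replacement, line)
    return line


def triage_alert(context, ask_model, notify, min_confidence=0.6, max_log_lines=20):
    """Send the alert to on-call, enriched with a hypothesis only when it is safe to.

    ask_model -- hypothetical callable: context -> {"hypothesis": str, "confidence": float}
    notify    -- hypothetical callable that delivers the message to the on-call engineer
    """
    # Constrain the context: mask log lines and cap how many are sent.
    logs = [mask_sensitive(line) for line in context.get("logs", [])][-max_log_lines:]
    safe_context = dict(context, logs=logs)
    message = {"context": safe_context, "hypothesis": None}
    try:
        result = ask_model(safe_context)
        # Below the threshold the system simply pages without a suggestion.
        if result.get("confidence", 0.0) >= min_confidence:
            message["hypothesis"] = result["hypothesis"]
    except Exception:
        # Fail gracefully: a model outage must never block the page.
        pass
    notify(message)  # the page goes out either way; remediation stays with the human
    return message
```

Note that the function only ever notifies. Nothing in it restarts a broker or changes config, which keeps remediation a human decision.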

A monitoring stack that holds up

The shape of a Kafka monitoring stack that survives contact with production is consistent: JMX exporter as a Java agent on every broker, a short whitelist of metrics that map to real failure modes, Prometheus scraping with sensible retention, Grafana for the human view, Alertmanager routing severity-tiered alerts, and a thin AI triage service that enriches alerts with correlated context before a human looks at them.

Notice what is not in that list: scraping every JMX bean, alerting on raw throughput, or letting an AI auto-remediate. Each of those is a way to generate noise or risk without adding signal. The discipline is the same one good observability has always required — measure the things that predict failure, alert on the things that demand action, and use automation to accelerate human judgment rather than replace it.

Start with the four critical broker metrics (under-replicated partitions, offline partitions, active controller count, and request handler idle percent) and consumer lag. Get those alerting cleanly with no flapping. Then layer Grafana dashboards and AI triage on top. A monitoring setup built that way will tell you about a failing disk before it becomes an offline partition, and it will tell you in plain language why.

— James