Debugging RabbitMQ Connection and Channel Leaks With AI

The alert was “RabbitMQ refusing new connections,” and it fired at 3 a.m. on a Saturday with no deploy in sight. The broker had hit its connection limit. Nothing had changed except time — a service had been leaking connections for days, opening a fresh one per request and never closing it, and it finally crossed the line. Connection and channel leaks are the slow puncture of RabbitMQ operations: invisible for a long while, then suddenly everything that needs a new connection fails at once.

These are great problems to debug with AI, because the evidence is structured — rabbitmqctl gives you per-connection and per-channel detail with client properties — and the work is pattern-matching a leaking client out of a list of hundreds. The model is fast at that triage and good at translating a leak’s signature into “here’s the bug in your client code.” It still can’t deploy the fix or watch the count flatten afterward, which is the part that proves you actually solved it.

Confirm it’s a leak, not just load

A leak grows monotonically and never recovers. Real load goes up and down. The first job is telling them apart, and that means looking at the trend and the per-source breakdown together.

# Total connections and channels right now
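# -q and --no-table-headers keep banner and header lines out of the count
rabbitmqctl -q list_connections --no-table-headers | wc -l
rabbitmqctl -q list_channels --no-table-headers | wc -l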

# The limits you're heading toward
rabbitmqctl environment | grep -iE "connection_max|channel_max"

# Which client IPs / users hold the most connections?
# (group on host and user only; adding the channels column would split the counts)
rabbitmqctl -q list_connections peer_host user --no-table-headers | sort | uniq -c | sort -rn | head

That last command does most of the work. A leak almost always concentrates in one client host or user. If one IP holds 800 connections and the rest hold a handful each, you’ve found your suspect without reading a single line of application code.
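The trend half of that check can be scripted too. Here is a minimal sketch, assuming you have been appending one "epoch-seconds count" sample per line to a file; the function name trend_verdict and the sample file are my own inventions, not anything RabbitMQ ships. It is a heuristic only: a window shorter than one traffic cycle can make ordinary load look like a leak.

```shell
# Sketch: decide whether a series of connection counts looks like a leak.
# stdin: one sample per line, "<epoch_seconds> <count>", oldest first.
# A sample file could be built by a (hypothetical) cron job such as:
#   echo "$(date +%s) $(rabbitmqctl -q list_connections --no-table-headers | wc -l)" >> samples.txt
trend_verdict() {
  awk '
    NR > 1 && $2 < prev { drops++ }   # any recovery argues against a leak
    NR == 1 { first = $2 }
    { prev = $2; last = $2 }
    END {
      if (NR < 2)                          print "not enough samples"
      else if (drops == 0 && last > first) print "leak-like: never recovers"
      else                                 print "load-like: count recovers"
    }'
}
```

Run it as trend_verdict < samples.txt once the file covers at least a day of traffic.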

Hand the AI the connection signature

The client properties RabbitMQ records often identify the exact library and version, which tells the AI a lot about the likely bug. Pull the detail and let it pattern-match.

# Rich per-connection detail including channel count and client info
rabbitmqctl list_connections name peer_host user channels \
  client_properties state

RabbitMQ is near its connection limit. One client host holds 800+ connections, each with a single channel, all in running state. The client_properties show a Python pika client. The service handles roughly 800 requests/minute. What’s the most likely bug, and how do I confirm it from the broker side before I touch the code?

The model will land on the classic pika mistake almost immediately: opening a connection per request inside a handler instead of reusing a long-lived connection (or a pool) across requests. The fingerprint is the shape of the list: hundreds of single-channel connections from one host, none of them ever closed. A good answer also tells you how to confirm from the broker: check connected_at to see whether the oldest connections date back days rather than minutes, and whether the count tracks cumulative request volume rather than concurrent requests.

Channel leaks look different — and the AI should know that

Connection leaks and channel leaks have different signatures, and conflating them sends you to the wrong code. A channel leak shows up as a small number of connections each carrying a growing number of channels.

# Connections sorted by channel count — channel leaks float to the top
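# (connection names contain spaces, so split on tabs to reach the channels column)
rabbitmqctl list_connections name peer_host channels | sort -t "$(printf '\t')" -k3,3nr | head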

When I see ten connections each holding hundreds of channels, that’s a channel leak — code that opens a channel per operation and forgets to close it, often a try without a finally. Ask the AI to distinguish them explicitly: “Is this a connection leak or a channel leak, and what code pattern causes each?” The answer guides whether you hunt for unclosed connections or unclosed channels, which are different bugs in different places.
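The two signatures can be told apart mechanically before you ask the model anything. The sketch below assumes tab-separated "peer_host, channels" rows, which is what rabbitmqctl -q list_connections peer_host channels --no-table-headers should give you. The function name classify_leak and both thresholds are illustrative assumptions, not RabbitMQ defaults, so tune them to your own traffic.

```shell
# Sketch: label each client host by leak signature.
# stdin: "<peer_host>\t<channels>" per connection.
# $1: connection count that is suspicious on its own (default 100, an assumption)
# $2: average channels per connection that is suspicious (default 50, an assumption)
classify_leak() {
  awk -F '\t' -v min_conns="${1:-100}" -v min_chans="${2:-50}" '
    { conns[$1]++; chans[$1] += $2 }
    END {
      for (h in conns) {
        avg = chans[h] / conns[h]
        if (conns[h] >= min_conns && avg < 2) kind = "connection-leak?"
        else if (avg >= min_chans)            kind = "channel-leak?"
        else                                  kind = "ok"
        # columns: host, connections, total channels, verdict
        printf "%s\t%d\t%d\t%s\n", h, conns[h], chans[h], kind
      }
    }' | sort
}
```

The question marks in the output are deliberate: this narrows down where to look, and the code review still has to confirm it.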

Set the guardrail while you fix the root cause

The leak is a client bug, but the broker shouldn’t let one misbehaving service exhaust connections for everyone. Cap per-user connections and channels so the blast radius is contained, then go fix the client.

# Limit connections and channels per user as a safety net
rabbitmqctl set_user_limits payments_svc \
  '{"max-connections":50,"max-channels":200}'

# Verify it took
rabbitmqctl list_user_limits --user payments_svc

When the AI suggests only the client fix, I push it: “Give me a broker-side guardrail too, so the next leak can’t take down the whole cluster.” The user-limits cap is exactly that — it turns a cluster-wide outage into one service’s self-inflicted error.
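If more than one service needs a cap, I'd rather generate the commands and read them before anything touches the broker. This is a small sketch under that assumption: limit_commands is a hypothetical helper, and it only prints set_user_limits commands in the same form as the one above, without running them.

```shell
# Sketch: print (not run) one set_user_limits command per service user.
# stdin: "<user> <max_connections> <max_channels>" per line.
limit_commands() {
  while read -r user max_conn max_chan; do
    [ -n "$user" ] || continue   # skip blank lines
    printf "rabbitmqctl set_user_limits %s '{\"max-connections\":%s,\"max-channels\":%s}'\n" \
      "$user" "$max_conn" "$max_chan"
  done
}
```

Feed it a reviewed file of users and limits, check the output, and only then pipe it to sh.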

Where AI overreaches

It sometimes recommends raising connection_max to “make the alert go away.” That’s the watermark mistake in a different costume — you’re giving the leak more room to grow before it bites, not fixing it. Push back: “Is raising the limit masking a leak? How do I confirm the count flattens after the real fix?” The honest answer is that the only proof is watching the connection count stop climbing once the client is patched.

It can also misread connection churn (rapid open/close, which is wasteful but not a leak) as a leak. The tell is connection age: leaked connections get old, churned ones stay young. Ask it to check age, and validate on staging by watching whether stopping the service drops its connection count to zero.
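The age check is easy to run yourself instead of taking the model's word for it. A rough sketch, assuming the connected_at column comes back as a millisecond epoch timestamp (check a row from your own broker first); age_profile and the one-hour default threshold are my own choices for illustration.

```shell
# Sketch: split connections into old and young by age.
# stdin: "<peer_host>\t<connected_at_ms>" per connection
#        (assumes connected_at is milliseconds since the epoch).
# $1: current time in epoch seconds; $2: age threshold in seconds (default 3600).
age_profile() {
  awk -F '\t' -v now="$1" -v limit="${2:-3600}" '
    {
      age = now - $2 / 1000
      if (age >= limit) old++; else young++
      if (age > oldest) oldest = age
    }
    END { printf "old=%d young=%d oldest_age_s=%d\n", old, young, oldest }'
}
```

Something like rabbitmqctl -q list_connections peer_host connected_at --no-table-headers | age_profile "$(date +%s)" feeds it. A leak shows a growing old bucket with an oldest age that reaches back to when the leak started; churn shows almost everything in the young bucket.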

My loop

Pull the per-host and channels-per-connection breakdowns, hand the signature to the AI, and let it classify connection-leak vs channel-leak and name the client code pattern. Set the user-limits guardrail immediately, deploy the client fix, then watch list_connections | wc -l flatten over the next hour. The AI does the triage and pins the bug; the flattening count is the only proof I trust.

I keep these leak-triage prompts with my other prompts, and the RabbitMQ category has more on the connection-handling patterns — long-lived connections, channel-per-thread — that prevent these 3 a.m. pages in the first place.