Prometheus Error Guide: 'alertmanager failed to join cluster' Gossip Failure
Fix Alertmanager 'failed to join cluster': open port 9094 TCP+UDP, set --cluster.advertise-address, and stop duplicate notifications from a non-converged gossip cluster.
Exact Error Message
failed to join cluster is emitted by Alertmanager’s gossip/memberlist layer at startup when it cannot reach the peers passed via --cluster.peer:
level=warn ts=2026-06-27T09:14:02.118Z caller=cluster.go:267 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\t* Failed to resolve alertmanager-1:9094: lookup alertmanager-1 on 10.96.0.10:53: no such host\n\t* Failed to resolve alertmanager-2:9094: lookup alertmanager-2 on 10.96.0.10:53: no such host\n\t* Failed to join 10.0.1.6:9094: dial tcp 10.0.1.6:9094: i/o timeout"
Alertmanager does not exit on this. It keeps running as a single-node cluster and logs that gossip never converges:
level=info ts=2026-06-27T09:14:02.119Z caller=cluster.go:704 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2026-06-27T09:14:12.221Z caller=cluster.go:729 component=cluster msg="gossip not settled" polls=5 before=1 now=1 elapsed=10.002s
level=info ts=2026-06-27T09:14:22.330Z caller=cluster.go:721 component=cluster msg="gossip not settled but continuing anyway" polls=10 elapsed=20.21s
before=1 now=1 is the tell: each poll sees only one member (itself). A healthy cluster settles at the full peer count.
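To check this without reading every line, the member count can be pulled from the newest poll line. What follows is a minimal sketch, not part of Alertmanager: last_gossip_members is a hypothetical helper name, and it assumes the logfmt layout shown in the excerpts above (a component=cluster line carrying a now=<n> field).

```shell
# Hypothetical helper: print the member count from the most recent gossip
# poll line read on stdin. Assumes lines shaped like the excerpts above,
# i.e. "component=cluster ... now=<n>". Prints nothing if no such line exists.
last_gossip_members() {
  sed -n 's/.*component=cluster.*[[:space:]]now=\([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# Usage (systemd host, unit name as used elsewhere in this guide):
#   journalctl -u alertmanager --no-pager | last_gossip_members
```

A result of 1 on a multi-replica deployment means the instance only ever saw itself.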
What the Error Means
HA Alertmanager runs as a cluster of identical replicas that coordinate over a gossip protocol built on the memberlist library (the same library Consul and Serf use). The cluster exists so that every alert is notified once: replicas share notification logs and silences, stagger their sends by position in the cluster, and skip any notification a peer has already logged, so the same alert arriving at N Alertmanagers becomes one page.
That coordination happens on the cluster port, not the API port:
- --cluster.listen-address: where memberlist binds, default :9094 (uses both TCP and UDP).
- --cluster.peer: repeated once per other replica, e.g. --cluster.peer=alertmanager-1:9094.
- --cluster.advertise-address: the address this replica tells peers to reach it on. Critical behind NAT/Kubernetes.
When a replica logs failed to join cluster and gossip never settles past now=1, it is operating alone. Every replica then processes the same alerts independently and sends its own notification, so you get duplicate pages/emails/Slack messages (one per replica), and silences created on one instance do not suppress alerts on the others. The receivers themselves work fine; the de-duplication layer is broken.
Common Causes
- Port 9094 blocked. A firewall, cloud security group, or Kubernetes NetworkPolicy drops traffic between replicas. Because the API on 9093 is usually open, the service looks healthy while gossip silently fails.
- UDP blocked, TCP allowed. memberlist needs both TCP and UDP on 9094. A rule that opens only TCP lets the cluster half-form and flap.
- Wrong / unresolvable peer hostnames. --cluster.peer points at names that do not resolve (headless service not ready, typo, wrong namespace).
- Wrong --cluster.advertise-address behind NAT/Kubernetes. A replica advertises a pod-internal, container-local, or 127.0.0.1 address that peers cannot route to, so joins time out.
- Only one peer reachable. With several --cluster.peer entries but only one routable, the cluster forms partially and flaps.
- Mismatched cluster on rolling restart. During a rollout, old and new pods briefly disagree on membership and log join failures until DNS catches up.
- StatefulSet headless DNS not ready at startup. Alertmanager boots before alertmanager-operated endpoints are populated, so peer resolution fails on the first attempt.
How to Reproduce the Error
Start two Alertmanagers but point one at a peer that does not exist (or is firewalled):
alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094
If alertmanager-2 is unresolvable or 9094 is blocked, the log shows failed to join cluster and gossip not settled ... before=1 now=1. Fire a test alert and you will receive it twice — once from each isolated instance — confirming the broken de-dup.
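One way to fire that test alert is through the v2 API on the API port. The sketch below only builds the JSON body; test_alert_payload and the alert name GossipDedupTest are hypothetical, and the label set is an assumption you should adapt so the alert matches one of your routes. Post the same payload to every replica, as Prometheus does, otherwise only one instance knows about the alert and no duplicate can appear.

```shell
# Hypothetical helper: print a minimal JSON body for POST /api/v2/alerts.
# The alert name is inserted verbatim, so keep it free of quotes/backslashes.
test_alert_payload() {
  name=${1:-GossipDedupTest}
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"cluster de-dup test"}}]\n' "$name"
}

# Usage (hostnames are the example peers from this guide; 9093 is the API port):
#   for am in alertmanager-1 alertmanager-2; do
#     test_alert_payload | curl -s -X POST -H 'Content-Type: application/json' \
#       --data @- "http://$am:9093/api/v2/alerts"
#   done
```

With a converged cluster the receiver should get one notification; with isolated replicas it gets one per replica.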
Diagnostic Commands
All of the following are read-only. Start with the cluster status on the API port:
curl -s http://localhost:9093/api/v2/status | jq '.cluster'
A converged 3-node cluster:
{"status":"ready","peers":[
{"name":"01H8...","address":"10.0.1.5:9094"},
{"name":"01H8...","address":"10.0.1.6:9094"},
{"name":"01H8...","address":"10.0.1.7:9094"}]}
A degraded cluster shows only itself (and may report settling):
{"status":"ready","peers":[{"name":"01H8...","address":"10.0.1.5:9094"}]}
Test reachability of a peer’s cluster port — both TCP and UDP:
nc -vz alertmanager-1 9094 # TCP
nc -vzu alertmanager-1 9094 # UDP (memberlist needs this too; nc only notices an ICMP rejection, so a silent drop can still report success)
Confirm the port is bound locally and on the right interface:
ss -lntup | grep 9094
Verify peer names resolve:
getent hosts alertmanager-1
Pull cluster lines from the journal:
journalctl -u alertmanager -n 80 --no-pager | grep -i cluster
In Kubernetes, check the headless service backing the StatefulSet:
kubectl get endpoints alertmanager-operated
If ENDPOINTS is empty or missing replicas, peers cannot discover each other and joins fail.
Step-by-Step Resolution
1. Open 9094 TCP and UDP between all replicas. This fixes the most common case. Allow both protocols among the Alertmanager peers — a security group or NetworkPolicy that only opens TCP will still break gossip:
# Kubernetes NetworkPolicy: allow 9094 TCP+UDP between alertmanager pods
- ports:
    - protocol: TCP
      port: 9094
    - protocol: UDP
      port: 9094
Re-test with nc -vz and nc -vzu before moving on.
2. Fix --cluster.advertise-address. Behind NAT or in Kubernetes, set it to the routable IP rather than letting memberlist auto-pick a container-local or loopback address. In a pod, advertise the pod IP:
args:
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.advertise-address=$(POD_IP):9094
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
On a VM behind NAT, set it to the address peers actually dial.
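Before restarting, it can help to sanity-check the value you are about to pass. The following is a rough pre-flight sketch under stated assumptions: advertise_addr_ok is a hypothetical helper, it only rejects addresses that are obviously unroutable (empty, wildcard, loopback), and it cannot tell whether a private IP is actually reachable from the peers.

```shell
# Hypothetical pre-flight check for a --cluster.advertise-address value.
# Returns 0 if the host part is not an obvious loopback/wildcard address,
# 1 otherwise. It does NOT prove that peers can route to the address.
advertise_addr_ok() {
  host=${1%:*}   # strip the trailing ":port", if any
  case "$host" in
    ''|0.0.0.0|127.*|localhost|'[::1]'|'[::]') return 1 ;;
    *) return 0 ;;
  esac
}

# Usage:
#   advertise_addr_ok "$POD_IP:9094" || echo "advertise address is not routable" >&2
```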
3. Ensure every replica lists every other peer. Each instance needs --cluster.peer entries for the others (the Prometheus Operator generates these for you; for manual deployments, list them explicitly):
args:
  - --cluster.peer=alertmanager-0.alertmanager-operated:9094
  - --cluster.peer=alertmanager-1.alertmanager-operated:9094
  - --cluster.peer=alertmanager-2.alertmanager-operated:9094
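For manual deployments, the peer list above can be generated rather than typed. This is a small sketch assuming the StatefulSet naming pattern used in this guide (<statefulset>-<ordinal>.<headless-service>); peer_flags is a hypothetical helper name. It emits every replica, including the local one, mirroring the listing above.

```shell
# Hypothetical helper: print one --cluster.peer flag per StatefulSet replica.
# Args: <statefulset-name> <headless-service> <replica-count> [port, default 9094]
peer_flags() {
  sts=$1
  svc=$2
  replicas=$3
  port=${4:-9094}
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s%s-%s.%s:%s\n' '--cluster.peer=' "$sts" "$i" "$svc" "$port"
    i=$((i + 1))
  done
}

# Usage:
#   peer_flags alertmanager alertmanager-operated 3
```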
4. Verify DNS / the headless service. Confirm getent hosts resolves each peer and kubectl get endpoints alertmanager-operated lists all pods. If DNS is just slow at boot, the join retries and settles — the startup warning is harmless once now= reaches the full count.
5. Restart and watch gossip settle. Roll the replicas and confirm the cluster converges:
journalctl -u alertmanager -f | grep -i cluster
level=info caller=cluster.go:701 component=cluster msg="gossip settled; proceeding" elapsed=6.51s
Then re-check curl .../api/v2/status | jq '.cluster' and confirm peers lists all replicas. Fire a test alert and verify you receive it once.
Prevention and Best Practices
- Alert on the cluster size itself: alertmanager_cluster_members should equal your replica count. Any drop below that count, and especially a drop to 1, is your earliest signal of duplicate notifications.
- Always pin --cluster.advertise-address to the routable IP in NAT/Kubernetes environments; do not rely on auto-detection.
- Open 9094 for both TCP and UDP in every firewall, security group, and NetworkPolicy on the path between replicas.
- Prefer the Prometheus Operator / StatefulSet with the alertmanager-operated headless service so peer DNS and --cluster.peer flags are managed for you.
- Keep replica clocks in sync and run at least two replicas (three is typical) so de-duplication and silence propagation stay reliable.
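The cluster-size check from the first bullet can also be run ad hoc against a replica's metrics endpoint. This is a minimal sketch, assuming the metrics are served on the API port (9093) at /metrics; cluster_members_ok is a hypothetical helper, and a Prometheus alerting rule on the same metric remains the better long-term guard.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# compare alertmanager_cluster_members with the expected replica count.
# Exit status: 0 = match, 1 = mismatch, 2 = metric not present.
cluster_members_ok() {
  expected=$1
  awk -v want="$expected" '
    $1 == "alertmanager_cluster_members" { got = $2; found = 1 }
    END {
      if (!found) { print "metric not found"; exit 2 }
      if (got + 0 != want + 0) { printf "members=%s expected=%s\n", got, want; exit 1 }
      printf "members=%s ok\n", got
    }'
}

# Usage (3 replicas expected):
#   curl -s http://localhost:9093/metrics | cluster_members_ok 3
```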
Related Errors
- alertmanager notifications failing: receivers, SMTP, and webhook delivery breaking. That is the delivery layer; this guide is the cluster/gossip layer. If notifications send but arrive duplicated, you are here, not there.
- alerts stuck pending / not firing: a rule-evaluation problem in Prometheus rather than an Alertmanager clustering one; the alert never reaches Alertmanager at all.
Frequently Asked Questions
Why am I getting duplicate alerts even though Alertmanager is running?
Because the gossip cluster is not converged. Each replica that cannot join the cluster processes alerts and notifies independently, so one alert becomes N notifications. Check curl .../api/v2/status | jq '.cluster': if peers lists only one member, that is the cause.
Is failed to join cluster fatal?
No. Alertmanager continues as a single-node cluster, which is exactly why the symptom is silent duplicate notifications rather than an outage. You must watch the cluster status and alertmanager_cluster_members, not just whether the process is up.
Does opening port 9094 over TCP fix it?
Not on its own. memberlist uses both TCP and UDP on 9094. If only TCP is allowed, the cluster half-forms and flaps. Test both with nc -vz and nc -vzu.
Why does this only happen in Kubernetes / behind NAT?
Because Alertmanager auto-detects an advertise address that peers cannot route to (a pod-internal or loopback IP). Set --cluster.advertise-address=$(POD_IP):9094 so it advertises a reachable address.