Prometheus Error Guide: 'alertmanager failed to join cluster' Gossip Failure

Exact Error Message

failed to join cluster is emitted by Alertmanager’s gossip/memberlist layer at startup when it cannot reach the peers passed via --cluster.peer:

level=warn ts=2026-06-27T09:14:02.118Z caller=cluster.go:267 component=cluster msg="failed to join cluster" err="3 errors occurred:\n\t* Failed to resolve alertmanager-1:9094: lookup alertmanager-1 on 10.96.0.10:53: no such host\n\t* Failed to resolve alertmanager-2:9094: lookup alertmanager-2 on 10.96.0.10:53: no such host\n\t* Failed to join 10.0.1.6:9094: dial tcp 10.0.1.6:9094: i/o timeout"

Alertmanager does not exit on this. It keeps running as a single-node cluster and logs that gossip never converges:

level=info ts=2026-06-27T09:14:02.119Z caller=cluster.go:704 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2026-06-27T09:14:12.221Z caller=cluster.go:729 component=cluster msg="gossip not settled" polls=5 before=1 now=1 elapsed=10.002s
level=info ts=2026-06-27T09:14:22.330Z caller=cluster.go:721 component=cluster msg="gossip not settled but continuing anyway" polls=10 elapsed=20.21s

before=1 now=1 is the tell: each poll sees only one member (itself). A healthy cluster settles at the full peer count.
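The before=/now= check can be scripted. The following is a minimal sketch, assuming log lines shaped like the ones above; last_gossip_count is a hypothetical helper name, not part of Alertmanager. It prints the most recent now= value so it can be compared with the expected replica count.

```shell
# last_gossip_count: hypothetical helper (not shipped with Alertmanager).
# Reads Alertmanager log lines on stdin and prints the now= value from the
# most recent "gossip not settled" line. Prints nothing if no such line exists.
last_gossip_count() {
  grep 'msg="gossip not settled"' \
    | sed -n 's/.* now=\([0-9][0-9]*\).*/\1/p' \
    | tail -n 1
}
```

For example, journalctl -u alertmanager --no-pager | last_gossip_count printing 1 on a three-replica deployment would suggest the replica only ever saw itself.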

What the Error Means

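HA Alertmanager runs as a cluster of identical replicas that coordinate over a gossip protocol built on the memberlist library (the same library Consul and Serf use). The cluster exists so that every alert is notified once: replicas share notification logs and silences, each replica waits a delay based on its position in the peer list before sending, and a replica skips the send when the shared notification log shows that another replica already delivered it. The result is that the same alert received by N Alertmanagers becomes one page.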

That coordination happens on the cluster port, not the API port:

--cluster.listen-address — where memberlist binds, default :9094 (uses both TCP and UDP).
--cluster.peer — repeated once per other replica, e.g. --cluster.peer=alertmanager-1:9094.
--cluster.advertise-address — the address this replica tells peers to reach it on. Critical behind NAT/Kubernetes.

When a replica logs failed to join cluster and gossip never settles past now=1, it is operating alone. Every replica then receives the same alerts from Prometheus and, with no shared notification log, sends its own notification, so you get duplicate pages, emails, and Slack messages (one per replica), and silences created on one instance do not suppress alerts on the others. The receivers themselves work fine; the de-duplication layer is broken.

Common Causes

Port 9094 blocked. A firewall, cloud security group, or Kubernetes NetworkPolicy drops traffic between replicas. Because the API on 9093 is usually open, the service looks healthy while gossip silently fails.
UDP blocked, TCP allowed. memberlist needs both TCP and UDP on 9094. A rule that opens only TCP lets the cluster half-form and flap.
Wrong / unresolvable peer hostnames. --cluster.peer points at names that do not resolve (headless service not ready, typo, wrong namespace).
Wrong --cluster.advertise-address behind NAT/Kubernetes. A replica advertises a pod-internal, container-local, or 127.0.0.1 address that peers cannot route to, so joins time out.
Only one peer reachable. With several --cluster.peer entries but only one routable, the cluster forms partially and flaps.
Mismatched cluster on rolling restart. During a rollout, old and new pods briefly disagree on membership and log join failures until DNS catches up.
StatefulSet headless DNS not ready at startup. Alertmanager boots before alertmanager-operated endpoints are populated, so peer resolution fails on the first attempt.

How to Reproduce the Error

Start two Alertmanagers but point one at a peer that does not exist (or is firewalled):

alertmanager --config.file=alertmanager.yml \
  --cluster.listen-address=0.0.0.0:9094 \
  --cluster.peer=alertmanager-1:9094 \
  --cluster.peer=alertmanager-2:9094

If alertmanager-2 is unresolvable or 9094 is blocked, the log shows failed to join cluster and gossip not settled ... before=1 now=1. Fire a test alert and you will receive it twice — once from each isolated instance — confirming the broken de-dup.
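One way to fire the test alert is to post it to each replica's API, as Prometheus would. The following is a sketch under stated assumptions: test_alert_payload and send_test_alert are hypothetical helper names, the label and annotation values are illustrative, and the replica URLs are placeholders for your own.

```shell
# test_alert_payload [NAME]: hypothetical helper that prints a minimal JSON
# body for Alertmanager's POST /api/v2/alerts endpoint.
# The severity label and summary annotation are illustrative values.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"info"},"annotations":{"summary":"cluster de-dup test"}}]\n' \
    "${1:-GossipDedupTest}"
}

# send_test_alert URL...: posts the same alert to every replica URL given.
# Prometheus sends each alert to all configured Alertmanagers, so posting to
# each replica mimics that behavior.
send_test_alert() {
  for url in "$@"; do
    test_alert_payload | curl -s -H 'Content-Type: application/json' \
      --data-binary @- "$url/api/v2/alerts"
  done
}
```

Usage, with assumed addresses: send_test_alert http://alertmanager-0:9093 http://alertmanager-1:9093. In a converged cluster this should produce one notification; with isolated replicas, one per replica.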

Diagnostic Commands

All of the following are read-only. Start with the cluster status on the API port:

curl -s http://localhost:9093/api/v2/status | jq '.cluster'

A converged 3-node cluster:

{"status":"ready","peers":[
  {"name":"01H8...","address":"10.0.1.5:9094"},
  {"name":"01H8...","address":"10.0.1.6:9094"},
  {"name":"01H8...","address":"10.0.1.7:9094"}]}

A degraded cluster shows only itself (and may report settling):

{"status":"ready","peers":[{"name":"01H8...","address":"10.0.1.5:9094"}]}

Test reachability of a peer’s cluster port — both TCP and UDP:

nc -vz alertmanager-1 9094     # TCP
nc -vzu alertmanager-1 9094    # UDP (memberlist needs this too; "succeeded" only means no ICMP rejection came back)

Confirm the port is bound locally and on the right interface:

ss -lntup | grep 9094

Verify peer names resolve:

getent hosts alertmanager-1

Pull cluster lines from the journal:

journalctl -u alertmanager -n 80 --no-pager | grep -i cluster

In Kubernetes, check the headless service backing the StatefulSet:

kubectl get endpoints alertmanager-operated

If ENDPOINTS is empty or missing replicas, peers cannot discover each other and joins fail.
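The DNS and port checks above can be combined into one pass over all peers. The following is a rough sketch: check_peers is a hypothetical helper name, and CLUSTER_PORT is an assumed variable that defaults to the 9094 port used throughout this guide. It relies only on getent and nc, and like the commands above it is read-only.

```shell
# CLUSTER_PORT: assumed variable; defaults to the cluster port used in this guide.
CLUSTER_PORT="${CLUSTER_PORT:-9094}"

# check_peers HOST...: hypothetical helper. For each peer hostname, reports
# whether it resolves and whether the cluster port answers over TCP and UDP.
check_peers() {
  for peer in "$@"; do
    if ! getent hosts "$peer" >/dev/null 2>&1; then
      echo "$peer dns=FAIL"
      continue
    fi
    if nc -z -w 2 "$peer" "$CLUSTER_PORT" >/dev/null 2>&1; then tcp=ok; else tcp=FAIL; fi
    # UDP is connectionless: "ok" only means no ICMP rejection came back.
    if nc -z -u -w 2 "$peer" "$CLUSTER_PORT" >/dev/null 2>&1; then udp=ok; else udp=FAIL; fi
    echo "$peer dns=ok tcp=$tcp udp=$udp"
  done
}
```

Usage: check_peers alertmanager-1 alertmanager-2. A dns=FAIL line points at the hostname or headless-service causes, while tcp=FAIL or udp=FAIL points at a firewall, security group, or NetworkPolicy.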

Step-by-Step Resolution

1. Open 9094 TCP and UDP between all replicas. This fixes the most common case. Allow both protocols among the Alertmanager peers — a security group or NetworkPolicy that only opens TCP will still break gossip:

# Kubernetes NetworkPolicy: allow 9094 TCP+UDP between alertmanager pods
- ports:
    - protocol: TCP
      port: 9094
    - protocol: UDP
      port: 9094

Re-test with nc -vz and nc -vzu before moving on.

2. Fix --cluster.advertise-address. Behind NAT or in Kubernetes, set it to the routable IP rather than letting memberlist auto-pick a container-local or loopback address. In a pod, advertise the pod IP:

args:
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.advertise-address=$(POD_IP):9094
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP

On a VM behind NAT, set it to the address peers actually dial.

3. Ensure every replica lists every other peer. Each instance needs --cluster.peer entries for the others (the Prometheus Operator generates these for you; for manual deployments, list them explicitly):

args:
  - --cluster.peer=alertmanager-0.alertmanager-operated:9094
  - --cluster.peer=alertmanager-1.alertmanager-operated:9094
  - --cluster.peer=alertmanager-2.alertmanager-operated:9094

4. Verify DNS / the headless service. Confirm getent hosts resolves each peer and kubectl get endpoints alertmanager-operated lists all pods. If DNS is just slow at boot, the join retries and settles — the startup warning is harmless once now= reaches the full count.

5. Restart and watch gossip settle. Roll the replicas and confirm the cluster converges:

journalctl -u alertmanager -f | grep -i cluster

level=info caller=cluster.go:701 component=cluster msg="gossip settled; proceeding" elapsed=6.51s

Then re-check curl .../api/v2/status | jq '.cluster' and confirm peers lists all replicas. Fire a test alert and verify you receive it once.

Prevention and Best Practices

Alert on the cluster size itself: alertmanager_cluster_members should equal your replica count. A drop to 1 is your earliest signal of duplicate notifications.
Always pin --cluster.advertise-address to the routable IP in NAT/Kubernetes environments; do not rely on auto-detection.
Open 9094 for both TCP and UDP in every firewall, security group, and NetworkPolicy on the path between replicas.
Prefer the Prometheus Operator / StatefulSet with the alertmanager-operated headless service so peer DNS and --cluster.peer flags are managed for you.
Keep replica clocks in sync and run a small, fixed replica count (typically 3) so de-duplication and silence propagation stay reliable. Gossip has no quorum, so the count does not strictly need to be odd.
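The cluster-size check from the first practice can also be run ad hoc against the metrics endpoint. The following is a minimal sketch: cluster_members and check_cluster_size are hypothetical helper names, and it assumes alertmanager_cluster_members is exposed as an unlabelled gauge in the Prometheus text format.

```shell
# cluster_members: hypothetical helper. Reads Prometheus text-format metrics
# on stdin and prints the value of alertmanager_cluster_members, if present.
cluster_members() {
  awk '$1 == "alertmanager_cluster_members" { print $2; exit }'
}

# check_cluster_size EXPECTED: succeeds only when the metric read from stdin
# equals EXPECTED. A missing metric is reported as 0 members and fails.
check_cluster_size() {
  expected="$1"
  actual="$(cluster_members)"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $actual members"
  else
    echo "degraded: ${actual:-0} of $expected members"
    return 1
  fi
}
```

Usage: curl -s http://localhost:9093/metrics | check_cluster_size 3. The non-zero exit status on a mismatch makes it usable in a cron job or deployment smoke test.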

Related Errors

alertmanager notifications failing: receivers, SMTP, and webhook delivery breaking. That is the delivery layer; this guide is the cluster/gossip layer. If notifications send but arrive duplicated, you are here, not there.
alerts stuck pending / not firing: a rule-evaluation problem in Prometheus rather than an Alertmanager clustering one; the alert never reaches Alertmanager at all.

Frequently Asked Questions

Why am I getting duplicate alerts even though Alertmanager is running? Because the gossip cluster is not converged. Each replica that cannot join the cluster receives the same alerts and notifies independently, so one alert becomes N notifications. Check curl .../api/v2/status | jq '.cluster'; if peers lists only one member, that is the cause.

Is failed to join cluster fatal? No. Alertmanager continues as a single-node cluster, which is exactly why the symptom is silent duplicate notifications rather than an outage. You must watch the cluster status and alertmanager_cluster_members, not just whether the process is up.

Does opening port 9094 over TCP fix it? Not on its own. memberlist uses both TCP and UDP on 9094. If only TCP is allowed, the cluster half-forms and flaps. Test both with nc -vz and nc -vzu.

Why does this only happen in Kubernetes / behind NAT? Because Alertmanager auto-detects an advertise address that peers cannot route to (a pod-internal or loopback IP). Set --cluster.advertise-address=$(POD_IP):9094 so it advertises a reachable address.
