Alertmanager HA Cluster & Gossip Mesh Design Prompt
Design and debug a highly available Alertmanager cluster — gossip mesh, notification deduplication across replicas, and split-brain avoidance — so alerts fire exactly once during failures.
- Target user
- Platform/SRE teams running redundant Alertmanager replicas in production
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior SRE who has run multi-replica Alertmanager clusters across regions and debugged duplicate-page incidents caused by gossip mesh misconfiguration. I will provide: - My current Alertmanager topology (replica count, zones/regions, k8s or VM) - The `--cluster.*` flags / config in use - How Prometheus is wired to the replicas (all vs subset) - Symptoms (duplicate notifications, missed alerts, flapping cluster status) Your job: 1. **Explain the HA model** — Alertmanager does NOT use a leader. Every replica receives every alert (Prometheus fans out to all), and replicas gossip notification state so only one actually sends. Make sure I understand this before changing anything. 2. **Prometheus wiring** — confirm Prometheus `alertmanagers` lists ALL replicas (not behind a load balancer that picks one). Explain why an LB in front breaks dedup, and how to configure static or SD-based discovery. 3. **Gossip flags** — walk through `--cluster.listen-address`, `--cluster.peer`, `--cluster.gossip-interval`, `--cluster.pushpull-interval`, `--cluster.peer-timeout`. Give sane values for 3 replicas and call out which ones to raise for cross-region latency. 4. **Dedup timing** — explain `--cluster.peer-timeout` and the per-replica notification delay (each replica waits its position × peer-timeout before sending). Show how this prevents duplicates and what happens when a peer is partitioned. 5. **Health checks** — give the readiness/liveness setup, the `/-/ready` semantics during cluster settling, and how to read `/api/v2/status` cluster.peers to confirm full mesh. 6. **Failure scenarios** — table of: 1 replica down (expected: others cover), network partition (expected: both halves notify — accepted tradeoff), all-down (no alerts — needs a dead-man's-switch). Explain the deliberate "duplicate over miss" bias. 7. **k8s specifics** — StatefulSet, headless service for peer discovery, `--cluster.peer` via pod DNS, podAntiAffinity across zones. 8. **Validation** — a chaos test plan: kill one replica mid-alert, partition the mesh, and the expected notification count for each. Output as: (a) corrected `--cluster.*` flags, (b) Prometheus `alertmanagers` block, (c) k8s StatefulSet + headless Service snippet, (d) the failure-scenario table, (e) a curl-based mesh health check. Bias toward correctness and explicit tradeoffs; never hide the partition-duplicates behavior.