AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Alertmanager HA Cluster & Gossip Mesh Design Prompt

Design and debug a highly available Alertmanager cluster — gossip mesh, notification deduplication across replicas, and split-brain avoidance — so alerts fire exactly once during failures.

Target user: Platform/SRE teams running redundant Alertmanager replicas in production
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who has run multi-replica Alertmanager clusters across regions and debugged duplicate-page incidents caused by gossip mesh misconfiguration.

I will provide:
- My current Alertmanager topology (replica count, zones/regions, k8s or VM)
- The `--cluster.*` flags / config in use
- How Prometheus is wired to the replicas (all vs subset)
- Symptoms (duplicate notifications, missed alerts, flapping cluster status)

Your job:

1. **Explain the HA model** — Alertmanager does NOT use a leader. Every replica receives every alert (Prometheus fans out to all), and replicas gossip notification state so only one actually sends. Make sure I understand this before changing anything.

2. **Prometheus wiring** — confirm Prometheus `alertmanagers` lists ALL replicas (not behind a load balancer that picks one). Explain why an LB in front breaks dedup, and how to configure static or SD-based discovery.

3. **Gossip flags** — walk through `--cluster.listen-address`, `--cluster.peer`, `--cluster.gossip-interval`, `--cluster.pushpull-interval`, `--cluster.peer-timeout`. Give sane values for 3 replicas and call out which ones to raise for cross-region latency.

4. **Dedup timing** — explain `--cluster.peer-timeout` and the per-replica notification delay (each replica waits its position × peer-timeout before sending). Show how this prevents duplicates and what happens when a peer is partitioned.

5. **Health checks** — give the readiness/liveness setup, the `/-/ready` semantics during cluster settling, and how to read `/api/v2/status` cluster.peers to confirm full mesh.

6. **Failure scenarios** — table of: 1 replica down (expected: others cover), network partition (expected: both halves notify — accepted tradeoff), all-down (no alerts — needs a dead-man's-switch). Explain the deliberate "duplicate over miss" bias.

7. **k8s specifics** — StatefulSet, headless service for peer discovery, `--cluster.peer` via pod DNS, podAntiAffinity across zones.

8. **Validation** — a chaos test plan: kill one replica mid-alert, partition the mesh, and the expected notification count for each.

Output as: (a) corrected `--cluster.*` flags, (b) Prometheus `alertmanagers` block, (c) k8s StatefulSet + headless Service snippet, (d) the failure-scenario table, (e) a curl-based mesh health check.

Bias toward correctness and explicit tradeoffs; never hide the partition-duplicates behavior.

Free: the DevOps AI Incident-Triage Cheat Sheet