Prometheus HA & Deduplication Prompt
Run Prometheus in HA — paired servers, deduplication strategies (Thanos query, Alertmanager cluster, federation), failover.
- Target user: Platform engineers running HA Prometheus
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has run HA Prometheus in production — paired servers, dedup at query time, Alertmanager cluster, failover testing.
I will provide:
- HA pattern (paired Prom + Thanos? Sidecar + querier? Alertmanager cluster?)
- Symptom (duplicate alerts, query inconsistency, failover not working)
- Configuration
Your job:
1. **HA Prometheus patterns**:
- **Paired Prometheus** — two instances scraping same targets
- Each has full data; either can serve queries
- Alertmanager cluster deduplicates alerts
- Dashboards point to load-balanced query frontend (Thanos Querier, etc.)
2. **For deduplication**:
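- At query time: Thanos Querier `--query.replica-label=prometheus_replica`
- Each Prom instance sets a distinct value for that label in `external_labels` (`prometheus_replica`)
- Querier treats series that differ only in the replica label as one series, reading from one replica and switching to the other when it has a gap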
3. **For Alertmanager cluster**:
- Multiple AM instances connected via gossip
- Same alert from both Prometheuses → one notification
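- `--cluster.peer` (repeatable) lists the initial peers each instance joins
- Prometheus sends every alert to all AM instances directly, not through a load balancer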
4. **For storage**:
- Each Prom has independent TSDB
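- Each instance's Thanos sidecar uploads its own blocks to object storage (e.g. S3)
- Optional offline dedup in the Thanos Compactor (`--deduplication.replica-label`); query-time dedup works without it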
5. **For scrape coordination**:
- Both Prom scrape same targets independently
- Slight timing drift; expected
6. **For failover testing**:
- Drill: stop one Prom; verify queries continue, alerts dedup
- Monitor: `up{prometheus="..."}` for both
7. **For DR**:
- Cross-region pair OR remote write to second region
8. **For load balancing**:
- Query LB across both
- Health check: `/-/ready` for routing traffic; `/-/healthy` only confirms the process is up
Mark DESTRUCTIVE: removing one of a Prometheus pair with no replacement ingest path (gap in coverage), depending on a single Alertmanager, letting scrape config drift between the pair.
---
HA pattern: [DESCRIBE]
Symptom: [DESCRIBE]
Configuration:
```yaml
[PASTE]
```
Why this prompt works
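HA Prometheus has several moving parts: replicated servers, query-time dedup, Alertmanager gossip, and failover. The prompt walks through each one so a symptom can be traced to the layer that causes it.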
How to use it
- Pick pattern: paired Prom + Thanos + AM cluster is common.
- Test failover regularly.
- Monitor both halves.
- Verify dedup at query time.
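The "monitor both halves" step can be expressed as alerting rules. The following is a minimal sketch, assuming each replica scrapes both Prometheus instances under a job named `prometheus`; the file name, group name, alert names, `for` durations, and severity labels are illustrative choices, not part of the original prompt.

```yaml
# ha-prometheus.rules.yaml (hypothetical file name)
groups:
  - name: prometheus-ha
    rules:
      # Fires on the surviving replica when its peer fails a scrape.
      - alert: PrometheusReplicaDown
        expr: up{job="prometheus"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus replica {{ $labels.instance }} is down"
      # Fires when fewer than two replicas are up, including the case
      # where one target has dropped out of the scrape config entirely.
      - alert: PrometheusPairDegraded
        expr: count(up{job="prometheus"} == 1) < 2
        for: 5m
        labels:
          severity: warning
```

Because both replicas evaluate the same rules, both will fire these alerts while they are healthy enough to do so; the Alertmanager cluster is what collapses them into one notification.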
Useful commands
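# Prometheus self-monitoring (PromQL)
up{job="prometheus"}  # both should be 1
# Alertmanager cluster (PromQL, then the status API)
alertmanager_cluster_members       # should equal the number of AM instances
alertmanager_cluster_health_score  # 0 is healthy; higher is worse
curl -s http://alertmanager:9093/api/v2/status | jq
# Thanos Querier dedup
thanos_query_apis_query_total
# Verify scrape config consistency (adjust host and port to your setup)
# The only expected difference is the prometheus_replica external label.
diff <(curl -s http://prometheus-1:9090/api/v1/status/config | jq -r .data.yaml) \
     <(curl -s http://prometheus-2:9090/api/v1/status/config | jq -r .data.yaml)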
Patterns
Paired Prometheus
# prometheus-1.yaml
global:
  external_labels:
    cluster: prod
    prometheus_replica: replica-1
# prometheus-2.yaml
global:
  external_labels:
    cluster: prod
    prometheus_replica: replica-2
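For `up{job="prometheus"}` to report on both halves, each replica needs to scrape itself and its peer. A sketch of the part that stays identical in both files, assuming the hostnames `prometheus-1` and `prometheus-2` and the default port 9090 (the port is an assumption; it is not given in the configs here):

```yaml
# Identical in prometheus-1.yaml and prometheus-2.yaml;
# only global.external_labels.prometheus_replica differs between the two.
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - prometheus-1:9090  # assumed host:port
          - prometheus-2:9090  # assumed host:port
```

Keeping everything except the replica label byte-for-byte identical is what makes a config diff between the two servers a useful drift check.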
Thanos Querier dedup
# Recent Thanos releases deprecate --store in favor of --endpoint; check your version.
- args:
    - query
    - --query.replica-label=prometheus_replica
    - --store=prometheus-1-sidecar:10901
    - --store=prometheus-2-sidecar:10901
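The Querier addresses assume a Thanos sidecar running next to each Prometheus. A hedged sketch of the sidecar arguments for replica 1 follows; the TSDB path, the Prometheus URL, and the object store config path are assumptions for illustration, and port 10901 is taken from the Querier example.

```yaml
# Sidecar container for prometheus-1 (replica 2 uses the same arguments)
- args:
    - sidecar
    # Must match the co-located Prometheus --storage.tsdb.path (assumed path).
    - --tsdb.path=/prometheus
    # The Prometheus this sidecar fronts (assumed to share a network namespace).
    - --prometheus.url=http://localhost:9090
    # StoreAPI endpoint that the Querier connects to.
    - --grpc-address=0.0.0.0:10901
    # Hypothetical path to the bucket configuration; leave this flag out
    # if the sidecar should serve queries without uploading blocks.
    - --objstore.config-file=/etc/thanos/objstore.yaml
```

When uploads are enabled, the Thanos documentation recommends running Prometheus with `--storage.tsdb.min-block-duration` and `--storage.tsdb.max-block-duration` set to the same value (2h) so that local compaction does not interfere with the blocks the sidecar ships.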
Alertmanager cluster
# AM-1
- args:
    - --cluster.listen-address=0.0.0.0:9094
    - --cluster.peer=alertmanager-2:9094
    - --cluster.peer=alertmanager-3:9094
# AM-2 and AM-3 use the same flags, each listing the other two instances as peers
In Prometheus:
alerting:
  alertmanagers:
    - static_configs:
        - targets: [alertmanager-1:9093, alertmanager-2:9093, alertmanager-3:9093]
Common findings this catches
- Duplicate alerts → AM cluster gossip broken; check `alertmanager_cluster_members`.
- Query inconsistency → one Prom missing data; check uptime.
- Scrape config drift → some targets monitored by one, not other.
- Failover blip → dedup fails momentarily during transition.
- Both Prom uploading collision → ensure `external_labels` distinct.
- AM cluster partition → each side sends notifications.
- No failover testing → unknown reliability.
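The two Alertmanager findings (broken gossip and a partitioned cluster) can be caught with rules on the cluster metrics listed under Useful commands. This is a sketch assuming the three-instance cluster from the Patterns section; the group name, alert names, thresholds, and `for` durations are illustrative.

```yaml
groups:
  - name: alertmanager-cluster
    rules:
      # Each instance should see all three members; fewer suggests
      # broken gossip or a partition.
      - alert: AlertmanagerClusterMembersLow
        expr: alertmanager_cluster_members < 3
        for: 5m
        labels:
          severity: warning
      # A health score of 0 means healthy; a sustained non-zero score
      # points at peers that are unreachable or slow to respond.
      - alert: AlertmanagerClusterUnhealthy
        expr: alertmanager_cluster_health_score > 0
        for: 10m
        labels:
          severity: warning
```

If the cluster size changes, the `< 3` threshold has to change with it; comparing against a count of scraped Alertmanager targets is an alternative that avoids the hard-coded number.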
When to escalate
- Major HA redesign — strategic.
- DR / cross-region — coordinated.
- AM cluster scale issues — federation.
Related prompts
- Alertmanager Routing, Grouping & Receivers Prompt
Design Alertmanager routes — receivers (Slack, PagerDuty), grouping, inhibition, repeat intervals, mute timings.
- Prometheus Storage, Retention & TSDB Prompt
Configure Prometheus TSDB — retention, block size, compaction, WAL, disk sizing, troubleshooting OOM / disk-full.
- Thanos Architecture & Component Debug Prompt
Operate Thanos — Sidecar, Receive, Store Gateway, Compactor, Querier, Ruler; troubleshoot dedup, downsampling, S3 issues.