You are a senior platform engineer who has run HA Prometheus in production — paired servers, dedup at query time, Alertmanager cluster, failover testing. I will provide: - HA pattern (paired Prom + Thanos? Sidecar + querier? Alertmanager cluster?) - Symptom (duplicate alerts, query inconsistency, failover not working) - Configuration Your job: 1. **HA Prometheus patterns**: - **Paired Prometheus** — two instances scraping same targets - Each has full data; either can serve queries - Alertmanager cluster deduplicates alerts - Dashboards point to load-balanced query frontend (Thanos Querier, etc.) 2. **For deduplication**: - At query time: Thanos `--query.replica-label` - Both Prom instances label themselves (`prometheus_replica`) - Querier picks one, falls back if not available 3. **For Alertmanager cluster**: - Multiple AM instances connected via gossip - Same alert from both Prometheuses → one notification - `--cluster.peer` for peer discovery 4. **For storage**: - Each Prom has independent TSDB - Both upload to S3 (Thanos sidecar) - Dedup at compact time too 5. **For scrape coordination**: - Both Prom scrape same targets independently - Slight timing drift; expected 6. **For failover testing**: - Drill: stop one Prom; verify queries continue, alerts dedup - Monitor: `up{prometheus="..."}` for both 7. **For DR**: - Cross-region pair OR remote write to second region 8. **For load balancing**: - Query LB across both - Health check: `/-/healthy` Mark DESTRUCTIVE: removing one of paired Prom without backup ingest (gap in coverage), single Alertmanager dependency, scrape config drift between pair. --- HA pattern: [DESCRIBE] Symptom: [DESCRIBE] Configuration: ```yaml [PASTE] ```

Why this prompt works

HA is non-trivial. This prompt walks patterns.

How to use it

Pick pattern: paired Prom + Thanos + AM cluster is common.
Test failover regularly.
Monitor both halves.
Verify dedup at query time.

Useful commands

# Prometheus self-monitoring
up{job="prometheus"}                  # both should be 1

# Alertmanager cluster
alertmanager_cluster_members
alertmanager_cluster_health_score
curl http://alertmanager:9093/api/v2/status | jq

# Thanos Querier dedup
thanos_query_apis_query_total

# Verify scrape config consistency
diff <(curl prometheus-1/api/v1/status/config | jq) \
     <(curl prometheus-2/api/v1/status/config | jq)

Patterns

Paired Prometheus

# prometheus-1.yaml
global:
  external_labels:
    cluster: prod
    prometheus_replica: replica-1

# prometheus-2.yaml
global:
  external_labels:
    cluster: prod
    prometheus_replica: replica-2

Thanos Querier dedup

- args:
  - query
  - --query.replica-label=prometheus_replica
  - --store=prometheus-1-sidecar:10901
  - --store=prometheus-2-sidecar:10901

Alertmanager cluster

# AM-1
- args:
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-2:9094
  - --cluster.peer=alertmanager-3:9094

# AM-2 (similar with own peers)

In Prometheus:

alerting:
  alertmanagers:
  - static_configs:
    - targets: [alertmanager-1:9093, alertmanager-2:9093, alertmanager-3:9093]

Common findings this catches

Duplicate alerts → AM cluster gossip broken; check alertmanager_cluster_members.
Query inconsistency → one Prom missing data; check uptime.
Scrape config drift → some targets monitored by one, not other.
Failover blip → dedup fails momentarily during transition.
Both Prom uploading collision → ensure external_labels distinct.
AM cluster partition → each side sends notifications.
No failover testing → unknown reliability.

When to escalate

Major HA redesign — strategic.
DR / cross-region — coordinated.
AM cluster scale issues — federation.

Prometheus HA & Deduplication Prompt

Why this prompt works

How to use it

Useful commands

Patterns

Paired Prometheus

Thanos Querier dedup

Alertmanager cluster

Common findings this catches

When to escalate

Related prompts

Alertmanager Routing, Grouping & Receivers Prompt

Prometheus Storage, Retention & TSDB Prompt

Thanos Architecture & Component Debug Prompt

Why this prompt works

How to use it

Useful commands

Patterns

Paired Prometheus

Thanos Querier dedup

Alertmanager cluster

Common findings this catches

When to escalate

Related prompts

Alertmanager Routing, Grouping & Receivers Prompt

Prometheus Storage, Retention & TSDB Prompt

Thanos Architecture & Component Debug Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet