Skip to content
CloudOps
Newsletter
All prompts
AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Prometheus HA & Deduplication Prompt

Run Prometheus in HA — paired servers, deduplication strategies (Thanos query, Alertmanager cluster, federation), failover.

Target user
Platform engineers running HA Prometheus
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has run HA Prometheus in production — paired servers, dedup at query time, Alertmanager cluster, failover testing.

I will provide:
- HA pattern (paired Prom + Thanos? Sidecar + querier? Alertmanager cluster?)
- Symptom (duplicate alerts, query inconsistency, failover not working)
- Configuration

Your job:

1. **HA Prometheus patterns**:
   - **Paired Prometheus** — two instances scraping same targets
   - Each has full data; either can serve queries
   - Alertmanager cluster deduplicates alerts
   - Dashboards point to load-balanced query frontend (Thanos Querier, etc.)
2. **For deduplication**:
   - At query time: Thanos `--query.replica-label`
   - Both Prom instances label themselves (`prometheus_replica`)
   - Querier picks one, falls back if not available
3. **For Alertmanager cluster**:
   - Multiple AM instances connected via gossip
   - Same alert from both Prometheuses → one notification
   - `--cluster.peer` for peer discovery
4. **For storage**:
   - Each Prom has independent TSDB
   - Both upload to S3 (Thanos sidecar)
   - Dedup at compact time too
5. **For scrape coordination**:
   - Both Prom scrape same targets independently
   - Slight timing drift; expected
6. **For failover testing**:
   - Drill: stop one Prom; verify queries continue, alerts dedup
   - Monitor: `up{prometheus="..."}` for both
7. **For DR**:
   - Cross-region pair OR remote write to second region
8. **For load balancing**:
   - Query LB across both
   - Health check: `/-/healthy`

Mark DESTRUCTIVE: removing one of paired Prom without backup ingest (gap in coverage), single Alertmanager dependency, scrape config drift between pair.

---

HA pattern: [DESCRIBE]
Symptom: [DESCRIBE]
Configuration:
```yaml
[PASTE]
```

Why this prompt works

HA is non-trivial. This prompt walks patterns.

How to use it

  1. Pick pattern: paired Prom + Thanos + AM cluster is common.
  2. Test failover regularly.
  3. Monitor both halves.
  4. Verify dedup at query time.

Useful commands

# Prometheus self-monitoring
up{job="prometheus"}                  # both should be 1

# Alertmanager cluster
alertmanager_cluster_members
alertmanager_cluster_health_score
curl http://alertmanager:9093/api/v2/status | jq

# Thanos Querier dedup
thanos_query_apis_query_total

# Verify scrape config consistency
diff <(curl prometheus-1/api/v1/status/config | jq) \
     <(curl prometheus-2/api/v1/status/config | jq)

Patterns

Paired Prometheus

# prometheus-1.yaml
global:
  external_labels:
    cluster: prod
    prometheus_replica: replica-1

# prometheus-2.yaml
global:
  external_labels:
    cluster: prod
    prometheus_replica: replica-2

Thanos Querier dedup

- args:
  - query
  - --query.replica-label=prometheus_replica
  - --store=prometheus-1-sidecar:10901
  - --store=prometheus-2-sidecar:10901

Alertmanager cluster

# AM-1
- args:
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=alertmanager-2:9094
  - --cluster.peer=alertmanager-3:9094

# AM-2 (similar with own peers)

In Prometheus:

alerting:
  alertmanagers:
  - static_configs:
    - targets: [alertmanager-1:9093, alertmanager-2:9093, alertmanager-3:9093]

Common findings this catches

  • Duplicate alerts → AM cluster gossip broken; check alertmanager_cluster_members.
  • Query inconsistency → one Prom missing data; check uptime.
  • Scrape config drift → some targets monitored by one, not other.
  • Failover blip → dedup fails momentarily during transition.
  • Both Prom uploading collision → ensure external_labels distinct.
  • AM cluster partition → each side sends notifications.
  • No failover testing → unknown reliability.

When to escalate

  • Major HA redesign — strategic.
  • DR / cross-region — coordinated.
  • AM cluster scale issues — federation.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week