Prometheus High Availability and Federation, Done Right

A single Prometheus is a single point of failure for your visibility: when it goes down, you’re flying blind at exactly the moment you can least afford it. So people run two. Then the graphs flicker between replicas, alerts double-fire, and someone bolts on federation that quietly melts the parent server. Prometheus HA is simple in concept and full of sharp edges in practice. Here’s the architecture that holds up.

HA the Prometheus way: run two identical replicas

Prometheus has no built-in clustering, and that’s deliberate. The supported HA pattern is almost crude: run two (or more) identical Prometheus instances, with the same scrape config, scraping the same targets independently. Each is a complete, standalone copy. There’s no leader election, no shared state, no consensus protocol to break.

# Both replicas get an identical config plus a distinguishing label
global:
  external_labels:
    cluster: prod-us-east
    replica: A   # B on the other instance

The replica external label is what lets a downstream system tell them apart and de-duplicate later. Everything else is identical on purpose.
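If you would rather ship one byte-identical file to both instances, Prometheus can expand environment variables in external label values. A minimal sketch, assuming an environment variable named REPLICA (an illustrative name, not a Prometheus convention) is set to A or B by whatever launches each instance:

```yaml
# prometheus.yml, identical on both replicas.
# Assumes REPLICA=A or REPLICA=B is present in each instance's environment;
# the variable name is hypothetical.
global:
  external_labels:
    cluster: prod-us-east
    replica: ${REPLICA}
```

On Prometheus 2.x this expansion sits behind --enable-feature=expand-external-labels; newer releases may enable it by default, so check the behavior on the version you run.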

The flickering-graph problem and how to kill it

Point Grafana at two replicas behind a round-robin load balancer and your graphs flicker — each request hits a different replica whose scrape happened a few seconds offset from the other, so the latest point jumps around. The fix is de-duplication, and you don’t do it in the load balancer.

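You do it in a query layer that understands the replica label. Thanos Querier reads from both replicas (typically through a Thanos sidecar next to each one), treats the replica label as a deduplication key, and merges the series so the gaps in one are filled by the other. Mimir gets the same result differently: its distributors deduplicate HA pairs at ingest, so queries only ever see one copy. For Thanos, the setting is: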

# Thanos Querier
--query.replica-label=replica

This is the load-bearing insight people miss: HA dedup belongs in the query layer, not the LB. A naive LB gives you flicker; a replica-aware querier gives you a seamless, gap-filled series even when one replica was down for a scrape. For the long-term-storage side of this, see Thanos vs Mimir.

Alerting HA: send duplicates on purpose

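Both replicas run the same alert rules, so both fire the same alerts. You do not try to prevent that. Instead, both send to a clustered Alertmanager, and Alertmanager deduplicates identical alerts before they reach a human. “Identical” means identical label sets, so strip the replica external label from outgoing alerts; otherwise the A and B copies look like two different alerts:

# Each Prometheus points at ALL Alertmanagers directly (no load balancer in between)
alerting:
  # Drop the replica label so both replicas emit the same alert
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['am-1:9093', 'am-2:9093', 'am-3:9093']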

# Alertmanagers gossip to form a cluster (run on am-1; am-2 and am-3 list their own peers)
alertmanager --cluster.peer=am-2:9094 --cluster.peer=am-3:9094

The Alertmanager cluster shares notification state over its gossip protocol, so even with both Prometheus replicas firing and three Alertmanagers running, you get one page. Send everything, dedup at Alertmanager — same philosophy as the metrics path.

Federation: a sharp tool, not a scaling strategy

Federation lets one Prometheus scrape a selected set of series from another via the /federate endpoint. The correct use is hierarchical aggregation: per-cluster Prometheus servers compute rolled-up series with recording rules, and a global Prometheus federates only those small aggregates for a cross-cluster view.

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # ONLY pull pre-aggregated recording-rule output
        - '{__name__=~"job:.*"}'
        - '{__name__=~"cluster:.*"}'
    static_configs:
      - targets: ['prom-cluster-a:9090', 'prom-cluster-b:9090']

The number-one federation mistake is matching {__name__=~".+"} and trying to pull every series up to a parent. That copies the full cardinality of every child into the parent, the parent falls over, and you blame federation. Federation is for a small set of aggregates, never for centralizing raw metrics. If you want all the raw data centrally, that’s a Thanos/Mimir job, not federation.

Recording rules make federation viable

Because you should only federate aggregates, you have to create those aggregates first — with recording rules on each child:

groups:
  - name: federation-aggregates
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: cluster:cpu_usage:ratio
        expr: 1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))

The global server federates job:* and cluster:* and stays small and fast. The naming convention (level:metric:operation) makes the match selectors trivial.
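Once the aggregates arrive on the global server, cross-cluster views are ordinary PromQL over a handful of series. As a sketch, a rule file on the global Prometheus might roll the federated series up one more level. The group name and the global: prefix are illustrative choices of mine, and the expression assumes each cluster is federated from a single replica; federating both replicas of a pair would double the sum.

```yaml
groups:
  - name: global-rollups            # hypothetical group name
    rules:
      # Request rate per job, summed across every federated cluster.
      # The child's cluster and replica external labels are aggregated away here.
      - record: global:http_requests:rate5m
        expr: sum by (job) (job:http_requests:rate5m)
```

Keeping the same level:metric:operation naming on the global server means dashboards can tell at a glance which tier a series was computed on.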

When to stop federating and adopt Thanos/Mimir

Federation has a ceiling. Once you want global raw data, long retention, downsampling, or a single query plane over dozens of clusters, federation becomes a liability and you’ve outgrown it. That’s the boundary where a remote-write-based system (Thanos Receive or Mimir) earns its complexity. A rough rule I use: federation is fine for a handful of clusters and a few hundred aggregate series per child; past that, ship to a horizontally-scaled backend instead.
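On the Prometheus side, moving from federation to a remote-write backend is a small config change. A minimal sketch with placeholder hostnames; the paths shown are the default push endpoints for Mimir and Thanos Receive as I understand them, so confirm them against your deployment:

```yaml
# Added to each replica's prometheus.yml. Pick one of the two endpoints.
remote_write:
  # Mimir (hostname is a placeholder)
  - url: http://mimir.example.internal/api/v1/push
  # Thanos Receive alternative (placeholder host, default remote-write port):
  # - url: http://thanos-receive.example.internal:19291/api/v1/receive
```

The external_labels from each replica travel with every sample, which is what the backend relies on to tell the two halves of an HA pair apart.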

A reference architecture

The setup I’d stand up today for multi-cluster HA:

Per cluster: two identical Prometheus replicas (replica: A/B), scraping locally, computing aggregate recording rules.
Alerting: both replicas drop the replica label from alerts and point directly at a 3-node Alertmanager cluster that dedups.
Query/HA dedup: Thanos Querier over the replicas, configured with --query.replica-label=replica (or Mimir, which deduplicates HA pairs at ingest instead).
Cross-cluster view: either federate job:*/cluster:* aggregates to a global Prometheus, or — at scale — remote-write everything to Thanos/Mimir and skip federation.
Routing: Alertmanager routes the deduplicated alerts into your existing paging and notification pipeline.

That’s resilient to a replica dying, an Alertmanager dying, and a whole cluster going dark — without flickering graphs or duplicate pages.

The throughline

Prometheus HA isn’t about preventing duplication — it’s about embracing it and deduplicating at the right layer: the query layer for metrics, Alertmanager for alerts. Federation is a precision tool for shipping aggregates upward, not a bulk data mover. Keep those two ideas straight and an HA Prometheus stack stops being fragile.

Prometheus, Thanos, and Alertmanager configurations change across versions. Validate this architecture against your own environment and the official docs before deploying.