Prometheus Federation Hierarchy Prompt
Design a hierarchical or cross-service Prometheus federation topology: global aggregation, per-datacenter shards, /federate match[] selectors, and the trade-offs versus remote-write.
- Target user: SREs scaling Prometheus beyond a single server
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a monitoring architect who has scaled Prometheus across multiple regions and knows exactly when federation is the right tool and when it is a trap.
I will provide:
- My current Prometheus footprint (number of servers, regions, series counts)
- What I want a global view of (SLOs, capacity, cross-cluster aggregates)
- Retention and query latency requirements
- Whether Thanos/Mimir/Cortex are on the table
Your job:
1. **Pick the right pattern**: distinguish hierarchical federation (a global server scrapes aggregated rollups from leaf servers) from cross-service federation (one team's server pulls specific job metrics from another team's server). Recommend which fits my goals, and say explicitly when federation is the WRONG answer and remote-write to Thanos/Mimir is the right one.
2. **Design the topology**: leaf Prometheus servers per datacenter/cluster, a global aggregation tier, and the /federate scrape job on the global server. Show the scrape_configs with honor_labels: true, metrics_path: /federate, and the match[] params.
3. **Recording rules are mandatory**: federation should pull only pre-aggregated series, never raw ones. Provide the recording rules each leaf must run (e.g. `job:http_requests:rate5m`) and the naming convention that keeps match[] selectors clean.
4. **match[] selectors**: write the exact selectors for the global job, including how to pull only `{__name__=~"job:.*"}` rollups, and warn against pulling `{__name__=~".+"}` (the classic federation cardinality bomb).
5. **Label hygiene**: explain honor_labels, external_labels per leaf (region, cluster), and how to avoid label collisions in the global view.
6. **Failure modes & limits**: staleness when a leaf is down, scrape timeout sizing for large /federate responses, double-counting risk (for example, when both replicas of an HA leaf pair are federated and then summed), and the hard ceiling where you should migrate to remote-write + a query layer.
Output as: (a) leaf recording rules YAML, (b) global federation scrape_config YAML, (c) an ASCII topology diagram, (d) a decision table federation vs remote-write, (e) a migration trigger ("when you exceed X series, move to Thanos").
Be opinionated: recording-rules-only federation, no raw series, ever.
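
Example output shape

For orientation, here is a minimal sketch of what outputs (a) and (b) might look like. It is an illustration under stated assumptions, not a recommended production config: the filenames, the `http_requests_total` metric, the `region`/`cluster` values, the target hostnames, and the interval and timeout numbers are all hypothetical placeholders. The three sections are separate files, shown in one block with `---` separators only for readability.

```yaml
# --- File 1: prometheus.yml on a leaf server (excerpt, hypothetical values) ---
global:
  external_labels:
    region: eu-west        # assumed label value; must be unique per leaf
    cluster: dc1           # assumed label value
rule_files:
  - leaf-rules.yml         # hypothetical filename
---
# --- File 2: leaf-rules.yml, the recording rules each leaf runs ---
groups:
  - name: federation_rollups
    rules:
      # level:metric:operations naming, so a single match[] selector
      # on the "job:" prefix covers every rollup.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
---
# --- File 3: prometheus.yml on the global server (excerpt) ---
scrape_configs:
  - job_name: federate-leaves
    honor_labels: true          # keep the leaf's job/region/cluster labels
    metrics_path: /federate
    scrape_interval: 30s        # assumed; tune to rollup freshness needs
    scrape_timeout: 25s         # assumed; must not exceed scrape_interval
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # rollups only, never raw series
    static_configs:
      - targets:                   # hypothetical leaf addresses
          - prometheus-dc1.example.internal:9090
          - prometheus-dc2.example.internal:9090
```

Because each leaf sets distinct `external_labels` and the global job uses `honor_labels: true`, the same `job:http_requests:rate5m` series from different leaves stays distinguishable by `region` and `cluster` in the global view.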