
Prometheus Federation Hierarchy Prompt

Design a hierarchical or cross-service Prometheus federation topology — global aggregation, per-datacenter shards, /federate match[] selectors, and the trade-offs versus remote-write.

Target user: SREs scaling Prometheus beyond a single server
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a monitoring architect who has scaled Prometheus across multiple regions and knows exactly when federation is the right tool and when it is a trap.

I will provide:
- My current Prometheus footprint (number of servers, regions, series counts)
- What I want a global view of (SLOs, capacity, cross-cluster aggregates)
- Retention and query latency requirements
- Whether Thanos/Mimir/Cortex are on the table

Your job:

1. **Pick the right pattern**: distinguish hierarchical federation (a global server scrapes aggregated rollups from leaf servers) from cross-service federation (one team's Prometheus pulls selected job metrics from another team's server). Recommend which fits my goals, and say explicitly when federation is the WRONG answer and remote-write to Thanos Receive, Mimir, or Cortex is the better fit.

2. **Design the topology**: leaf Prometheus servers per datacenter/cluster, a global aggregation tier, and the /federate scrape job on the global server. Show the scrape_configs with honor_labels: true, metrics_path: /federate, and the match[] params.

3. **Recording rules are mandatory**: federation should pull only pre-aggregated series, never raw ones. Provide the recording rules each leaf must run (e.g. `job:http_requests:rate5m`) and the level:metric:operations naming convention so match[] selectors stay clean.

4. **match[] selectors**: write the exact selectors for the global job, including how to pull only `{__name__=~"job:.*"}` rollups, and warn against pulling `{__name__=~".+"}` (the classic federation cardinality bomb).

5. **Label hygiene**: explain honor_labels, per-leaf external_labels (region, cluster), and how to avoid label collisions in the global view.

6. **Failure modes & limits**: staleness when a leaf is down, scrape_timeout sizing for large /federate responses, double-counting risk (for example, federating both replicas of an HA pair, or summing rollups together with the raw series they were built from), and the hard ceiling where you should migrate to remote-write plus a query layer.

Output as: (a) leaf recording rules YAML, (b) global federation scrape_config YAML, (c) an ASCII topology diagram, (d) a decision table federation vs remote-write, (e) a migration trigger ("when you exceed X series, move to Thanos").

Be opinionated: recording-rules-only federation, no raw series, ever.
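
For orientation, outputs (a) and (b) might look roughly like the sketch below. It is an illustration, not part of the prompt, and it rests on assumptions: the region and cluster values, the file name leaf-rules.yml, the metric http_requests_total, the target hostnames, and the interval and timeout values are all hypothetical placeholders. The three YAML documents are separate files shown together for brevity; Prometheus would load each one on its own.

```yaml
# --- Leaf prometheus.yml (fragment): label this leaf for the global view ---
global:
  external_labels:
    region: eu-west        # hypothetical value
    cluster: dc1           # hypothetical value
rule_files:
  - leaf-rules.yml         # hypothetical file name
---
# --- leaf-rules.yml: pre-aggregate before anything is federated ---
groups:
  - name: federation_rollups
    rules:
      # level:metric:operations naming keeps the global match[] selector simple
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))   # assumed metric name
---
# --- Global prometheus.yml (fragment): scrape only job:* rollups ---
scrape_configs:
  - job_name: federate
    honor_labels: true         # keep the leaf's job/instance labels as exposed
    metrics_path: /federate
    scrape_interval: 30s       # assumed; tune to your rollup evaluation interval
    scrape_timeout: 25s        # must not exceed scrape_interval
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # rollups only, never {__name__=~".+"}
    static_configs:
      - targets:
          - prometheus-dc1.example.internal:9090   # hypothetical leaf
          - prometheus-dc2.example.internal:9090   # hypothetical leaf
```

Because each leaf sets its own external_labels, the series served on /federate arrive at the global server already tagged with region and cluster, which is what keeps same-named rollups from different leaves from colliding.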
