Prometheus External Labels & Multi-Cluster Collision Prompt
Design a coherent `external_labels` and identity scheme across many Prometheus instances so that federation, remote-write, and global query layers never produce colliding series, double-count, or drop the cluster/region dimension.
- Target user: SREs running Prometheus across many clusters or regions
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior observability architect who has untangled global dashboards that double-counted because two clusters shipped identical series.

I will provide:
- How many Prometheus instances I run and how they ship up (federation, remote-write to Thanos/Mimir/Cortex)
- My current `external_labels` per instance and any naming inconsistencies
- Where collisions or double-counting are showing up (global sum, dedup, alerting)

Your job:
1. **Define the identity dimensions**: decide which labels uniquely identify a series across the fleet (cluster, region, replica, environment) and which belong in `external_labels` vs scrape relabeling.
2. **Solve HA dedup vs uniqueness**: explain why the replica label must be droppable by Thanos/Mimir dedup while cluster/region must survive, and how to set this correctly.
3. **Prevent collisions**: show how two clusters with the same job/instance labels collide in a global store, and give the relabel/`external_labels` fix.
4. **Keep cardinality sane**: warn where adding identity labels multiplies series and how to scope them to where they are needed.
5. **Make alerts fleet-aware**: ensure alert routing and templates carry cluster/region so a page is actionable across the global store.
6. **Audit existing config**: review my current `external_labels` for inconsistency, missing dimensions, or labels that should not be global.

Output as: (a) a per-instance `external_labels` scheme table, (b) the relabel/dedup config snippets, (c) a before/after of a colliding query, (d) the single highest-risk collision in my current setup.

Be explicit: changing `external_labels` rewrites series identity and breaks existing recording rules, alerts, and dashboards that match on the old labels, so call out the migration blast radius.
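
For readers who want to see the failure mode this prompt targets before running it, the following is a minimal Python sketch of a toy global store. It is not Prometheus, Thanos, or Mimir code. The helper names (`series_key`, `ship`, `dedup`), the cluster names `a` and `b`, and the region `eu` are all hypothetical. The only behavior it assumes is that a series is identified by its full label set and that a label already present on the series takes precedence over an external label with the same name.

```python
from collections import defaultdict


def series_key(labels):
    """Model series identity as the full label set, metric name included."""
    return frozenset(labels.items())


def ship(store, samples, external_labels):
    """Attach external labels to each series and append its value to the store.

    Assumption: labels already on the series win over external labels.
    """
    for labels, value in samples:
        merged = {**external_labels, **labels}
        store[series_key(merged)].append(value)


def dedup(store, replica_label):
    """Drop the replica label and keep one value per remaining label set."""
    out = {}
    for key, values in store.items():
        stripped = frozenset((k, v) for k, v in key if k != replica_label)
        out.setdefault(stripped, values[0])
    return out


# Every Prometheus scrapes a target that carries the same job/instance labels.
scrape = [({"__name__": "up", "job": "node", "instance": "10.0.0.1:9100"}, 1)]

# Before: two clusters ship with no external labels, so both land on one key.
before = defaultdict(list)
ship(before, scrape, {})  # hypothetical cluster "a"
ship(before, scrape, {})  # hypothetical cluster "b"

# After: each cluster runs an HA pair and every instance sets identity labels.
after = defaultdict(list)
for cluster in ("a", "b"):
    for replica in ("0", "1"):
        ship(after, scrape, {"cluster": cluster, "region": "eu", "replica": replica})

# Dedup removes only the replica dimension; cluster and region survive.
deduped = dedup(after, "replica")

print("distinct series before:", len(before))
print("distinct series after:", len(after))
print("distinct series after dedup:", len(deduped))
print("global sum(up) after dedup:", sum(deduped.values()))
```

In this toy model the unlabeled clusters are indistinguishable in the store, while the labeled fleet keeps one series per cluster once the replica label is dropped. That is the separation items 2 and 3 of the prompt ask the model to design for a real setup.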