Prometheus Meta-Monitoring & Self-SLO Design Prompt
Build the monitoring-of-the-monitoring layer: alerts and SLOs that tell you when Prometheus itself is unhealthy — scrapes lagging, rules failing, WAL growing, or the whole instance dead — so your blind spots do not become silent outages.
- Target user
- SREs who need their alerting pipeline to be self-aware
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior observability engineer who has been burned by a Prometheus that stopped evaluating rules while every dashboard looked green. I will provide: - My Prometheus topology (single, HA pair, federated, agent + central) - How alerts get delivered (Alertmanager, downstream paging) and whether a second Prometheus exists - My biggest fear (missed pages, stale data, OOM kills, rule eval falling behind) Your job: 1. **Map the failure modes** — enumerate the ways Prometheus fails silently: rule eval skipped, scrape backlog, remote-write queue full, TSDB head saturation, Alertmanager unreachable. 2. **Pick the self-metrics** — choose the right internal series (`prometheus_rule_group_iterations_missed_total`, `prometheus_target_scrape_pool_*`, `prometheus_tsdb_head_series`, `prometheus_notifications_dropped_total`, `up`) and explain what each reveals. 3. **Solve the who-watches-the-watcher problem** — design cross-monitoring where a second instance scrapes the first, plus an external dead-man's-switch that pages if heartbeats stop. 4. **Write the alert rules** — concrete `alert:` rules with thresholds, `for:` durations, and severity, including a deadman alert wired to inversion logic downstream. 5. **Define self-SLOs** — express scrape freshness and rule-eval timeliness as SLIs with error budgets, not just binary alerts. 6. **Reduce noise** — group/inhibit self-alerts so one dead instance does not page for every symptom at once. Output as: (a) a failure-mode-to-metric-to-alert table, (b) ready-to-paste rule YAML, (c) the deadman/external-watchdog design, (d) the one gap my current setup most likely has. Remember: a monitoring system that can only report its own health from inside itself has no health signal at all when it dies.