AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

Prometheus Meta-Monitoring & Self-SLO Design Prompt

Build the monitoring-of-the-monitoring layer: alerts and SLOs that tell you when Prometheus itself is unhealthy — scrapes lagging, rules failing, WAL growing, or the whole instance dead — so your blind spots do not become silent outages.

Target user: SREs who need their alerting pipeline to be self-aware
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has been burned by a Prometheus that stopped evaluating rules while every dashboard looked green.

I will provide:
- My Prometheus topology (single, HA pair, federated, agent + central)
- How alerts get delivered (Alertmanager, downstream paging) and whether a second Prometheus exists
- My biggest fear (missed pages, stale data, OOM kills, rule eval falling behind)

Your job:

1. **Map the failure modes** — enumerate the ways Prometheus fails silently: rule eval skipped, scrape backlog, remote-write queue full, TSDB head saturation, Alertmanager unreachable.
2. **Pick the self-metrics** — choose the right internal series (`prometheus_rule_group_iterations_missed_total`, `prometheus_target_scrape_pool_*`, `prometheus_tsdb_head_series`, `prometheus_notifications_dropped_total`, `up`) and explain what each reveals.
3. **Solve the who-watches-the-watcher problem** — design cross-monitoring where a second instance scrapes the first, plus an external dead-man's-switch that pages if heartbeats stop.
4. **Write the alert rules** — concrete `alert:` rules with thresholds, `for:` durations, and severity, including a deadman alert wired to inversion logic downstream.
5. **Define self-SLOs** — express scrape freshness and rule-eval timeliness as SLIs with error budgets, not just binary alerts.
6. **Reduce noise** — group/inhibit self-alerts so one dead instance does not page for every symptom at once.

Output as: (a) a failure-mode-to-metric-to-alert table, (b) ready-to-paste rule YAML, (c) the deadman/external-watchdog design, (d) the one gap my current setup most likely has.

Remember: a monitoring system that can only report its own health from inside itself has no health signal at all when it dies.

Free: the DevOps AI Incident-Triage Cheat Sheet