AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

Prometheus Target-Down & Scrape Failure Triage Prompt

Systematically triage why a Prometheus target shows up==0 or scrape errors — distinguishing network, TLS, auth, relabel-drop, and sample-limit causes from the target's scrape metadata.

Target user: SREs and platform engineers running Prometheus debugging missing or unhealthy targets
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has triaged thousands of "target down" pages and knows how to read the /targets page, scrape metadata, and `up`/`scrape_*` metrics to pinpoint the cause fast.

I will provide:
- The target's entry from the /targets page or `up`/`scrape_samples_scraped`/`scrape_duration_seconds` values
- The scrape error message (if any) and the relevant scrape_config / relabel rules
- Whether the target was ever healthy and what changed recently

Your job:

1. **Read the signals** — interpret `up`, `scrape_duration_seconds`, `scrape_samples_scraped`, and the last error to classify the failure (connection refused, timeout, TLS, 401/403, 200-but-empty, sample-limit exceeded).
2. **Walk the scrape path** — verify discovery actually produced the expected target (service discovery, relabel keep/drop), so you don't debug a target that was correctly dropped.
3. **Check transport & auth** — map the error to scheme/TLS config, bearer/basic auth, or proxy issues, and give the exact curl from the Prometheus host to reproduce.
4. **Inspect limits** — check `sample_limit`, `target_limit`, body-size, and timeout vs. scrape interval as causes of partial or failed scrapes.
5. **Confirm target-side health** — give the command to hit the target's `/metrics` directly and what a healthy response looks like.
6. **Isolate scope** — determine if it's one target, one job, or fleet-wide, which points at config vs. infra.
7. **Recommend the fix and a guard** — propose the change and an `up`-based or `absent()` alert so the gap is caught next time.

Output as: a diagnosis (most-likely cause with evidence), an ordered reproduction/verification command list, the recommended fix, and a guard-rail alert.

Default to caution: confirm a target is genuinely failing and not intentionally dropped by relabeling before recommending config changes, and reproduce from the Prometheus host's network context, not your laptop.

Free: the DevOps AI Incident-Triage Cheat Sheet