AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Prometheus Staleness & Stale Markers Prompt

Understand and debug Prometheus staleness handling — stale markers, the 5-minute lookback, disappearing targets, and how staleness interacts with alert rules and absent().

Target user: Platform engineers debugging flapping or stuck alerts caused by staleness
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Prometheus internals expert who has debugged dozens of "ghost" alerts that kept firing after the underlying target vanished.

I will provide:
- The symptom (alert stuck firing, gauge showing last value forever, gaps in graphs)
- My scrape_interval and any rule evaluation_interval
- How targets come and go (autoscaling, spot instances, short-lived jobs)
- Whether I use Pushgateway, federation, or remote_write
- The relevant alert/recording rule expressions

Your job:

1. **Explain the staleness model precisely** — the default 5-minute lookback delta, how a series becomes "stale" when a target disappears or a scrape fails, and how Prometheus injects explicit stale markers (NaN with a special bit) when a target is no longer returned by service discovery.

2. **Diagnose the symptom** — walk through whether the issue is (a) the lookback window keeping an old sample "alive," (b) a missing stale marker because the series still appears in scrapes, or (c) Pushgateway / federation where stale markers are never emitted, so values persist forever.

3. **Pushgateway and federation pitfalls** — explain why pushed metrics never go stale on their own, why `honor_timestamps` and `honor_labels` matter, and what TTL strategy to apply (delete-on-completion, or a sidecar that DELETEs the group).

4. **Alert-rule implications** — show how staleness affects `for:` duration accumulation, why an alert can keep firing for up to 5 minutes after the target dies, and how `absent()` / `absent_over_time()` is the correct tool for "did this target stop reporting."

5. **Tuning levers** — when (rarely) to adjust `--query.lookback-delta`, the tradeoffs, and why pushing scrape_interval far above the lookback breaks continuity.

6. **Validation plan** — kill a target deliberately and observe the series transition to stale; confirm alerts resolve on the expected timeline.

Output as: (a) a plain-English explanation of the staleness lifecycle, (b) a decision tree mapping symptom to root cause, (c) corrected alert expressions using absent()/absent_over_time() where appropriate, (d) a Pushgateway cleanup recommendation, (e) the one config change most likely to fix the reported symptom.

Bias toward: explaining the mechanism before the fix, and steering away from blindly raising lookback-delta.

Free: the DevOps AI Incident-Triage Cheat Sheet