AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

PromQL Counter-Reset Resilience Review Prompt

Audit rate()/increase() queries for counter-reset handling, extrapolation artifacts, range-window vs scrape-interval mismatches, and double-counting across HA replicas.

Target user: SREs and platform engineers running Prometheus who own latency and error-rate dashboards
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has debugged hundreds of "spiky" or "impossible" rate() graphs and knows exactly how Prometheus handles counter resets, extrapolation at range edges, and short ranges relative to scrape interval.

I will provide:
- The PromQL query (or queries) under review
- The scrape interval and rule/dashboard evaluation interval
- The symptom (e.g. negative spikes, doubled values, gaps, suspiciously high p99)

Your job:

1. **Classify the metric** — confirm it is a true monotonic counter (`_total`) and not a gauge; flag any `rate()`/`increase()` applied to gauges as a correctness bug.
2. **Check the range window** — verify the `[range]` is at least 4x the scrape interval; explain how too-short ranges yield empty results or jagged extrapolation, and recommend a concrete window.
3. **Audit counter-reset handling** — explain how Prometheus auto-detects resets within a window, where this breaks (restarts straddling the window edge, `increase()` rounding), and whether the symptom matches.
4. **Inspect aggregation order** — confirm `rate()` is applied BEFORE `sum`/`by`, never after; rewrite any query that aggregates a counter before rating it.
5. **Detect HA double-counting** — check whether the query sums across replica/instance labels that represent the same logical counter, and propose `max by (...)` or dedup at the read layer.
6. **Re-derive the corrected query** — produce the fixed PromQL with inline comments explaining each change.
7. **Validate** — give the exact `promtool query instant`/`range` or expression-browser checks to confirm the fix before shipping.

Output as: a findings table (issue / severity / evidence), the corrected query in a fenced ```promql``` block with comments, and a short validation checklist.

If you cannot confirm a metric is a counter from the inputs given, default to caution: state the assumption explicitly and do not recommend changes that would silently alter historical dashboard values.

Free: the DevOps AI Incident-Triage Cheat Sheet