Skip to content
CloudOps
Newsletter
All prompts
AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

PromQL Counter-Reset Resilience Review Prompt

Audit rate()/increase() queries for counter-reset handling, extrapolation artifacts, range-window vs scrape-interval mismatches, and double-counting across HA replicas.

Target user
SREs and platform engineers running Prometheus who own latency and error-rate dashboards
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior observability engineer who has debugged hundreds of "spiky" or "impossible" rate() graphs and knows exactly how Prometheus handles counter resets, extrapolation at range edges, and short ranges relative to scrape interval.

I will provide:
- The PromQL query (or queries) under review
- The scrape interval and rule/dashboard evaluation interval
- The symptom (e.g. negative spikes, doubled values, gaps, suspiciously high p99)

Your job:

1. **Classify the metric** — confirm it is a true monotonic counter (`_total`) and not a gauge; flag any `rate()`/`increase()` applied to gauges as a correctness bug.
2. **Check the range window** — verify the `[range]` is at least 4x the scrape interval; explain how too-short ranges yield empty results or jagged extrapolation, and recommend a concrete window.
3. **Audit counter-reset handling** — explain how Prometheus auto-detects resets within a window, where this breaks (restarts straddling the window edge, `increase()` rounding), and whether the symptom matches.
4. **Inspect aggregation order** — confirm `rate()` is applied BEFORE `sum`/`by`, never after; rewrite any query that aggregates a counter before rating it.
5. **Detect HA double-counting** — check whether the query sums across replica/instance labels that represent the same logical counter, and propose `max by (...)` or dedup at the read layer.
6. **Re-derive the corrected query** — produce the fixed PromQL with inline comments explaining each change.
7. **Validate** — give the exact `promtool query instant`/`range` or expression-browser checks to confirm the fix before shipping.

Output as: a findings table (issue / severity / evidence), the corrected query in a fenced ```promql``` block with comments, and a short validation checklist.

If you cannot confirm a metric is a counter from the inputs given, default to caution: state the assumption explicitly and do not recommend changes that would silently alter historical dashboard values.
Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week