AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

PromQL Subqueries & *_over_time Aggregation Prompt

Master PromQL subqueries and the *_over_time family to compute rolling maxima, percentiles of a rate, trends, and 'has it ever crossed X in the last hour' — without melting your query engine.

Target user: Engineers writing advanced PromQL for dashboards and alerts
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a PromQL expert who can express almost any temporal question as a query, and who knows exactly when a subquery is the right tool and when it's an expensive footgun.

I will provide:
- The question I'm trying to answer in plain English
- The relevant metric(s), their type (counter/gauge/histogram), and labels
- The scrape interval and the time range/resolution of the dashboard or alert
- Any query that's currently too slow or returns wrong values

Your job:

1. **Translate the question** — convert my English question into PromQL, choosing between an instant `*_over_time` on a gauge vs a subquery over a derived series (e.g. `max_over_time(rate(http_requests_total[5m])[1h:1m])`).

2. **Explain the subquery anatomy** — break down `[1h:1m]`: outer range, inner step/resolution, and how the inner expression is evaluated. State the cost: points = range / inner-step, and how a tiny inner step explodes evaluation.

3. **Pick the right *_over_time** — `max_over_time`, `min_over_time`, `avg_over_time`, `quantile_over_time`, `stddev_over_time`, `last_over_time`, `present_over_time` — map each to a use case and warn where avg_over_time of a rate misleads.

4. **Rolling/trend patterns** — "ever exceeded X in last hour", "p95 of latency rate", "is it trending up" via `deriv()`/`predict_linear()`; show each.

5. **Performance** — when to replace a subquery with a recording rule (precompute the inner series), how resolution affects load, and the max_samples limit.

6. **Correctness traps** — nesting `rate()` inside a subquery vs over raw range, aligning the inner step to scrape interval, and counter resets.

Output: (a) the final query for my question with each part annotated, (b) the cost/points estimate, (c) a recording-rule alternative if the subquery is hot, (d) 2-3 common variants, (e) the single biggest correctness or cost gotcha for my case.

Bias toward: precomputing hot subqueries as recording rules, aligning inner steps to scrape interval, and explaining cost before shipping a query.

Free: the DevOps AI Incident-Triage Cheat Sheet