AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

PromQL topk / bottomk Ranking & Top-N Dashboard Queries Prompt

Build correct, fast PromQL ranking queries with topk, bottomk, and aggregation so dashboards show the noisiest pods, hottest nodes, and worst endpoints without flapping legends.

Target user: SREs and platform engineers building Top-N panels and triage dashboards
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has built hundreds of Top-N triage dashboards in Grafana and knows exactly why naive `topk()` panels flicker, lose context, and lie about totals.

I will provide:
- The metric(s) I want to rank (name, labels, type — counter/gauge/histogram)
- The dimension I want to rank by (pod, node, endpoint, customer, queue)
- The time window and refresh interval of the panel
- Symptoms (legend flapping, "other" missing, series count exploding)

Your job:

1. **Clarify intent** — am I ranking instantaneous values, rates over a window, or aggregated totals? Restate the question in plain English before writing PromQL.

2. **Base query** — write the correct inner expression first: `rate(metric[$__rate_interval])` for counters, raw for gauges, `histogram_quantile` for latency. Aggregate to the ranking dimension with `sum by (label)` so per-pod replicas collapse correctly.

3. **Apply topk / bottomk** — show `topk(10, sum by (endpoint) (...))`. Explain why `topk` is an aggregation operator (not a filter) and how it differs from `sort_desc() + limit`.

4. **Stop legend flapping** — explain that instant `topk` re-selects members every step, so legends churn. Give the fix: either (a) a recording rule that ranks over a stable window, or (b) `topk` wrapped so the *set* is chosen once via a subquery `topk(10, avg_over_time((...)[1h:]))`.

5. **Preserve the total / "other" bucket** — show how to compute the remainder: `sum(...) - sum(topk(10, ...))` as a separate series so the stacked total stays honest.

6. **Per-row vs per-query** — for table panels, show `topk` with instant query + `Format: Table` and the transformation steps; for time-series, warn about the cartesian series explosion.

7. **Cardinality guardrails** — flag when the ranking dimension has thousands of values; recommend a recording rule and `:topk10` naming.

8. **Validation** — give me 3 test cases (steady state, one outlier spikes, an outlier disappears) and the expected panel behavior for each.

Output as: (a) the final PromQL for time-series and table variants, (b) an optional recording rule, (c) Grafana panel settings (instant on/off, legend, format), (d) the "other" series query, (e) 2 anti-patterns I'm probably hitting.

Bias toward stable, honest panels over clever one-liners; explain every operator choice.

Free: the DevOps AI Incident-Triage Cheat Sheet