Prometheus Query Log Slow-Query Audit Prompt

Enable and analyze the Prometheus query log (query_log_file) and the active query tracker to find the expensive PromQL queries that strain the server, then rewrite or offload them.

Target user: Platform engineer hunting the queries responsible for high Prometheus CPU, memory, or evaluation lag
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who routinely audits Prometheus query logs to find and tame the handful of queries that consume most of a server's resources.

I will provide:
- My current config (whether query_log_file is set, the active-query-tracker location)
- A sample of query log entries (JSON lines with params.query, ts, and stats if available) or active-query-tracker output
- Symptoms (high CPU, OOMs, rule evaluation lag, slow dashboards)
- Where the queries originate (Grafana dashboards, recording rules, ad-hoc users, federation)

Your job:

1. **Turn on the right logging**: show how to set `query_log_file` in the global config and where the active query tracker lives, explaining the difference between the persistent query log (completed queries) and the active-query file (in-flight queries, which Prometheus reports at the next startup after a crash).

2. **Rank the offenders**: from the provided log, identify the costliest queries by series touched, duration, and frequency, and explain the signals (`stats` timings, wide regex matchers, very long range selectors, high-cardinality aggregations without `by`).

3. **Attribute the source**: separate dashboard-driven, recording-rule, and federation queries, and explain why a cheap query run every 5s by a kiosk dashboard can outweigh a single heavy ad-hoc query.

4. **Rewrite or offload**: for each top offender, give a concrete fix such as tighter label matchers, pre-aggregation via recording rules, a coarser step/resolution, or `--query.max-samples`/`--query.timeout` guardrails.

5. **Add guardrails**: recommend server limits (`--query.max-samples`, `--query.max-concurrency`, `--query.timeout`) and an alert on `prometheus_engine_query_duration_seconds` quantiles.

Output as: (a) the config snippet enabling query logging, (b) a ranked table of the top offending queries with cause and fix, (c) the recording rules that should absorb the heaviest repeated queries, (d) the recommended query guardrail flags.

Do not leave query_log_file enabled indefinitely on a busy server without log rotation; at high query volume it can fill the disk.
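
Before pasting a log sample into the prompt, it can help to pre-rank it locally so the sample contains the queries that matter. The script below is a minimal sketch, not part of Prometheus. It assumes each log line is a JSON object with the query text under `params.query` and timings under `stats.timings.evalTotalTime`, and that rule-driven entries carry a `ruleGroup` key while API-driven ones carry `httpRequest`; check these field names against your own log. The function name `rank_queries` and the sample entries are hypothetical.

```python
import json
from collections import defaultdict


def rank_queries(lines, top=10):
    """Group query-log lines by query text and rank by total evaluation time."""
    totals = defaultdict(
        lambda: {"count": 0, "total_s": 0.0, "max_s": 0.0, "source": "unknown"}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(entry, dict):
            continue
        # Assumed layout: query text lives under params.query.
        query = entry.get("params", {}).get("query")
        if query is None:
            continue
        # Assumed layout: per-query timing lives under stats.timings.
        secs = entry.get("stats", {}).get("timings", {}).get("evalTotalTime", 0.0)
        agg = totals[query]
        agg["count"] += 1
        agg["total_s"] += secs
        agg["max_s"] = max(agg["max_s"], secs)
        if "ruleGroup" in entry:
            agg["source"] = "rule"
        elif "httpRequest" in entry:
            agg["source"] = "api"
    ranked = sorted(totals.items(), key=lambda kv: kv[1]["total_s"], reverse=True)
    return ranked[:top]


def _entry(query, secs, **extra):
    """Build one hypothetical log line for the demo below."""
    record = {"params": {"query": query},
              "stats": {"timings": {"evalTotalTime": secs}}}
    record.update(extra)
    return json.dumps(record)


# Hypothetical sample: a cheap dashboard query refreshed every 5s for an hour,
# one heavy ad-hoc query, and one recording-rule evaluation.
SAMPLE_LOG = (
    [_entry("up == 0", 0.02, httpRequest={"path": "/api/v1/query_range"})] * 720
    + [_entry("sum(rate(http_requests_total[30d]))", 6.0,
              httpRequest={"path": "/api/v1/query"})]
    + [_entry("sum by (job) (rate(http_requests_total[5m]))", 0.5,
              ruleGroup={"file": "rules.yml", "name": "example"})]
)

for query, agg in rank_queries(SAMPLE_LOG):
    print(f"{agg['total_s']:8.2f}s  n={agg['count']:<5} {agg['source']:<5} {query}")
```

Ranking by total time rather than worst single run is deliberate: it surfaces the frequently refreshed dashboard query that step 3 of the prompt asks about, which a sort on `max_s` alone would bury.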
