Prometheus Query API Read-Path Protection Prompt
Protect the Prometheus query API from runaway, expensive, or hostile queries using sample/time limits, query logging, timeouts, and a fronting proxy so one bad dashboard or ad-hoc query cannot OOM or stall the whole instance.
- Target user: SREs whose Prometheus is shared by many query clients
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who has watched a single unbounded query take down a Prometheus serving fifty teams.

I will provide:
- My Prometheus version and who queries it (Grafana, ad-hoc users, automation, federation)
- Symptoms (OOM, slow queries, high CPU, query timeouts) and any flags currently set
- Whether a proxy or query frontend sits in front of it

Your job:
1. **Set the guardrail flags**: explain `--query.max-samples`, `--query.timeout`, `--query.max-concurrency`, and `--query.lookback-delta`, and recommend values for my load.
2. **Find the offenders**: read the active query log and enable the query log (the `query_log_file` setting in the `global` section of the configuration file, not a command-line flag) to identify the heaviest queries by samples and duration.
3. **Front the read path**: design a proxy layer (or query frontend) that enforces the per-tenant limits, time-range caps, and result caching that the core binary cannot provide alone.
4. **Tame the clients**: fix the dashboard and automation patterns that cause expensive scans: huge ranges, tiny steps, regex-heavy matchers, unbounded subqueries.
5. **Isolate ad-hoc from critical**: separate the alerting/recording read path from human exploration so that exploration cannot starve rule evaluation.
6. **Verify the protection**: propose a load test that fires a known-expensive query and confirms it is rejected or bounded, not fatal.

Output as: (a) a recommended flag set with values and rationale, (b) the proxy/frontend design, (c) the top query anti-patterns to fix, (d) the single change with the largest protective payoff.

Caution: aggressive limits can reject legitimate large queries. Tune thresholds against real query shapes rather than guessing, and communicate the caps to query owners.
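
To make the time-range cap from step 3 concrete, here is a minimal sketch of the check a fronting proxy might run before forwarding a `/api/v1/query_range` request. It is an illustration under stated assumptions, not a reference implementation: `check_range_query`, `Verdict`, and the 7-day `MAX_RANGE_SECONDS` cap are hypothetical names and values, and `MAX_POINTS` is chosen to roughly mirror the 11,000-point per-series resolution limit that Prometheus applies to range queries.

```python
from dataclasses import dataclass

# Hypothetical policy values; tune them against real query shapes.
MAX_RANGE_SECONDS = 7 * 24 * 3600  # assumed cap: at most 7 days per query
MAX_POINTS = 11_000                # assumed cap on evaluation steps per query


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


def check_range_query(start: float, end: float, step: float) -> Verdict:
    """Decide whether a range query fits the proxy policy.

    start and end are Unix timestamps in seconds; step is in seconds.
    """
    if step <= 0:
        return Verdict(False, "step must be positive")
    if end < start:
        return Verdict(False, "end is before start")

    span = end - start
    if span > MAX_RANGE_SECONDS:
        return Verdict(False, f"range {span:.0f}s exceeds cap {MAX_RANGE_SECONDS}s")

    # Number of evaluation timestamps between start and end, inclusive.
    points = int(span // step) + 1
    if points > MAX_POINTS:
        return Verdict(False, f"{points} points exceeds cap {MAX_POINTS}")

    return Verdict(True, "ok")
```

A one-hour dashboard panel at a 15-second step passes this check, while a 30-day range or a one-day range at a 1-second step is rejected before it reaches Prometheus. A real proxy would also need to parse the request parameters and apply per-tenant policy, both of which this sketch leaves out.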