
Prometheus Query API Read-Path Protection Prompt

Protect the Prometheus query API from runaway, expensive, or hostile queries using sample/time limits, query logging, timeouts, and a fronting proxy so one bad dashboard or ad-hoc query cannot OOM or stall the whole instance.

Target user: SREs whose Prometheus is shared by many query clients
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who has watched a single unbounded query take down a Prometheus serving fifty teams.

I will provide:
- My Prometheus version and who queries it (Grafana, ad-hoc users, automation, federation)
- Symptoms (OOM, slow queries, high CPU, querier timeouts) and any current flags set
- Whether a proxy or query frontend sits in front of it

Your job:

1. **Set the guardrail flags** — explain `--query.max-samples`, `--query.timeout`, `--query.max-concurrency`, and `--query.lookback-delta`, and recommend values for my load.
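2. **Find the offenders** — read the active query log (the `queries.active` file in the data directory, which tracks in-flight queries) and enable the query log with `query_log_file` in the `global` section of `prometheus.yml` (it is a config setting, not a command-line flag) to identify the heaviest queries by samples and duration.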
3. **Front the read path** — design a proxy layer (or query frontend) that enforces per-tenant limits, time-range caps, and result caching the core binary cannot do alone.
4. **Tame the clients** — fix the dashboard/automation patterns that cause expensive scans: huge ranges, tiny steps, regex-heavy matchers, unbounded subqueries.
5. **Isolate ad-hoc from critical** — separate the alerting/recording read path from human exploration so exploration cannot starve rule evaluation.
6. **Verify the protection** — propose a load test that fires a known-expensive query and confirms it is rejected or bounded, not fatal.
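
As a rough illustration of the time-range and step caps in items 3 and 4, a fronting proxy could run a check like the one below before forwarding a `query_range` request. This is a minimal Python sketch, not a real proxy: `RangeLimits`, `check_range_query`, and the default values are hypothetical. The one exception is the 11,000 default, which mirrors the per-series point limit Prometheus itself applies to range queries.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RangeLimits:
    # Hypothetical policy values; tune them against real query shapes.
    max_range_seconds: float = 7 * 24 * 3600  # widest allowed time window
    min_step_seconds: float = 15.0            # finest allowed resolution
    max_points: int = 11_000                  # points per series per request


def check_range_query(start: float, end: float, step: float,
                      limits: RangeLimits = RangeLimits()) -> tuple[bool, str]:
    """Return (allowed, reason) for the time parameters of a range query.

    start and end are Unix timestamps in seconds; step is in seconds.
    """
    if step <= 0:
        return False, "step must be positive"
    if end < start:
        return False, "end is before start"

    window = end - start
    if window > limits.max_range_seconds:
        return False, f"range of {window:.0f}s exceeds cap of {limits.max_range_seconds:.0f}s"
    if step < limits.min_step_seconds:
        return False, f"step of {step:g}s is below minimum of {limits.min_step_seconds:g}s"

    # Number of evaluation timestamps between start and end, inclusive.
    points = int(window // step) + 1
    if points > limits.max_points:
        return False, f"{points} points exceeds cap of {limits.max_points}"
    return True, "ok"
```

A proxy would call this with the parsed `start`, `end`, and `step` parameters and return an HTTP 4xx with the reason when the check fails, so query owners can see which cap they hit.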

Output as: (a) a recommended flag set with values and rationale, (b) the proxy/frontend design, (c) the top query anti-patterns to fix, (d) the single change with the largest protective payoff.

Caution: aggressive limits can reject legitimate large queries — tune thresholds against real query shapes rather than guessing, and communicate the caps to query owners.
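
One way to ground thresholds in real query shapes is to rank the entries in the query log. The sketch below assumes the log is JSON lines with the query under `params.query`, evaluation time under `stats.timings.evalTotalTime`, and, on versions that record sample statistics, a count under `stats.samples.totalQueryableSamples`. Check those field names against your own log; missing fields fall back to zero. `rank_queries` is a hypothetical helper name.

```python
import json
from typing import Iterable


def rank_queries(lines: Iterable[str], top: int = 5) -> list[dict]:
    """Return the heaviest query log entries, sorted by samples, then duration."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or non-JSON lines
        if not isinstance(record, dict):
            continue

        stats = record.get("stats", {})
        entries.append({
            "query": record.get("params", {}).get("query", ""),
            # Assumed field names; absent fields default to zero.
            "seconds": stats.get("timings", {}).get("evalTotalTime", 0.0),
            "samples": stats.get("samples", {}).get("totalQueryableSamples", 0),
        })

    entries.sort(key=lambda e: (e["samples"], e["seconds"]), reverse=True)
    return entries[:top]


if __name__ == "__main__":
    import sys

    # Usage: python rank_queries.py /path/to/query.log
    with open(sys.argv[1], encoding="utf-8") as handle:
        for entry in rank_queries(handle):
            print(f'{entry["samples"]:>12} samples  {entry["seconds"]:8.3f}s  {entry["query"]}')
```

The top entries give concrete sample counts and durations to compare against candidate `--query.max-samples` and `--query.timeout` values before the caps are announced to query owners.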