Prometheus Scrape & Evaluation Interval Tuning Prompt
Choose scrape_interval and evaluation_interval values that balance alert latency, query resolution, storage cost, and scrape-target load without breaking rate() math.
- Target user: Platform engineers tuning Prometheus timing and resource cost
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a Prometheus capacity expert who tunes scrape and evaluation timing for fleets ranging from dozens to tens of thousands of targets.

I will provide:
- My current global and per-job scrape_interval / scrape_timeout values
- evaluation_interval and the alert latency I need
- Approximate target count and active series count
- Storage budget and retention
- Any rate() windows I rely on in dashboards and alerts

Your job:
1. **The rate() coupling rule**: explain why range windows should be at least 4x the scrape_interval (so `rate()` still has enough points after a missed scrape), and walk through what breaks (empty results, jagged graphs, missed alerts) when someone widens the interval or shrinks the `[window]` without adjusting the other.
2. **Per-job differentiation**: recommend tiered intervals: tight (10-15s) for latency-critical request metrics, relaxed (60s+) for slow-moving infra/SNMP/blackbox targets. Show the per-job `scrape_interval` overrides and explain why a single global value is usually wrong.
3. **scrape_timeout discipline**: keep timeout < interval, and explain the failure mode when a slow exporter's scrape time exceeds the timeout (failed scrapes, `up` dropping to 0, gaps, staleness).
4. **Evaluation interval and alert latency**: relate `evaluation_interval` + `for:` duration to worst-case time-to-page, and show how to estimate alert latency from these knobs.
5. **Cost model**: give a back-of-envelope for how halving scrape_interval roughly doubles samples/sec, ingestion CPU, and on-disk bytes; tie active series and samples/sec (active series ÷ scrape_interval) to memory and disk so the user can price a change.
6. **Migration safety**: when changing intervals, explain what happens to existing rate() queries and recording rules, and how to roll out without graph discontinuities.

Output as:
(a) a recommended interval table by job class,
(b) the rate()-window minimums for each,
(c) per-job YAML overrides,
(d) a cost-delta estimate for the proposed change,
(e) the single riskiest timing mistake in my current config.
Bias toward: per-job tiering over one global knob, protecting rate() correctness, and pricing every interval change.
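
To sanity-check the model's answer, the arithmetic the prompt asks for can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not a sizing tool: the function names are hypothetical, the 4x factor is the prompt's own rule of thumb, the latency bound covers only a simple threshold alert and ignores Alertmanager grouping and notification delay, and the default of 2 bytes per sample is an assumed upper-end compression figure that should be replaced with a measured value from your own TSDB.

```python
import math


def min_rate_window_s(scrape_interval_s: int, factor: int = 4) -> int:
    """Smallest rate() window under the prompt's 4x rule of thumb."""
    return factor * scrape_interval_s


def worst_case_alert_latency_s(scrape_interval_s: int,
                               evaluation_interval_s: int,
                               for_s: int) -> int:
    """Rough upper bound, in seconds, from condition onset to the alert firing.

    Assumptions (simplifications, not Prometheus guarantees):
    - the condition starts just after a scrape, so it is first observed up to
      one scrape_interval later;
    - up to one evaluation_interval then passes before a rule evaluation sees it;
    - `for:` can only be satisfied on an evaluation tick, so it is rounded up
      to a whole number of ticks;
    - Alertmanager group_wait and notification delivery are NOT included.
    """
    for_ticks = math.ceil(for_s / evaluation_interval_s)
    return scrape_interval_s + evaluation_interval_s + for_ticks * evaluation_interval_s


def samples_per_second(active_series: int, scrape_interval_s: int) -> float:
    """Ingestion rate if every active series is scraped at the same interval."""
    return active_series / scrape_interval_s


def disk_bytes(active_series: int,
               scrape_interval_s: int,
               retention_days: int,
               bytes_per_sample: float = 2.0) -> float:
    """Back-of-envelope disk need: retention seconds x samples/sec x bytes/sample.

    bytes_per_sample=2.0 is an assumed figure; measure your own before pricing.
    """
    retention_s = retention_days * 86_400
    return samples_per_second(active_series, scrape_interval_s) * retention_s * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical fleet: 1,000,000 active series, 15 days of retention.
    for interval in (30, 15):
        gb = disk_bytes(1_000_000, interval, 15) / 1e9
        print(f"scrape_interval={interval}s "
              f"min rate() window={min_rate_window_s(interval)}s "
              f"disk ~= {gb:.1f} GB")
    print("worst-case latency:", worst_case_alert_latency_s(15, 15, 60), "s")
```

With these hypothetical inputs, halving the interval from 30s to 15s doubles the disk estimate (roughly 86 GB to roughly 173 GB) and halves the minimum safe `rate()` window from 120s to 60s, which is the trade-off the prompt asks the model to price.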