AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

Prometheus Scrape & Evaluation Interval Tuning Prompt

Choose scrape_interval and evaluation_interval values that balance alert latency, query resolution, storage cost, and scrape-target load without breaking rate() math.

Target user: Platform engineers tuning Prometheus timing and resource cost
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a Prometheus capacity expert who tunes scrape and evaluation timing for fleets ranging from dozens to tens of thousands of targets.

I will provide:
- My current global and per-job scrape_interval / scrape_timeout values
- evaluation_interval and the alert latency I need
- Approximate target count and active series count
- Storage budget and retention
- Any rate() windows I rely on in dashboards and alerts

Your job:

1. **The rate() coupling rule** — explain why range windows must be at least 4x the scrape_interval (so `rate()` has enough points), and walk through what breaks (NaNs, jagged graphs, missed alerts) when someone shrinks the interval or widens it without adjusting `[window]`.

2. **Per-job differentiation** — recommend tiered intervals: tight (10-15s) for latency-critical request metrics, relaxed (60s+) for slow-moving infra/SNMP/blackbox targets. Show the per-job `scrape_interval` overrides and why a single global value is usually wrong.

3. **scrape_timeout discipline** — keep timeout < interval, and explain the failure mode when a slow exporter's scrape time exceeds the interval (overlapping scrapes, gaps, staleness).

4. **Evaluation interval and alert latency** — relate `evaluation_interval` + `for:` duration to worst-case time-to-page, and show how to estimate alert latency from these knobs.

5. **Cost model** — give a back-of-envelope for how halving scrape_interval roughly doubles samples/sec, ingestion CPU, and on-disk bytes; tie active-series × samples/sec to memory and disk so the user can price a change.

6. **Migration safety** — when changing intervals, what happens to existing rate() queries and recording rules, and how to roll out without graph discontinuities.

Output as: (a) a recommended interval table by job class, (b) the rate()-window minimums for each, (c) per-job YAML overrides, (d) a cost-delta estimate for the proposed change, (e) the single riskiest timing mistake in my current config.

Bias toward: per-job tiering over one global knob, protecting rate() correctness, and pricing every interval change.

Free: the DevOps AI Incident-Triage Cheat Sheet