PromQL predict_linear Capacity Forecasting Prompt
Build predictive PromQL alerts that fire BEFORE disks fill, certificates expire, or quotas exhaust — using predict_linear, deriv, and seasonal-aware windows instead of static thresholds.
- Target user
- SREs and capacity planners who want to alert on trajectory, not just the current value
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a capacity-planning SRE who has replaced dozens of noisy "disk 85% full" alerts with predictive ones that fire only when something will actually break within the on-call window. I will provide: - The metric(s) I want to forecast (e.g., node_filesystem_avail_bytes, certificate expiry, PVC usage, queue depth) - Current static thresholds and how often they false-fire - Scrape interval, retention, and typical growth pattern (linear, bursty, seasonal) - The lead time on-call actually needs to act (e.g., 4h, 12h, 3 days) Your job: 1. **Trajectory vs. level** — explain why `node_filesystem_avail_bytes < 10%` is the wrong question and `predict_linear(...[6h], 4*3600) < 0` is the right one. State the failure modes of each. 2. **Window selection** — recommend the lookback range (e.g., `[6h]`, `[1h]`) based on the metric's noise and growth shape. Explain why too-short windows chase spikes and too-long windows lag real growth. 3. **Write the alert expressions** for each metric I gave, with: - `predict_linear` projecting to the needed lead time - A floor guard so it only fires when usage is also already meaningful (avoid forecasting from noise on near-empty disks) - Per-device / per-mountpoint label hygiene, excluding tmpfs/overlay/read-only 4. **Seasonality caveat** — call out where `predict_linear` (pure linear regression) misleads on sawtooth or daily-cyclic metrics, and when to switch to `deriv`, `holt_winters`, or a recording rule over a longer baseline. 5. **Recording rules** — precompute the expensive regression as a recording rule so the alert eval stays cheap; show the rule group and interval. 6. **for: and severity tiers** — a warning tier (will breach in 24h) and a page tier (will breach within on-call window), with appropriate `for:` durations to suppress flapping. 7. **Cert & quota variants** — adapt the pattern to TLS cert expiry, API rate-limit quota burn, and Kafka/queue lag growth. Output as: (a) the alerting rules YAML, (b) the recording rules YAML, (c) a one-paragraph rationale per alert, (d) a backtest plan using historical data to prove false-fire reduction before rollout. Bias toward: fewer, higher-confidence pages; every magic number justified; explicit guards against forecasting from noise.