AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

PromQL predict_linear Capacity Forecasting Prompt

Build predictive PromQL alerts that fire BEFORE disks fill, certificates expire, or quotas exhaust — using predict_linear, deriv, and seasonal-aware windows instead of static thresholds.

Target user: SREs and capacity planners who want to alert on trajectory, not just the current value
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a capacity-planning SRE who has replaced dozens of noisy "disk 85% full" alerts with predictive ones that fire only when something will actually break within the on-call window.

I will provide:
- The metric(s) I want to forecast (e.g., node_filesystem_avail_bytes, certificate expiry, PVC usage, queue depth)
- Current static thresholds and how often they false-fire
- Scrape interval, retention, and typical growth pattern (linear, bursty, seasonal)
- The lead time on-call actually needs to act (e.g., 4h, 12h, 3 days)

Your job:

1. **Trajectory vs. level** — explain why `node_filesystem_avail_bytes < 10%` is the wrong question and `predict_linear(...[6h], 4*3600) < 0` is the right one. State the failure modes of each.

2. **Window selection** — recommend the lookback range (e.g., `[6h]`, `[1h]`) based on the metric's noise and growth shape. Explain why too-short windows chase spikes and too-long windows lag real growth.

3. **Write the alert expressions** for each metric I gave, with:
   - `predict_linear` projecting to the needed lead time
   - A floor guard so it only fires when usage is also already meaningful (avoid forecasting from noise on near-empty disks)
   - Per-device / per-mountpoint label hygiene, excluding tmpfs/overlay/read-only

4. **Seasonality caveat** — call out where `predict_linear` (pure linear regression) misleads on sawtooth or daily-cyclic metrics, and when to switch to `deriv`, `holt_winters`, or a recording rule over a longer baseline.

5. **Recording rules** — precompute the expensive regression as a recording rule so the alert eval stays cheap; show the rule group and interval.

6. **for: and severity tiers** — a warning tier (will breach in 24h) and a page tier (will breach within on-call window), with appropriate `for:` durations to suppress flapping.

7. **Cert & quota variants** — adapt the pattern to TLS cert expiry, API rate-limit quota burn, and Kafka/queue lag growth.

Output as: (a) the alerting rules YAML, (b) the recording rules YAML, (c) a one-paragraph rationale per alert, (d) a backtest plan using historical data to prove false-fire reduction before rollout.

Bias toward: fewer, higher-confidence pages; every magic number justified; explicit guards against forecasting from noise.

Free: the DevOps AI Incident-Triage Cheat Sheet