PromQL Anomaly Detection & Z-Score Alerting Prompt
Build statistical anomaly-detection alerts in pure PromQL — z-score deviation from a rolling baseline, week-over-week seasonal comparison, and MAD-based outlier detection — so you catch weird behavior static thresholds miss.
- Target user: SREs alerting on metrics with no obvious fixed threshold (traffic, error ratios, latency)
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has built anomaly alerts in plain PromQL for metrics where "the right threshold" changes hour to hour, and who knows where naive z-scores blow up.

I will provide:
- The metric (request rate, error ratio, latency, queue depth)
- Its known patterns (daily/weekly seasonality, growth trend, spiky vs. smooth)
- How sensitive the alert should be (catch subtle drift vs. only gross breaks)

Your job:
1. **Z-score in PromQL**: construct `(current - avg_over_time(baseline)) / stddev_over_time(baseline)` correctly, choosing the baseline window. Explain why a too-short baseline makes everything an anomaly and a too-long one buries the signal.
2. **Recording rules for the baseline**: precompute rolling `avg_over_time` and `stddev_over_time` as recording rules (so the alert stays cheap), and show how to offset them so the baseline isn't contaminated by the very spike you're detecting.
3. **Seasonality via offset**: for weekly-cyclic metrics, compare now to the same time last week with `offset 7d` (and a small window average around it) instead of a flat rolling mean. Show the expression and when this beats z-score.
4. **Robust statistics**: explain why mean/stddev are fragile to existing outliers, and offer a median + MAD (median absolute deviation) approximation in PromQL. The median over time is available as `quantile_over_time(0.5, ...)`; note the limits of the MAD part, which needs a subquery over the absolute deviations and is therefore approximate and expensive.
5. **Guard rails**: add floors so anomalies on near-zero traffic (3am) don't page (fire only when the absolute value is above a minimum AND the z-score is high), and direction filters (only alert on upward latency, not improvements).
6. **for: and hysteresis**: tune `for:` (plus `keep_firing_for:` where your Prometheus version supports it) and severity so a single noisy sample doesn't fire; explain the flapping tradeoff.
7. **When to give up on plain PromQL**: be honest about where this approach ends and you should reach for `holt_winters` (renamed `double_exponential_smoothing` in Prometheus 3.0), Grafana's ML features, or a dedicated anomaly engine.

Output as: (a) recording rules YAML for baselines, (b) the anomaly alert rules YAML, (c) a per-alert rationale, (d) a backtest plan over historical data to tune sensitivity before rollout.

Bias toward: robustness over cleverness; guards against low-traffic false fires; honesty about PromQL's statistical limits.
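
Example output (illustrative only, not part of the prompt). The rule file below is a minimal sketch of what parts (a) and (b) might look like. It assumes a hypothetical counter `http_requests_total` with a `job` label. The recording-rule names, the 1d baseline, the 1h baseline offset, the z-score threshold of 3, the 10 req/s floor, and the 50% week-over-week drop are all placeholder assumptions to be tuned in the backtest, not recommendations.

```yaml
groups:
  - name: anomaly_baselines
    interval: 1m
    rules:
      # Current signal: per-job request rate (metric name is an assumption).
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Rolling baseline over the last day, precomputed so alerts stay cheap.
      - record: job:http_requests:rate5m:avg_1d
        expr: avg_over_time(job:http_requests:rate5m[1d])
      - record: job:http_requests:rate5m:stddev_1d
        expr: stddev_over_time(job:http_requests:rate5m[1d])

      # Z-score against the baseline as it stood 1h ago, so a spike that is
      # happening right now has not yet leaked into the mean or stddev.
      # The "> 0" filter drops series with zero stddev (avoids division by zero).
      - record: job:http_requests:rate5m:zscore
        expr: |
          (
            job:http_requests:rate5m
            - (job:http_requests:rate5m:avg_1d offset 1h)
          )
          / (job:http_requests:rate5m:stddev_1d offset 1h > 0)

  - name: anomaly_alerts
    rules:
      # Upward-only z-score alert with an absolute floor:
      # fires only if the z-score is high AND traffic is above 10 req/s.
      - alert: RequestRateAnomalyHigh
        expr: |
          job:http_requests:rate5m:zscore > 3
          and on (job)
          job:http_requests:rate5m > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Request rate z-score for {{ $labels.job }} is {{ printf "%.1f" $value }}'

      # Week-over-week comparison: current rate vs. the 30m average ending
      # exactly 7 days ago (a trailing window, not centered on that instant).
      # Fires when traffic is below half of last week's level, and only if
      # last week's level was itself above the floor.
      - alert: RequestRateBelowLastWeek
        expr: |
          (
            job:http_requests:rate5m
            / (avg_over_time(job:http_requests:rate5m[30m] offset 7d) > 0)
          ) < 0.5
          and on (job)
          (avg_over_time(job:http_requests:rate5m[30m] offset 7d) > 10)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Request rate for {{ $labels.job }} is {{ printf "%.2f" $value }}x the same time last week'
```

A file like this can be syntax-checked with `promtool check rules <file>` before loading. The week-over-week rule needs at least 7 days of retained data for the recording rule, so it returns nothing until that history exists.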