AI for Prometheus & Monitoring

PromQL Anomaly Detection & Z-Score Alerting Prompt

Build statistical anomaly-detection alerts in pure PromQL — z-score deviation from a rolling baseline, week-over-week seasonal comparison, and MAD-based outlier detection — so you catch weird behavior static thresholds miss.

Target user: SREs alerting on metrics with no obvious fixed threshold (traffic, error ratios, latency)
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE who has built anomaly alerts in plain PromQL for metrics where "the right threshold" changes hour to hour, and who knows where naive z-scores blow up.

I will provide:
- The metric (request rate, error ratio, latency, queue depth)
- Its known patterns (daily/weekly seasonality, growth trend, spiky vs. smooth)
- How sensitive the alert should be (catch subtle drift vs. only gross breaks)

Your job:

1. **Z-score in PromQL** — construct `(current - avg_over_time(baseline)) / stddev_over_time(baseline)` correctly, choosing the baseline window. Explain why a too-short baseline makes everything an anomaly and a too-long one buries the signal.

2. **Recording rules for the baseline** — precompute rolling `avg_over_time` and `stddev_over_time` as recording rules (so the alert stays cheap), and show how to offset them so the baseline isn't contaminated by the very spike you're detecting.

3. **Seasonality via offset** — for weekly-cyclic metrics, compare now to the same time last week with `offset 7d` (and a small window average around it) instead of a flat rolling mean. Show the expression and when this beats z-score.

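4. **Robust statistics** — explain why mean/stddev are fragile to existing outliers, and offer a median + MAD (median absolute deviation) approximation in PromQL, noting its limits: the median over time is available as `quantile_over_time(0.5, ...)`, but MAD needs a subquery over the absolute deviations, which is costly and only as precise as the subquery step.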

5. **Guard rails** — add floors so anomalies on near-zero traffic (3am) don't page (require a minimum absolute value AND a high z-score), and direction filters (only alert on upward latency, not improvements).

6. **for: and hysteresis** — tune `for:` and severity so a single noisy sample doesn't fire; explain the flapping tradeoff.

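7. **When to give up on PromQL** — be honest about where hand-rolled mean/stddev rules end and you should reach for `holt_winters` (renamed `double_exponential_smoothing` in Prometheus 3.0), Grafana Cloud's machine learning forecasts, or a dedicated anomaly engine.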

Output as: (a) recording rules YAML for baselines, (b) the anomaly alert rules YAML, (c) a per-alert rationale, (d) a backtest plan over historical data to tune sensitivity before rollout.

Bias toward: robustness over cleverness; guards against low-traffic false fires; honesty about PromQL's statistical limits.
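
For orientation, here is a minimal sketch of the kind of rules this prompt is meant to produce. It is not a tuned configuration. The metric `http_requests_total`, the recording-rule names, the 1d baseline window, the 1h offset, and every threshold (z-score of 3, floor of 10 req/s, 50% week-over-week drop) are assumptions chosen for illustration and should be replaced after backtesting.

```yaml
groups:
  - name: anomaly_baselines
    interval: 1m
    rules:
      # "Current" value: 5m request rate per job (metric name is assumed).
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Rolling 1d baseline, shifted back 1h so a spike in the last hour
      # does not inflate the mean and stddev it is compared against.
      - record: job:http_requests:rate5m:avg_1d
        expr: avg_over_time(job:http_requests:rate5m[1d] offset 1h)
      - record: job:http_requests:rate5m:stddev_1d
        expr: stddev_over_time(job:http_requests:rate5m[1d] offset 1h)

      # z-score; the "> 0" filter drops series with zero stddev so the
      # division never produces +Inf or NaN.
      - record: job:http_requests:rate5m:zscore_1d
        expr: |
          (job:http_requests:rate5m - job:http_requests:rate5m:avg_1d)
          /
          (job:http_requests:rate5m:stddev_1d > 0)

  - name: anomaly_alerts
    rules:
      # Upward-only z-score alert with an absolute-traffic floor.
      - alert: RequestRateAnomalyHigh
        expr: |
          job:http_requests:rate5m:zscore_1d > 3
          and
          job:http_requests:rate5m > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Request rate for {{ $labels.job }} is {{ printf "%.1f" $value }} standard deviations above its 1d baseline'

      # Seasonal check: compare now with a 30m average centered on the
      # same time last week (offset 6d23h45m centers the window on 7d ago).
      - alert: RequestRateBelowLastWeek
        expr: |
          job:http_requests:rate5m
            < 0.5 * avg_over_time(job:http_requests:rate5m[30m] offset 6d23h45m)
          and
          avg_over_time(job:http_requests:rate5m[30m] offset 6d23h45m) > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Request rate for {{ $labels.job }} is less than half of the same time last week'
```

The z-score alert covers metrics without strong weekly cycles, while the week-over-week alert is likely the better fit when Monday morning legitimately looks nothing like Sunday night. Both keep a floor on absolute traffic so that quiet hours do not page.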
