
PromQL Anomaly Detection & Z-Score Alerting Prompt

Build statistical anomaly-detection alerts in pure PromQL (z-score deviation from a rolling baseline, week-over-week seasonal comparison, and MAD-based outlier detection) so you catch unusual behavior that static thresholds miss.

Target user: SREs alerting on metrics with no obvious fixed threshold (traffic, error ratios, latency)
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an SRE who has built anomaly alerts in plain PromQL for metrics where "the right threshold" changes hour to hour, and who knows where naive z-scores blow up.

I will provide:
- The metric (request rate, error ratio, latency, queue depth)
- Its known patterns (daily/weekly seasonality, growth trend, spiky vs. smooth)
- How sensitive the alert should be (catch subtle drift vs. only gross breaks)

Your job:

1. **Z-score in PromQL**: construct `(current - avg_over_time(baseline)) / stddev_over_time(baseline)` correctly, choosing the baseline window. Explain why a too-short baseline makes everything an anomaly and a too-long one buries the signal.

2. **Recording rules for the baseline**: precompute rolling `avg_over_time` and `stddev_over_time` as recording rules (so the alert stays cheap), and show how to offset them so the baseline isn't contaminated by the very spike you're detecting.

3. **Seasonality via offset**: for weekly-cyclic metrics, compare now to the same time last week with `offset 7d` (and a small window average around it) instead of a flat rolling mean. Show the expression and when this beats z-score.

4. **Robust statistics**: explain why mean/stddev are fragile to existing outliers, and offer a median + MAD (median absolute deviation) style approximation in PromQL, noting its limits: the median has to come from `quantile_over_time(0.5, ...)`, and the absolute-deviation step needs a subquery.

5. **Guard rails**: add floors so anomalies on near-zero traffic (3am) don't page (fire only when the absolute value is above a minimum AND the z-score is high), and direction filters (only alert on upward latency, not improvements).

6. **`for:` and hysteresis**: tune `for:` and severity so a single noisy sample doesn't fire; explain the flapping tradeoff.

7. **When to give up on PromQL**: be honest about where this approach ends and you should reach for `holt_winters` (renamed `double_exponential_smoothing` in Prometheus 3.0), Mimir's ML, or a dedicated anomaly engine.

Output as: (a) recording rules YAML for baselines, (b) the anomaly alert rules YAML, (c) a per-alert rationale, (d) a backtest plan over historical data to tune sensitivity before rollout.

Bias toward: robustness over cleverness; guards against low-traffic false fires; honesty about PromQL's statistical limits.
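
For orientation, parts (a) and (b) of the requested output might look roughly like the sketch below. It is illustrative only: the counter `http_requests_total`, the recording-rule names, the 1d baseline, the 1h offset, the z-score threshold of 3, and the 10 req/s floor are all assumptions made up for this example, not values taken from the prompt, and they would need tuning against your own history.

```yaml
groups:
  - name: anomaly_baselines
    interval: 1m
    rules:
      # Hypothetical input counter; substitute your own metric.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Rolling 1d mean and standard deviation of the recorded rate.
      - record: job:http_requests:rate5m:avg_1d
        expr: avg_over_time(job:http_requests:rate5m[1d])
      - record: job:http_requests:rate5m:stddev_1d
        expr: stddev_over_time(job:http_requests:rate5m[1d])

      # Robust alternative: rolling median, then the median of absolute
      # deviations from it. The subquery re-reads the rolling median at each
      # 5m step, so this approximates MAD rather than computing it exactly.
      - record: job:http_requests:rate5m:median_1d
        expr: quantile_over_time(0.5, job:http_requests:rate5m[1d])
      - record: job:http_requests:rate5m:mad_1d
        expr: |
          quantile_over_time(
            0.5,
            abs(job:http_requests:rate5m - job:http_requests:rate5m:median_1d)[1d:5m]
          )

  - name: anomaly_alerts
    rules:
      # Upward z-score alert. The baseline is read with `offset 1h` so the
      # spike being evaluated is not yet part of its own mean and stddev.
      - alert: RequestRateAnomalyHigh
        expr: |
          (
              job:http_requests:rate5m
            - job:http_requests:rate5m:avg_1d offset 1h
          )
          / (job:http_requests:rate5m:stddev_1d offset 1h > 0)
          > 3
          and
          job:http_requests:rate5m > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request rate for {{ $labels.job }} is more than 3 standard deviations above its 1d baseline"

      # Seasonal comparison: current rate against a 30m average from the
      # same time last week, with the floor applied to last week's value.
      - alert: RequestRateBelowLastWeek
        expr: |
          job:http_requests:rate5m
            < 0.5 * avg_over_time(job:http_requests:rate5m[30m] offset 7d)
          and
          avg_over_time(job:http_requests:rate5m[30m] offset 7d) > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Request rate for {{ $labels.job }} is below half of the same time last week"
```

The `> 0` filter on the standard deviation drops series whose baseline is flat, which avoids a division by zero, and the `and ... > 10` clauses act as the low-traffic floor. The seasonal rule needs more than 7 days of retained data to evaluate. A file like this can be syntax-checked with `promtool check rules` before it is loaded.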