AI Prompt Templates for Prometheus Alerting
Production-ready prompt templates for generating Prometheus alert rules with proper thresholds, runbook annotations, and false-positive analysis.
- #prometheus
- #alerting
- #promql
- #ai
- #sre
Writing good Prometheus alerts is hard. Most alerts are too sensitive (page on every blip), too lax (miss real outages), or missing context (no runbook, no labels, no severity routing). AI assistants are unusually good at the grunt work of alert authoring — if you prompt them right.
Why generic alert generators fail
Type “write me a Prometheus alert for high CPU” into any AI and you’ll get:
- alert: HighCPU
expr: cpu_usage > 80
for: 5m
Three things wrong already: cpu_usage isn’t a real Prometheus metric, there’s no rate() window, and for: 5m will flap on every cron job. You need a prompt that anchors the model in production reality.
The template structure
Our Prometheus Alert Rule Generator Prompt enforces:
- Resilient PromQL —
rate(),avg_over_time, orhistogram_quantile()as appropriate. - Appropriate
for:duration — long enough to avoid flap, short enough to detect real outages. - Severity labels and routing —
severity,team,service. - Runbook annotation — every alert links to a runbook.
- False-positive analysis — the model lists ways the alert could lie.
Three patterns worth saving
Pattern 1: Rate-based error alerts
Alert me when the 5-minute error rate exceeds 1% for at least 10 minutes, scoped per service.
Generated PromQL pattern:
expr: |
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
> 0.01
for: 10m
Pattern 2: SLO-based latency
Alert when p99 latency exceeds my SLO threshold for 10 minutes.
expr: |
histogram_quantile(0.99,
sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
) > 0.8
for: 10m
Pattern 3: Saturation alerts
Alert when disk on any node will run out in < 4 hours based on current growth rate.
expr: |
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 4*3600) < 0
for: 30m
labels:
severity: warning
The predict_linear pattern is particularly nice — it pages you before the disk fills, not at 100%.
Validation: don’t trust, verify
Before promoting any AI-generated alert to prod:
promtool check rules my-alerts.yml
Run it in your staging Prometheus first. Watch it for 24 hours. Check if it would have fired during recent incidents using promtool test rules.
Combining alert generation with runbook drafting
A workflow that compounds: ask the same AI to also draft the runbook for the alert it generated. “Now write a runbook for this alert: what should the on-call check first, what are the common causes, and what’s the rollback procedure?”
You’ll have an alert and a runbook in 5 minutes. Both still need human review — but the blank page is gone.