How to Review AI-Generated Prometheus Alert Rules Before They Page

An alert rule is code that runs unsupervised, at 3am, and wakes up a human. That’s a strange and high-stakes kind of artifact to let an AI write — and yet AI is genuinely great at writing it, because most of an alert rule is boilerplate the model has seen ten thousand times. The risk isn’t that AI produces garbage. It’s that AI produces something that looks exactly like a good alert and is subtly wrong in a way you only discover during the incident it was supposed to catch. So I’ve built a review ritual. Every AI-generated alert clears it before it merges. Here’s the checklist, with the failures it’s caught.

The mindset: AI drafts, humans approve

I think of the model as a fast junior engineer who has read every alerting blog post but has never been paged. It produces a strong first draft instantly. It has no idea whether for: 5m is right for your deploy cadence or whether the metric it referenced actually exists in your cluster. The draft is a hypothesis. Review is where it becomes safe to deploy. If a rule can’t survive the checklist below, it doesn’t merge — no exceptions, no “looks fine to me.”

Check 1: does the metric actually exist?

The single most common failure. AI confidently invents plausible metric names. cpu_usage, memory_percent, disk_full — none of these are real Prometheus conventions, but they appear constantly. Before anything else I check the referenced series exists:

# Run this in the expression browser — does it return data?
count(node_memory_MemAvailable_bytes)

If it returns nothing, the rule will never fire: either the model hallucinated the metric or the exporter that provides it isn’t being scraped. I ask the model to justify the name against node_exporter or kube-state-metrics conventions, and I confirm against my real metrics. A rule on a non-existent metric is worse than no rule, because it creates false confidence.
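The expression-browser check is a one-off, so a series that exists today can still vanish after an exporter upgrade or a relabeling change. A minimal sketch of a standing guard, assuming the rule lives in an ordinary Prometheus rule file; the group name, alert name, and 15m window are my own placeholders, not anything the tooling mandates:

```yaml
groups:
  - name: metric-existence-guards        # hypothetical group name
    rules:
      - alert: NodeMemoryMetricMissing   # hypothetical alert name
        # absent() returns a single series with value 1 when no matching series exist
        expr: 'absent(node_memory_MemAvailable_bytes)'
        for: 15m                         # assumed: long enough to ride out a brief scrape gap
        labels:
          severity: warning
        annotations:
          summary: "node_memory_MemAvailable_bytes has no series, so alerts built on it cannot fire"
```

This does not replace the manual check at review time. It only makes a later disappearance visible instead of silent.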

Check 2: is the PromQL resilient to restarts and gaps?

A counter that resets on pod restart produces a large negative delta() that a naive query reads as a real drop. A gauge that goes stale during a scrape failure returns no samples at all, so a threshold comparison on it silently stops matching instead of firing. I read the expression asking “what happens when the target disappears?” The fix is usually rate() over a window, an explicit absent() guard, or both:

# Fragile: fires spuriously on restart
- alert: "RequestsDropped"
  expr: 'delta(http_requests_total[5m]) < 0'

# Resilient: rate handles counter resets correctly
- alert: "RequestRateCollapsed"
  expr: 'rate(http_requests_total[5m]) < 0.1'
  for: 10m

AI often defaults to the fragile form because it’s shorter. I make it explain how the query behaves across a counter reset, and if the explanation is vague, I rewrite it.
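When the explanation is vague, I’d rather have a test than an argument. Below is a sketch of a promtool unit test that replays a counter reset against the resilient rule above. The file name alerts.yml, the job label, and the sample values are assumptions for illustration:

```yaml
# Hypothetical test file; run with: promtool test rules <this-file>
rule_files:
  - alerts.yml   # assumed to contain the RequestRateCollapsed rule
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Counter climbs 60 per minute, drops to 0 (simulated restart), then climbs again.
      - series: 'http_requests_total{job="api"}'
        values: '0+60x10 0+60x10'
    alert_rule_test:
      - eval_time: 15m
        alertname: RequestRateCollapsed
        # No alert expected: rate() treats the drop as a counter reset, not as lost traffic.
        exp_alerts: []
```

The same input series pointed at the delta() version is a quick way to watch the fragile form misbehave.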

Check 3: is the for: duration deliberate?

for: is where AI is laziest, almost always defaulting to 5m regardless of context. A 5-minute window is wrong for a batch job that runs every ten minutes, wrong for a flappy metric, and wrong for a hard-down condition where you want to page in thirty seconds. I ask: what’s the shortest real outage we must catch, and what’s the longest blip we must ignore? The for: lives between those two numbers, and that’s a human judgment about this service, not a default.
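To make the contrast concrete, here is a sketch of two rules whose for: values come from the service rather than from a default. The job names, the batch_last_success_timestamp_seconds metric, and every number are hypothetical; the point is only that each value has a stated reason:

```yaml
groups:
  - name: for-duration-sketch   # hypothetical group name
    # Assumed interval: for: is only checked at each evaluation, so a 30s hold needs a short interval.
    interval: 15s
    rules:
      # Hard-down condition: page quickly, tolerate at most one missed scrape.
      - alert: PaymentsApiDown
        expr: 'up{job="payments-api"} == 0'
        for: 30s
        labels:
          severity: page
      # Batch job that runs every ten minutes: 1500s is two missed runs plus slack.
      - alert: ExportBatchStale
        expr: 'time() - batch_last_success_timestamp_seconds{job="export"} > 1500'
        for: 5m
        labels:
          severity: warning
```

If I can’t write the comment above a for: line, I haven’t chosen the value yet.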

Pro Tip: For any AI-drafted alert, ask the model to list three ways the alert could fire when nothing is actually wrong, and three ways a real problem could happen without firing. If it can’t produce both lists convincingly, the rule isn’t ready — and the lists often reveal a missing label scope or a bad window.

Check 4: severity, routing labels, and runbook

A correct alert that pages the wrong team is still a failure. I verify the severity label matches our Alertmanager routing tree, the team/service labels exist so routing works, and every alert carries a runbook_url. AI happily omits all of these unless prompted, so I template the requirement:

- alert: "ServiceLatencyP95High"
  expr: 'service:request_latency:p95_5m > 0.5'
  for: 10m
  labels:
    severity: warning
    team: payments
  annotations:
    summary: "p95 latency above 500ms for {{ $labels.service }}"
    runbook_url: "https://runbooks.internal/payments/latency"

An alert without a runbook is an alert that turns into a frantic guessing game at 3am. That’s a hard gate in my review.
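The labels only matter if something matches on them. A sketch of the Alertmanager side, assuming a routing tree keyed on team and severity; the receiver names are hypothetical and their integration settings are left out:

```yaml
route:
  receiver: default-slack           # catch-all for alerts that match no child route
  group_by: ['alertname', 'team']
  routes:
    - matchers:
        - team="payments"
        - severity="critical"
      receiver: payments-pager
    - matchers:
        - team="payments"
        - severity="warning"
      receiver: payments-slack
receivers:
  # Integration configs (pager, chat webhook, and so on) omitted in this sketch.
  - name: default-slack
  - name: payments-pager
  - name: payments-slack
```

In review I read the rule’s labels against a tree like this and ask which receiver the alert lands on. A severity value the tree doesn’t know about falls through to the catch-all, which is usually not where a page should go.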

Check 5: can a human explain why it’s correct?

This is the meta-check. For every rule, I should be able to say in one sentence why the threshold, the window, and the labels are right for this service. If the only justification is “the AI wrote it,” the rule fails. Explainability is the whole point of keeping a human in the loop — not ceremony, but the thing that catches the subtle errors automation can’t see in itself.

Check 6: will it survive aggregation across instances?

A failure I see constantly in AI-generated alerts is a comparison that’s correct for a single instance but wrong once you have ten. The model writes node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1 and it reads fine, until you realize it fires per mountpoint per instance and your Alertmanager grouping wasn’t built for that fan-out, so one bad NFS mount pages everyone. Or worse, the model aggregates with avg() where it should use max(), hiding a single saturated instance behind a healthy average. I read every alert asking “what does this do when there are fifty matching series, and is that the cardinality I want paging me?”

# Misleading: a single pegged core disappears into the per-instance average
- alert: "CpuHigh"
  expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9'

# Honest: catches any core pegged, scoped sensibly
# (busy fraction per core is 1 minus its idle rate; max picks the busiest core)
- alert: "CpuSaturated"
  expr: 'max by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9'
  for: 10m

I make the model state explicitly how many series its expression returns and what each one represents. If it can’t, the aggregation isn’t deliberate, and a non-deliberate aggregation in an alert is a future false-negative or a paging storm waiting to happen.
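For the filesystem case, a sketch of what a deliberate scope can look like. It assumes network and tmpfs mounts are watched elsewhere, which may not hold for your setup; the fstype filter, window, and alert name are illustrative choices:

```yaml
groups:
  - name: filesystem-sketch   # hypothetical group name
    rules:
      - alert: FilesystemAlmostFull
        # Returns one series per (instance, device, mountpoint, fstype) that survives the filter.
        # To see the fan-out before merging, wrap the expression in count(...) in the expression browser.
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|nfs.*"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|nfs.*"}
            < 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} has under 10% space free"
```

The comment stating what each series represents is the part I care about most, because it forces the cardinality question to be answered in the rule file itself.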

Build the checklist into your team’s habit

A checklist only works if it’s actually run, so I’ve made it lightweight enough to apply every time and visible enough that reviewers expect it. The six checks fit on a sticky note: metric exists, PromQL survives restarts, for: is deliberate, labels and runbook present, aggregation is intentional, and a human can explain it in a sentence. For the team, I keep these as PR review criteria so an AI-generated alert can’t merge without someone confirming them. The point isn’t bureaucracy — it’s that the failures this catches are precisely the ones that don’t show up until the incident the alert was supposed to handle, when it’s far too late to discover the metric name was hallucinated.

Tooling that makes the ritual faster

I run alert drafts through the free Alert Rule Generator, which already enforces for: durations, severity labels, and runbook annotations, so the draft arrives most of the way through the checklist. For the resilience and metric-existence checks, our code review dashboard is useful when alert rules live in a Git repo alongside everything else. I’ve drafted rules in Claude and ChatGPT and reviewed them inline with Cursor — the model varies, the checklist doesn’t.

Conclusion

AI-generated alert rules are a genuine productivity win, but only because the review step is non-negotiable. Verify the metric exists, confirm the PromQL survives restarts, make the for: deliberate, enforce labels and runbooks, check that every aggregation is intentional, and refuse anything you can’t explain in a sentence. The model is a fast junior engineer; you’re the one who’s been paged. Keep that division of labor and AI makes your alerting better instead of louder. More patterns are covered in designing alert rules that don’t page you falsely and across the monitoring guides.
