Reviewing AI-Generated Grafana Alert Rules Before They Go Live
Grafana's unified alerting hides real complexity behind a friendly UI. How I review AI-generated Grafana alert rules so they don't fire wrong or stay silent.
- #prometheus
- #grafana
- #alerting
- #ai
- #sre
Grafana unified alerting looks deceptively simple in the UI — a query, a threshold, a notification — but underneath it’s a multi-stage pipeline of query, reduce, threshold, and no-data handling that behaves nothing like a plain Prometheus alert rule. So when I started letting AI draft Grafana alerts as provisioning YAML, I needed a review process tuned to the specific ways those drafts go wrong. They’re not the same failures as raw Prometheus alerts. Grafana adds its own footguns, and the model treats them casually because it has seen far more Prometheus rules than Grafana provisioning files. Here’s the review I run before any AI-drafted Grafana alert reaches production.
Why Grafana alerts fail differently
A Prometheus alert is one expression with a for: duration. A Grafana alert is a chain: a query node produces a series, a reduce node collapses each series to a single number, and a threshold node compares that number against a limit. AI frequently gets the expression right and the reduce/threshold chain wrong, producing an alert that parses, provisions cleanly, and silently never fires. The UI hides this, and the YAML is verbose enough that errors hide in plain sight.
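One way to make the chain visible is to check its wiring mechanically. The sketch below is a minimal illustration, not Grafana's own validation: it assumes the provisioning file has already been parsed into a dict, that reduce and threshold nodes name their input in `model.expression`, and that expression nodes carry the `__expr__` data source UID. The function name `chain_problems` and the sample rule are hypothetical.

```python
EXPR_UID = "__expr__"  # UID Grafana assigns to server-side expression nodes


def chain_problems(rule):
    """List wiring problems in one rule's query -> reduce -> threshold chain."""
    nodes = {node["refId"]: node for node in rule.get("data", [])}
    problems = []

    condition = rule.get("condition")
    if condition not in nodes:
        problems.append(f"condition {condition!r} matches no refId")

    for ref_id, node in nodes.items():
        if node.get("datasourceUid") != EXPR_UID:
            continue  # a real data source query has no upstream node
        model = node.get("model", {})
        if model.get("type") not in ("reduce", "threshold"):
            continue  # math and other expression types are out of scope here
        source = model.get("expression")
        if source == ref_id:
            problems.append(f"{ref_id}: reads its own output")
        elif source not in nodes:
            problems.append(f"{ref_id}: input {source!r} matches no refId")
    return problems


# Hypothetical rule, already parsed from the provisioning file into a dict.
rule = {
    "title": "p95 latency high",
    "condition": "C",
    "data": [
        {"refId": "A", "datasourceUid": "prom-main", "model": {"expr": "up"}},
        {"refId": "B", "datasourceUid": EXPR_UID,
         "model": {"type": "reduce", "expression": "A", "reducer": "mean"}},
        {"refId": "C", "datasourceUid": EXPR_UID,
         "model": {"type": "threshold", "expression": "D"}},  # D does not exist
    ],
}
print(chain_problems(rule))
```

An empty list only means the references line up. Whether the reducer and threshold are the right ones is still a human call.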
Check the data source UID is real
The most common provisioning failure is a hallucinated datasourceUid. The model invents a plausible UID, the file validates as YAML, and the alert imports with a dangling reference. I verify it against the actual data source first:
curl -s -H "Authorization: Bearer $TOKEN" \
  localhost:3000/api/datasources | jq '.[] | {name, uid}'
Then I confirm every query node in the AI’s draft uses a UID from that list. A wrong UID is the Grafana equivalent of a hallucinated metric name — it looks complete and does nothing.
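The comparison itself is easy to script. This is a rough sketch under stated assumptions: the draft is already parsed into the `groups` list of a provisioning file, the set of real UIDs comes from the API call above, and expression nodes use the `__expr__` UID and are skipped. The function name and sample values are hypothetical.

```python
EXPR_UID = "__expr__"  # expression nodes use this UID, not a real data source


def dangling_uids(groups, known_uids):
    """Return (rule title, refId, uid) for every query node with an unknown UID."""
    dangling = []
    for group in groups:
        for rule in group.get("rules", []):
            for node in rule.get("data", []):
                uid = node.get("datasourceUid")
                if uid != EXPR_UID and uid not in known_uids:
                    dangling.append((rule.get("title"), node.get("refId"), uid))
    return dangling


# Hypothetical values: UIDs as returned by /api/datasources, and a parsed draft.
known_uids = {"prom-main", "loki-main"}
groups = [{
    "name": "latency",
    "rules": [{
        "title": "p95 latency high",
        "data": [
            {"refId": "A", "datasourceUid": "prometheus"},  # plausible but invented
            {"refId": "B", "datasourceUid": EXPR_UID},
        ],
    }],
}]
print(dangling_uids(groups, known_uids))
```

A node with no `datasourceUid` at all is reported too, with `None` as its UID, which is usually what a reviewer wants to see.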
Read the reduce step, because that’s where it lies
This is the check that catches the most subtle bugs. A Grafana alert query usually returns a time series, but the threshold compares a scalar, so a reduce expression sits between them. AI loves to default to last(), which means a single late or null sample can flip the alert. For most real conditions I want mean or max over the window:
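# AI default: one stale sample flips the alert
- refId: B
  datasourceUid: __expr__
  model:
    type: reduce
    expression: A        # refId of the query node being reduced (assumed to be A)
    reducer: last
    settings:
      mode: dropNN       # drop non-numeric samples before reducing

# Deliberate: average the window so a single blip can't trip it
- refId: B
  datasourceUid: __expr__
  model:
    type: reduce
    expression: A
    reducer: mean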
I make the model justify the reducer for the specific condition. “Why last and not mean?” If the answer is hand-wavy, it defaulted, and a default reducer in an alert is a future false page or a missed outage.
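To see why the reducer choice matters, it helps to run the same window through each one. The sketch below is a toy model of a reduce step, not Grafana's implementation: it drops missing samples (roughly what `dropNN` does), and the latency values and the 0.5 threshold are invented for illustration.

```python
from statistics import fmean


def reduce_window(samples, reducer):
    """Collapse a window of samples to one number, ignoring missing values."""
    values = [v for v in samples if v is not None]
    if not values:
        return None  # nothing left to reduce: this is the no-data case
    if reducer == "last":
        return values[-1]
    if reducer == "mean":
        return fmean(values)
    if reducer == "max":
        return max(values)
    raise ValueError(f"unknown reducer: {reducer}")


THRESHOLD = 0.5  # hypothetical p95 latency limit, in seconds

# A healthy window that ends on one bad sample.
blip = [0.21, 0.19, 0.22, 0.20, 0.95]
# A degraded window that ends on one good sample.
recovering = [0.80, 0.90, 0.85, 0.90, 0.20]

for name, window in (("blip", blip), ("recovering", recovering)):
    for reducer in ("last", "mean", "max"):
        value = reduce_window(window, reducer)
        print(name, reducer, round(value, 3), "fires" if value > THRESHOLD else "quiet")
```

With these numbers, `last` fires on the blip and stays quiet on the degraded window, while `mean` does the opposite. Neither is always right, which is the reason to ask for a justification per condition.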
No-data and error handling is not optional
Raw Prometheus has absent(); Grafana has explicit noDataState and execErrState settings, and AI omits them constantly. The default behavior may not be what you want — a flapping target that produces no data could resolve your alert exactly when something is wrong. I require these to be set explicitly and deliberately:
noDataState: Alerting # treat missing data as a problem, not "all clear"
execErrState: Error
for: 5m
For a hard-down condition I want noDataState: Alerting. For a noisy optional exporter I might want OK. That’s a per-alert judgment, and the model won’t make it unless I force the question.
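Whether the values are right is a judgment call, but whether they are present at all can be checked mechanically. A minimal sketch, assuming the draft is parsed into the `groups` list of a provisioning file and that these three keys sit at the rule level; the function name and sample data are hypothetical.

```python
REQUIRED_KEYS = ("noDataState", "execErrState", "for")


def rules_missing_settings(groups):
    """Map each rule title to the required keys it leaves unset."""
    report = {}
    for group in groups:
        for rule in group.get("rules", []):
            missing = [key for key in REQUIRED_KEYS if key not in rule]
            if missing:
                report[rule.get("title", "<untitled>")] = missing
    return report


# Hypothetical parsed draft: one rule is explicit, the other relies on defaults.
groups = [{
    "name": "platform",
    "rules": [
        {"title": "api down", "noDataState": "Alerting",
         "execErrState": "Error", "for": "5m"},
        {"title": "p95 latency high", "for": "5m"},
    ],
}]
print(rules_missing_settings(groups))
```

The check deliberately says nothing about which value is correct. It only forces the question onto the reviewer for every rule that skipped it.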
Pro Tip: Ask the AI to describe, in plain English, the exact sequence of states this alert moves through when the target goes down at 2am and comes back at 2:05. If its narration doesn’t match what you’d want on-call to experience, the no-data and for: settings are wrong — and that walkthrough surfaces it faster than reading the YAML.
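The same walkthrough can be approximated in a few lines to sanity-check the narration. This is a deliberately simplified model, not Grafana's evaluator: it knows only Normal, Pending, and Alerting, assumes one evaluation per minute, and treats missing data as breaching (as it would be with noDataState: Alerting). The function name is hypothetical.

```python
def walk_states(breaching, for_minutes, interval_minutes=1):
    """Return the state after each evaluation tick in a simplified alert model."""
    states = []
    pending_since = None
    for tick, is_breaching in enumerate(breaching):
        now = tick * interval_minutes
        if not is_breaching:
            pending_since = None  # any healthy evaluation resets the timer
            states.append("Normal")
            continue
        if pending_since is None:
            pending_since = now
        if now - pending_since >= for_minutes:
            states.append("Alerting")
        else:
            states.append("Pending")
    return states


# Evaluations at 1:59, 2:00, ..., 2:06. The target is down from 2:00 to 2:04
# and healthy again at the 2:05 evaluation.
outage = [False, True, True, True, True, True, False, False]

print(walk_states(outage, for_minutes=5))
print(walk_states(outage, for_minutes=2))
```

In this model, the outage never leaves Pending when for: is 5m, because the target recovers just before the timer runs out. With a 2m window it reaches Alerting at the 2:02 evaluation. If that is not what on-call should experience, the for: setting is the thing to revisit.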
Labels and contact-point routing
A correct Grafana alert that routes nowhere is useless. I verify the alert carries the labels our notification policy tree matches on — usually severity and team — because Grafana routes by label just like Alertmanager:
labels:
  severity: warning
  team: platform
annotations:
  summary: "p95 latency high on {{ $labels.service }}"
  runbook_url: "https://runbooks.internal/platform/latency"
AI omits routing labels unless prompted, so I template the requirement and check it every time. An alert with no team label falls through to the default policy and pages the wrong people.
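The label requirement is simple enough to enforce with a small check. A sketch, assuming each rule is parsed into a dict and that severity and team are the labels the policy tree matches on, as described above; adjust the tuple for a different tree. The function name and sample rules are hypothetical.

```python
REQUIRED_LABELS = ("severity", "team")


def missing_routing_labels(rule, required=REQUIRED_LABELS):
    """Return the routing labels that are absent or empty on one rule."""
    labels = rule.get("labels") or {}
    return [name for name in required if not labels.get(name)]


# Hypothetical parsed rules.
routed = {"title": "p95 latency high",
          "labels": {"severity": "warning", "team": "platform"}}
unrouted = {"title": "disk filling", "labels": {"severity": "critical"}}

for rule in (routed, unrouted):
    print(rule["title"], missing_routing_labels(rule))
```

An empty string counts as missing here, since a blank team label would match nothing useful in the policy tree.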
Don’t let it duplicate your Prometheus rules
A real organizational failure: the model cheerfully recreates an alert in Grafana that already exists as a Prometheus rule, and now the same condition pages twice from two systems with slightly different thresholds. Before accepting a Grafana alert I ask whether this condition belongs in Grafana at all or whether it should live as a Prometheus rule evaluated by the server. My default is that paging alerts live in Prometheus/Alertmanager and Grafana alerting is for things genuinely tied to dashboard queries. The model has no opinion on this architecture; I do.
Provision it as code and review it as code
Because I keep these as provisioning files in Git, they go through normal code review, which means our code review dashboard catches the structural issues — dangling UIDs, missing for:, absent routing labels — before a human even reads the logic. The human review then focuses on the judgment calls: is the reducer right, is the no-data behavior right, does this belong in Grafana at all. That split keeps the review fast without making it shallow.
The mindset holds across both systems
Whether the model is drafting a Prometheus rule or a Grafana alert, it’s the same fast junior engineer who has read every doc and never carried a pager. It produces a strong, complete-looking draft instantly, and the draft is a hypothesis until a human can explain why the reducer, the threshold, and the no-data state are right for this condition. I draft these in Claude and refine them inline with Cursor, but the model is interchangeable — the review checklist is what makes the output safe. For the alerts that should live in Prometheus instead, the Alert Rule Generator gives me a draft that already carries for:, severity, and runbook annotations.
Conclusion
Grafana unified alerting fails in its own specific ways — hallucinated data source UIDs, lazy reducers, missing no-data handling, duplicated Prometheus rules — and AI hits all of them because it’s seen far more raw Prometheus than Grafana provisioning. Verify the UID, scrutinize the reduce step, force explicit no-data and routing decisions, and confirm the alert belongs in Grafana at all. More dashboard and alerting patterns are in the monitoring guides, and reusable review prompts live in the prompts library.