Grafana Error Guide: 'failed to evaluate rule' — fixing unified alerting rule Error state
Fix 'failed to evaluate rule' in Grafana unified alerting — check datasource UID, query timeouts, NoData/Error handling, expressions and evaluation_timeout.
Overview
Grafana’s unified alerting engine (ngalert) evaluates each alert rule on a schedule. When the evaluation cannot complete — because the query fails, the datasource is missing, or an expression is invalid — the rule transitions into the Error state instead of producing a normal Alerting/Normal result. In the Grafana UI the rule health shows error, and the server logs the failure.
failed to evaluate rule
level=error msg="Failed to evaluate rule" rule_uid=ceval01 org_id=1 err="failed to build query 'A': data source not found"
[sse.dataQueryError] failed to execute query 'A': tsdb.HandleRequest() error context deadline exceeded
input data must be a wide series but got type long
Understanding the difference between the Error state and the NoData state is key: NoData means the query ran successfully but returned zero frames, while Error means the evaluation itself failed. Both are configurable per rule, but they have different root causes.
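The distinction can be sketched as a small decision function. This is an illustrative model only, not Grafana's implementation; `classify_evaluation` and its arguments are hypothetical names chosen for this example.

```python
from typing import Optional, Sequence


def classify_evaluation(frames: Optional[Sequence[dict]],
                        error: Optional[Exception],
                        threshold_breached: bool = False) -> str:
    """Map the outcome of one evaluation to a rule state (illustrative model)."""
    if error is not None:
        # The evaluation itself failed: bad UID, timeout, invalid expression.
        return "Error"
    if not frames:
        # The query ran successfully but returned nothing to evaluate.
        return "NoData"
    return "Alerting" if threshold_breached else "Normal"


print(classify_evaluation(None, TimeoutError("context deadline exceeded")))
print(classify_evaluation([], None))
print(classify_evaluation([{"value": 9.3}], None, threshold_breached=True))
```

The order of the checks is the point: a failed evaluation is reported as Error before the result set is ever inspected, which is why the two states need separate handling.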
Symptoms
- Alert rule shows a red Error health indicator in Alerting → Alert rules.
- The rule’s state history shows repeated Error transitions rather than Normal/Alerting.
- journalctl -u grafana-server prints Failed to evaluate rule.
- A DatasourceError alert may fire if you enabled “Alert state if execution error or timeout”.
- Panels using the same query may render fine (dashboard queries and alert queries run through different code paths with different timeouts).
Common Root Causes
1. Datasource deleted or wrong UID
Alert rules reference datasources by UID, not by name. If a datasource is recreated or provisioned with a new UID, the rule can no longer resolve it.
# /etc/grafana/provisioning/alerting/cpu-rule.yaml
# Excerpt: relativeTimeRange and the expression nodes up to refId C are omitted.
apiVersion: 1
groups:
  - orgId: 1
    name: infra
    folder: Infra
    interval: 1m
    rules:
      - uid: ceval01
        title: High CPU
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus-prod # must match an existing DS UID
            model:
              expr: node_load1
err="failed to build query 'A': data source not found"
Confirm the UID exists via GET /api/datasources and match it against the rule.
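The comparison can be automated once both API responses are saved as JSON. The following is a minimal sketch: `find_missing_datasources` is a hypothetical helper, the sample data is made up, and it assumes expression nodes carry the reserved UID `__expr__` (or `-100` in older releases) and should be skipped.

```python
# Assumption: Reduce/Math/Threshold nodes use a reserved UID, not a real datasource.
EXPRESSION_UIDS = {"__expr__", "-100"}


def find_missing_datasources(rules, datasources):
    """Return (rule_uid, ref_id, datasource_uid) for every unresolved query."""
    known = {ds["uid"] for ds in datasources}
    missing = []
    for rule in rules:
        for query in rule.get("data", []):
            uid = query.get("datasourceUid", "")
            if uid in EXPRESSION_UIDS:
                continue
            if uid not in known:
                missing.append((rule["uid"], query["refId"], uid))
    return missing


# Sample data shaped like GET /api/datasources and the provisioned rule list.
datasources = [{"name": "Prometheus", "uid": "prometheus-6f2a", "type": "prometheus"}]
rules = [{
    "uid": "ceval01",
    "title": "High CPU",
    "data": [
        {"refId": "A", "datasourceUid": "prometheus-prod"},
        {"refId": "C", "datasourceUid": "__expr__"},
    ],
}]

for rule_uid, ref_id, ds_uid in find_missing_datasources(rules, datasources):
    print(f"rule {rule_uid} query {ref_id}: datasource UID {ds_uid!r} not found")
```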
2. Query timeout / context deadline exceeded
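Slow queries hit the evaluation timeout. The default evaluation_timeout is 30s, and a heavy PromQL/LogQL query over a long range can exceed it. When it does, the evaluation is cancelled and the rule reports context deadline exceeded.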
# /etc/grafana/grafana.ini
[unified_alerting]
evaluation_timeout = 30s
max_attempts = 1
min_interval = 10s
[sse.dataQueryError] failed to execute query 'A': tsdb.HandleRequest() error context deadline exceeded
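To check whether a known-slow query fits inside the configured timeout, the setting can be read and compared against a measured duration. This is a sketch under assumptions: `parse_duration` handles only simple values such as `30s` or `1m30s`, the 42-second measurement is invented, and a real grafana.ini with keys placed before the first section header would need preprocessing before `configparser` accepts it.

```python
import configparser
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Convert a simple duration such as '30s' or '1m30s' to seconds."""
    text = text.strip()
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text)
    # Reject input that the pattern only partially matched.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def evaluation_timeout_seconds(ini_text, default="30s"):
    """Read [unified_alerting] evaluation_timeout, falling back to the default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    raw = parser.get("unified_alerting", "evaluation_timeout", fallback=default)
    return parse_duration(raw)


ini = """
[unified_alerting]
evaluation_timeout = 30s
max_attempts = 1
min_interval = 10s
"""

timeout = evaluation_timeout_seconds(ini)
measured = 42.0  # hypothetical query duration in seconds
if measured >= timeout:
    print(f"query took {measured:.0f}s, evaluation_timeout is {timeout:.0f}s")
```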
3. Expression on the wrong data shape
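Server-side expressions (Reduce, Math, Threshold) expect time series in wide format: one time field plus one labeled numeric field per series. A long-format frame, where each row carries its series identity in string columns (typical of SQL-style table results), is rejected with a type error.
input data must be a wide series but got type long
Fix: return the data as a time series rather than a table, for example by choosing the time series format in SQL-style datasources. For threshold rules, set the query to instant (Prometheus) or add a Reduce step before Math so the multi-dimensional data collapses to one value per series.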
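To make the two shapes concrete, here is a minimal sketch in plain Python (not Grafana code) of a long-format table, its wide equivalent, and a Reduce-style collapse to one value per series. The column names `time`, `host` and `value` and both helper functions are assumptions made for illustration.

```python
from collections import defaultdict

# Long format: one row per (timestamp, series); a string column names the series.
long_rows = [
    {"time": 0, "host": "web-1", "value": 0.4},
    {"time": 0, "host": "web-2", "value": 1.9},
    {"time": 60, "host": "web-1", "value": 0.6},
    {"time": 60, "host": "web-2", "value": 2.1},
]


def long_to_wide(rows, time_key="time", label_key="host", value_key="value"):
    """Pivot to one row per timestamp with one numeric column per series."""
    series = sorted({row[label_key] for row in rows})
    by_time = defaultdict(dict)
    for row in rows:
        by_time[row[time_key]][row[label_key]] = row[value_key]
    return [
        {time_key: ts, **{name: by_time[ts].get(name) for name in series}}
        for ts in sorted(by_time)
    ]


def reduce_last(wide_rows, time_key="time"):
    """Collapse each series to its last non-null value, like a Reduce step."""
    last = {}
    for row in wide_rows:
        for name, value in row.items():
            if name != time_key and value is not None:
                last[name] = value
    return last


wide_rows = long_to_wide(long_rows)
print(wide_rows)
print(reduce_last(wide_rows))
```

After the pivot, each series has its own column, so a threshold can be applied per series without ambiguity.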
4. Datasource plugin panic or backend down
If the datasource backend is unreachable (Prometheus/Loki pod restarting, network policy), the query layer surfaces an execution error.
level=error msg="Failed to evaluate rule" err="[sse.dataQueryError] failed to execute query 'A': Get \"http://prometheus:9090/api/v1/query\": dial tcp: connection refused"
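The four causes above map to distinct substrings in the error text, so a first triage pass can be scripted. This sketch matches only the sample messages quoted in this guide; `classify_error` is a hypothetical helper and the substring list is not exhaustive.

```python
# Substrings taken from the sample errors in this guide (illustrative, not complete).
ROOT_CAUSES = [
    ("data source not found", "datasource deleted or wrong UID"),
    ("context deadline exceeded", "query timeout"),
    ("must be a wide series", "expression on the wrong data shape"),
    ("connection refused", "datasource backend down"),
]


def classify_error(err):
    """Return the likely root cause for an evaluation error message."""
    lowered = err.lower()
    for needle, cause in ROOT_CAUSES:
        if needle in lowered:
            return cause
    return "unclassified"


samples = [
    "failed to build query 'A': data source not found",
    "[sse.dataQueryError] failed to execute query 'A': "
    "tsdb.HandleRequest() error context deadline exceeded",
    "input data must be a wide series but got type long",
]
for message in samples:
    print(classify_error(message))
```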
Diagnostic Workflow
Step 1 — Read the rule health and error via the API.
# Grafana-managed rules with health/state
curl -s -u "admin:$GRAFANA_PW" \
  http://localhost:3000/api/prometheus/grafana/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.health=="error") | {name, health, lastError}'
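The same filter can be applied in a script when jq is not available or the result feeds further checks. This is a minimal sketch over an already-fetched response; the sample payload is invented and contains only the fields the jq filter above reads.

```python
def rules_in_error(payload):
    """Return name and lastError for every rule whose health is 'error'."""
    found = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("health") == "error":
                found.append({
                    "name": rule.get("name"),
                    "lastError": rule.get("lastError"),
                })
    return found


# Made-up payload with the same nesting as the rules API response.
sample_payload = {
    "data": {
        "groups": [{
            "name": "infra",
            "rules": [
                {"name": "High CPU", "health": "error",
                 "lastError": "failed to build query 'A': data source not found"},
                {"name": "Disk Full", "health": "ok", "lastError": ""},
            ],
        }]
    }
}

for rule in rules_in_error(sample_payload):
    print(f"{rule['name']}: {rule['lastError']}")
```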
Step 2 — Inspect the provisioned rule definition.
curl -s -u "admin:$GRAFANA_PW" \
  http://localhost:3000/api/v1/provisioning/alert-rules | jq '.[] | {uid, title, condition}'
Step 3 — Grep the server logs for the evaluation error.
journalctl -u grafana-server --since "15 min ago" | grep -i "failed to evaluate rule"
# In Kubernetes:
kubectl logs deploy/grafana -n monitoring | grep -i "evaluate rule"
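Once a matching line is found, its key=value fields can be split out to get the rule UID and the error text. This sketch leans on `shlex.split` to handle the double-quoted values. It assumes the line looks like the samples in this guide; a line with unbalanced quotes would raise `ValueError`.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


line = (
    'level=error msg="Failed to evaluate rule" rule_uid=ceval01 org_id=1 '
    "err=\"failed to build query 'A': data source not found\""
)

fields = parse_logfmt(line)
print(fields["rule_uid"], "->", fields["err"])
```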
Step 4 — Verify the datasource UID resolves.
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/datasources \
  | jq '.[] | {name, uid, type}'
Step 5 — Confirm evaluation timeout settings.
[unified_alerting]
evaluation_timeout = 60s
max_attempts = 3
Example Root Cause Analysis
An on-call engineer noticed the High CPU rule showed Error for 40 minutes but no incident was raised. The dashboard panel with the identical node_load1 query rendered normally, which was the red herring.
Running the rules API returned:
"lastError": "failed to build query 'A': data source not found"
The team had re-provisioned Prometheus that morning through a new Helm release, which regenerated the datasource with UID prometheus-6f2a instead of the old prometheus-prod. The dashboard used a variable that resolved to the current default datasource, so it kept working — but the alert rule stored the stale UID in its JSON model.
Fix: they updated datasourceUid in the provisioning file to a fixed, pinned UID (uid: prometheus-prod set explicitly in the datasource provisioning too), redeployed, and the rule returned to ok. They added a CI check asserting datasource UIDs are stable across releases.
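The source does not show the team's CI check. One possible shape, assuming the pipeline can produce a name-to-UID mapping for the previous and the current release, is sketched below; `uid_drift` and both mappings are hypothetical.

```python
def uid_drift(previous, current):
    """Return {name: (old_uid, new_uid)} for datasources whose UID changed or vanished."""
    drift = {}
    for name, old_uid in previous.items():
        new_uid = current.get(name)
        if new_uid != old_uid:
            drift[name] = (old_uid, new_uid)
    return drift


# Hypothetical mappings extracted from two releases of the provisioning files.
previous = {"Prometheus": "prometheus-prod"}
current = {"Prometheus": "prometheus-6f2a"}

drift = uid_drift(previous, current)
for name, (old, new) in drift.items():
    print(f"{name}: UID changed from {old} to {new}")
# In CI, a non-empty result would fail the job.
```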
Prevention Best Practices
- Pin datasource UIDs explicitly in datasources.yaml provisioning so they survive redeploys.
- Set evaluation_timeout higher than your slowest alert query and keep min_interval sane.
- Use instant queries plus an explicit Reduce for threshold rules to avoid long/wide type errors.
- Configure per-rule Error handling deliberately: choose Error, Alerting, or OK on execution error rather than leaving defaults.
- Alert on Grafana folder health: create a meta-alert on ALERTS{alertstate="firing", alertname="DatasourceError"}.
- Version-control all /etc/grafana/provisioning/alerting/ files and review UID changes in PRs.
See more Grafana troubleshooting in /categories/grafana/ and the sibling guide on contact point send failures.
Quick Command Reference
# Find rules currently in error state
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/prometheus/grafana/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.health=="error") | {name, lastError}'
# Dump provisioned alert rules
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/v1/provisioning/alert-rules | jq '.[].uid'
# List datasource UIDs
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/datasources | jq '.[] | {name, uid}'
# Tail evaluation errors
journalctl -u grafana-server -f | grep -i "evaluate rule"
kubectl logs -f deploy/grafana -n monitoring | grep -i "evaluate rule"
Conclusion
The top root causes of failed to evaluate rule in Grafana unified alerting:
- Datasource not found: the rule references a UID that no longer exists after a redeploy.
- Query timeout: heavy queries exceed [unified_alerting] evaluation_timeout (context deadline exceeded).
- Wrong data shape: feeding a long-format frame into a Reduce/Math expression (must be a wide series but got type long).
- Datasource backend down or panicking: the underlying Prometheus/Loki endpoint is unreachable.
- Misconfigured error handling: leaving default Error-state behavior so failures go unnoticed instead of paging.