AI for Grafana · By James Joyner IV · 10 min read

Grafana Error Guide: 'failed to evaluate rule' — fixing unified alerting rule Error state

Fix 'failed to evaluate rule' in Grafana unified alerting — check datasource UID, query timeouts, NoData/Error handling, expressions and evaluation_timeout.

  • #grafana
  • #troubleshooting
  • #errors
  • #unified-alerting
  • #ngalert

Overview

Grafana’s unified alerting engine (ngalert) evaluates each alert rule on a schedule. When an evaluation cannot complete (the query fails, the datasource is missing, or an expression is invalid), the rule moves to the Error state instead of resolving to Normal or Alerting. In the Grafana UI the rule health shows error, and the server logs the failure with messages such as:

failed to evaluate rule
level=error msg="Failed to evaluate rule" rule_uid=ceval01 org_id=1 err="failed to build query 'A': data source not found"
[sse.dataQueryError] failed to execute query 'A': tsdb.HandleRequest() error context deadline exceeded
input data must be a wide series but got type long

Understanding the difference between the Error state and the NoData state is key: NoData means the query ran successfully but returned no data, while Error means the evaluation itself failed. How a rule reacts to each condition is configurable per rule, but the two have different root causes.
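The distinction can be sketched in a few lines. This is an illustrative model only, not Grafana's implementation; `EvalResult` and `classify` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalResult:
    """Hypothetical stand-in for the outcome of one rule evaluation."""
    frames: list = field(default_factory=list)  # data returned by the query
    error: Optional[str] = None                 # set when the evaluation itself failed


def classify(result: EvalResult) -> str:
    # Error takes precedence: the query or expression never completed.
    if result.error is not None:
        return "Error"
    # The query completed but returned nothing to evaluate.
    if not result.frames:
        return "NoData"
    # Data is available, so the condition can resolve to Normal or Alerting.
    return "Evaluated"


print(classify(EvalResult(error="data source not found")))  # Error
print(classify(EvalResult()))                               # NoData
```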

Symptoms

  • Alert rule shows a red Error health indicator in Alerting → Alert rules.
  • The rule’s state history shows repeated Error transitions rather than Normal/Alerting.
  • journalctl -u grafana-server prints Failed to evaluate rule.
  • A DatasourceError alert may fire if “Alert state if execution error or timeout” is set to Error.
  • Panels using the same query may render fine, because dashboard queries and alert queries run through different code paths with different timeouts.

Common Root Causes

1. Datasource deleted or wrong UID

Alert rules reference datasources by UID, not by name. If a datasource is recreated or provisioned with a new UID, the rule can no longer resolve it.

# /etc/grafana/provisioning/alerting/cpu-rule.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: infra
    folder: Infra
    interval: 1m
    rules:
      - uid: ceval01
        title: High CPU
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus-prod   # must match an existing DS UID
            model:
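              expr: node_load1
          # expression steps (such as the refId C condition) are omitted from this excerpt

If that UID does not resolve, the evaluation fails with:

err="failed to build query 'A': data source not found"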

Confirm the UID exists via GET /api/datasources and match it against the rule.
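That cross-check can be scripted. The following is a minimal sketch, assuming the rule list comes from /api/v1/provisioning/alert-rules and the datasource list from /api/datasources; the function name `stale_references` and the sample payloads are hypothetical, and treating "-100" as a legacy expression UID is an assumption.

```python
# UIDs that denote server-side expression steps rather than real datasources.
# "__expr__" is the expression UID; "-100" is assumed here for older exports.
EXPRESSION_UIDS = {"__expr__", "-100"}


def stale_references(rules, datasources):
    """Map each rule UID to the datasource UIDs it references that do not exist."""
    known = {ds["uid"] for ds in datasources}
    stale = {}
    for rule in rules:
        missing = []
        for query in rule.get("data", []):
            uid = query.get("datasourceUid")
            if uid and uid not in known and uid not in EXPRESSION_UIDS:
                if uid not in missing:
                    missing.append(uid)
        if missing:
            stale[rule["uid"]] = missing
    return stale


# Illustrative payloads shaped like the two API responses.
rules = [{
    "uid": "ceval01",
    "title": "High CPU",
    "data": [
        {"refId": "A", "datasourceUid": "prometheus-prod"},
        {"refId": "C", "datasourceUid": "__expr__"},
    ],
}]
datasources = [{"name": "Prometheus", "uid": "prometheus-6f2a", "type": "prometheus"}]

print(stale_references(rules, datasources))  # {'ceval01': ['prometheus-prod']}
```

An empty result means every query in every rule points at a datasource that currently exists.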

2. Query timeout / context deadline exceeded

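Slow queries hit the evaluation timeout. The default evaluation_timeout is 30s, and heavy PromQL/LogQL queries over long ranges can exceed it. Reduce the query's cost (for example, a shorter range) or raise the timeout in grafana.ini.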

# /etc/grafana/grafana.ini
[unified_alerting]
evaluation_timeout = 30s
max_attempts = 1
min_interval = 10s

When the deadline is hit, the log shows:

[sse.dataQueryError] failed to execute query 'A': tsdb.HandleRequest() error context deadline exceeded
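To compare the configured timeout with how long a query actually takes, the setting can be read programmatically. This is a rough sketch using Python's configparser; the helper names are hypothetical, the duration parser only handles simple single-unit values such as 30s or 1m, and the 42-second query duration is an invented measurement.

```python
import configparser
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse simple durations such as '30s', '1m' or '500ms' into seconds."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h)", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def evaluation_timeout_seconds(ini_text, default="30s"):
    """Return [unified_alerting] evaluation_timeout in seconds, or the default."""
    parser = configparser.ConfigParser(interpolation=None)
    # A dummy section header lets keys that precede any section parse cleanly.
    parser.read_string("[__top__]\n" + ini_text)
    raw = parser.get("unified_alerting", "evaluation_timeout", fallback=default)
    return parse_duration(raw)


SAMPLE_INI = """
[unified_alerting]
evaluation_timeout = 30s
max_attempts = 1
min_interval = 10s
"""

timeout = evaluation_timeout_seconds(SAMPLE_INI)
slowest_query = 42.0  # hypothetical measured query duration, in seconds
print(timeout, slowest_query > timeout)  # 30.0 True
```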

3. Expression on the wrong data shape

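Server-side expressions such as Reduce and Math expect time series in wide format: one time field plus one numeric field per series, with labels attached to each field. A long frame, where series are distinguished by string columns in every row (typical of SQL or table-style results), is rejected with a type error.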

input data must be a wide series but got type long

Fix: return data in a shape the expression accepts. For Prometheus, set the query to instant; when chaining expressions, add a Reduce step before Math so multi-dimensional data collapses to one value per series.
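The two shapes are easier to see side by side. The sketch below uses plain Python dicts rather than Grafana data frames, and `long_to_wide` is a hypothetical helper written for illustration: long rows carry the series identity in a label column, while wide rows carry one value column per series.

```python
from collections import defaultdict


def long_to_wide(rows, time_key, value_key, label_keys):
    """Pivot long rows (labels as columns) into one value column per series."""
    series_names = []
    by_time = defaultdict(dict)
    for row in rows:
        # Build a series name from the label columns, e.g. "host=a".
        name = ",".join(f"{key}={row[key]}" for key in label_keys)
        if name not in series_names:
            series_names.append(name)
        by_time[row[time_key]][name] = row[value_key]
    # One output row per timestamp, one column per series (None if missing).
    return [
        {time_key: ts, **{name: by_time[ts].get(name) for name in series_names}}
        for ts in sorted(by_time)
    ]


long_rows = [
    {"time": 0, "host": "a", "value": 1.0},
    {"time": 0, "host": "b", "value": 2.0},
    {"time": 60, "host": "a", "value": 1.5},
    {"time": 60, "host": "b", "value": 2.5},
]

for row in long_to_wide(long_rows, "time", "value", ["host"]):
    print(row)
```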

4. Datasource plugin panic or backend down

If the datasource backend is unreachable (Prometheus/Loki pod restarting, network policy), the query layer surfaces an execution error.

level=error msg="Failed to evaluate rule" err="[sse.dataQueryError] failed to execute query 'A': Get \"http://prometheus:9090/api/v1/query\": dial tcp: connection refused"
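A quick way to separate a network problem from a query problem is to test plain TCP reachability of the endpoint named in the error. This is a minimal sketch with hypothetical helper names; it should be run from the host or pod where Grafana runs, since network policies may differ elsewhere.

```python
import socket
from urllib.parse import urlsplit


def endpoint_from_url(url):
    """Extract (host, port) from a datasource URL, defaulting the port by scheme."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


host, port = endpoint_from_url("http://prometheus:9090/api/v1/query")
print(host, port)  # prometheus 9090
```

Calling `can_connect(host, port)` then reports whether the backend accepts connections at all; a False result points at the pod, service, or network policy rather than at the alert rule.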

Diagnostic Workflow

Step 1 — Read the rule health and error via the API.

# Grafana-managed rules with health/state
curl -s -u "admin:$GRAFANA_PW" \
  http://localhost:3000/api/prometheus/grafana/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.health=="error") | {name, health, lastError}'
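The same filter can be applied in a script, for example inside a periodic health check. This is a sketch assuming the response has the data.groups[].rules[] shape used by the jq filter; `rules_in_error` and the sample payload are hypothetical.

```python
def rules_in_error(payload):
    """Return name, group and lastError for every rule whose health is 'error'."""
    failing = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("health") == "error":
                failing.append({
                    "group": group.get("name"),
                    "name": rule.get("name"),
                    "lastError": rule.get("lastError"),
                })
    return failing


# Illustrative payload shaped like the rules API response.
sample = {
    "status": "success",
    "data": {"groups": [{
        "name": "infra",
        "rules": [
            {"name": "High CPU", "health": "error",
             "lastError": "failed to build query 'A': data source not found"},
            {"name": "Disk Full", "health": "ok", "lastError": ""},
        ],
    }]},
}

for rule in rules_in_error(sample):
    print(rule["group"], rule["name"], rule["lastError"])
```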

Step 2 — Inspect the provisioned rule definition.

curl -s -u "admin:$GRAFANA_PW" \
  http://localhost:3000/api/v1/provisioning/alert-rules | jq '.[] | {uid, title, condition}'

Step 3 — Grep the server logs for the evaluation error.

journalctl -u grafana-server --since "15 min ago" | grep -iE "eval|Failed to evaluate"
# In Kubernetes:
kubectl logs deploy/grafana -n monitoring | grep -i "evaluate rule"
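Once the matching lines are collected, they can be sorted into the root causes described earlier. The sketch below parses key=value log lines with shlex and matches substrings taken from the error messages in this guide; the cause labels and function names are hypothetical, and lines with unbalanced quotes are simply skipped.

```python
import shlex

# Substrings from the error messages in this guide, mapped to a likely cause.
CAUSES = [
    ("data source not found", "datasource UID does not resolve"),
    ("context deadline exceeded", "query exceeded the evaluation timeout"),
    ("must be a wide series", "expression received the wrong data shape"),
    ("connection refused", "datasource backend unreachable"),
]


def parse_logfmt(line):
    """Split a key=value log line into a dict, honouring double-quoted values."""
    try:
        tokens = shlex.split(line)
    except ValueError:
        return {}  # unbalanced quotes: treat the line as unparseable
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def classify_log_line(line):
    """Return (rule_uid, likely cause) for one evaluation error line."""
    fields = parse_logfmt(line)
    err = fields.get("err", "")
    for needle, cause in CAUSES:
        if needle in err:
            return fields.get("rule_uid"), cause
    return fields.get("rule_uid"), "unclassified"


line = '''level=error msg="Failed to evaluate rule" rule_uid=ceval01 org_id=1 err="failed to build query 'A': data source not found"'''
print(classify_log_line(line))  # ('ceval01', 'datasource UID does not resolve')
```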

Step 4 — Verify the datasource UID resolves.

curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/datasources \
  | jq '.[] | {name, uid, type}'

Step 5 — Confirm evaluation timeout settings.

[unified_alerting]
evaluation_timeout = 60s
max_attempts = 3

Example Root Cause Analysis

An on-call engineer noticed that the High CPU rule had shown Error for 40 minutes without an incident being raised. The dashboard panel with the identical node_load1 query rendered normally, which turned out to be a red herring.

Running the rules API returned:

"lastError": "failed to build query 'A': data source not found"

The team had re-provisioned Prometheus that morning through a new Helm release, which regenerated the datasource with UID prometheus-6f2a instead of the old prometheus-prod. The dashboard used a variable that resolved to the current default datasource, so it kept working — but the alert rule stored the stale UID in its JSON model.

Fix: they pinned the datasource UID by setting uid: prometheus-prod explicitly in the datasource provisioning, made sure datasourceUid in the alert rule provisioning file matched it, and redeployed. The rule health returned to ok. They also added a CI check asserting that datasource UIDs are stable across releases.
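The source does not describe how that CI check was built. One possible shape, sketched here under the assumption that the datasource lists from the previous and current release are already loaded as lists of name/uid dicts, is a simple diff keyed by datasource name; `uid_changes` and the datasource name "Prometheus" are hypothetical.

```python
def uid_changes(previous, current):
    """Return {name: (old_uid, new_uid)} for datasources whose UID changed or vanished."""
    prev = {ds["name"]: ds["uid"] for ds in previous}
    curr = {ds["name"]: ds["uid"] for ds in current}
    return {
        name: (old_uid, curr.get(name))
        for name, old_uid in prev.items()
        if curr.get(name) != old_uid
    }


# Illustrative data mirroring the incident above.
before = [{"name": "Prometheus", "uid": "prometheus-prod"}]
after = [{"name": "Prometheus", "uid": "prometheus-6f2a"}]

changes = uid_changes(before, after)
print(changes)  # {'Prometheus': ('prometheus-prod', 'prometheus-6f2a')}
```

A CI job could fail the pipeline whenever the returned dict is non-empty, which would have flagged the regenerated UID before the release reached production.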

Prevention Best Practices

  • Pin datasource UIDs explicitly in datasources.yaml provisioning so they survive redeploys.
  • Set evaluation_timeout higher than the runtime of your slowest alert query, and do not lower min_interval so far that rules are evaluated more often than their queries can finish.
  • Use instant queries plus an explicit Reduce for threshold rules to avoid long/wide type errors.
  • Configure per-rule Error handling deliberately: choose Error, Alerting, or OK on execution error rather than leaving defaults.
  • Alert on evaluation health itself: create a meta-alert on ALERTS{alertstate="firing", alertname="DatasourceError"}, assuming an ALERTS series for these alerts is available in your Prometheus.
  • Version-control all /etc/grafana/provisioning/alerting/ files and review UID changes in PRs.

See more Grafana troubleshooting in /categories/grafana/ and the sibling guide on contact point send failures.

Quick Command Reference

# Find rules currently in error state
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/prometheus/grafana/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.health=="error") | {name, lastError}'

# Dump provisioned alert rules
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/v1/provisioning/alert-rules | jq '.[].uid'

# List datasource UIDs
curl -s -u "admin:$GRAFANA_PW" http://localhost:3000/api/datasources | jq '.[] | {name, uid}'

# Tail evaluation errors
journalctl -u grafana-server -f | grep -i "evaluate rule"
kubectl logs -f deploy/grafana -n monitoring | grep -i "evaluate rule"

Conclusion

The top root causes of failed to evaluate rule in Grafana unified alerting:

  1. Datasource not found — the rule references a UID that no longer exists after a redeploy.
  2. Query timeout — heavy queries exceed [unified_alerting] evaluation_timeout (context deadline exceeded).
  3. Wrong data shape — feeding a long/range frame into a Reduce/Math expression (must be a wide series but got type long).
  4. Datasource backend down or panicking — the underlying Prometheus/Loki endpoint is unreachable.
  5. Misconfigured error handling — leaving default Error-state behavior so failures go unnoticed instead of paging.