Skip to content
DevOps AI ToolKit
Newsletter
All guides
AI for Prometheus & Monitoring By James Joyner IV · · 9 min read

Prometheus Error Guide: 'rule manager error evaluating rule' Runtime Evaluation Failure

Fix Prometheus rule manager 'Evaluating rule failed' errors: dedupe vector matches, make recording-rule labelsets unique, and tame heavy or many-to-one queries.

  • #prometheus-monitoring
  • #troubleshooting
  • #errors
  • #rules

Exact Error Message

The rule manager logs a per-evaluation failure each time a recording or alerting rule’s PromQL throws when run against live data:

level=warn ts=2026-06-27T09:14:02.118Z caller=manager.go:719 component="rule manager" file=/etc/prometheus/rules/app.yml group=app.rules name=job:http_errors:rate5m index=2 msg="Evaluating rule failed" err="found duplicate series for the match group {job=\"api\"} on the right hand-side of the operation: [{__name__=\"http_requests_total\", job=\"api\", instance=\"10.0.4.1:9100\"}, {__name__=\"http_requests_total\", job=\"api\", instance=\"10.0.4.2:9100\"}]; many-to-many matching not allowed: matching labels must be unique on one side"

A recording rule whose output collides with itself fails with a different string:

level=warn ts=2026-06-27T09:14:32.004Z caller=manager.go:719 component="rule manager" file=/etc/prometheus/rules/app.yml group=app.rules name=job:request:rate5m index=0 msg="Evaluating rule failed" err="vector cannot contain metrics with the same labelset"

When this fires, the rule’s result for that interval is dropped — nothing is recorded and no alert state is updated. On the /rules page (and via GET /api/v1/rules) the group shows the rule with Health=err and a populated LastError. An alerting rule in error does not fire, so a real incident can go unnoticed while the rule is silently failing.

What the Error Means

Unlike invalid rule group, which is a config-load failure that blocks the whole reload (the rule never even loads), an evaluation failure is a per-interval runtime error. The config loaded fine and promtool check rules passed — the PromQL is syntactically valid. It only breaks when the query is actually executed against live data, because the data triggers the error (duplicate series, a labelset collision, too many samples).

Each interval, the rule manager runs the rule’s expr. If the query engine returns an error, the manager logs msg="Evaluating rule failed", sets that rule’s health to err, stores the message as LastError, and skips it for that evaluation. It tries again on the next interval; if the data condition clears, the rule recovers automatically. Nothing is rejected at load time, so a /-/reload will not surface this — only the logs and /rules will.

Common Causes

  • Vector matching duplicate series. A binary op (/, *, and) where one side has more than one series for the same match-group labels — found duplicate series for the match group ... many-to-many matching not allowed.
  • Many-to-one ambiguity in a division that needs an explicit group_left / group_right.
  • Recording-rule labelset collision. The rule’s output produces two series with identical labels — vector cannot contain metrics with the same labelset. Often caused by aggregating away a label that was the only thing keeping series distinct.
  • Labels the rule adds duplicate existing ones. labels: on a recording rule that overwrite a label already present, merging two series into one.
  • Query hitting max_samplesquery processing would load too many samples into memory in query execution. A heavy expr (large range, high-cardinality matcher) exceeds --query.max-samples.
  • A subquery exceeding limits, or a [5m:] subquery that fans out far more series than expected.
  • A function applied to the wrong metric type (e.g. histogram_quantile on a non-bucket series, rate() on a gauge that surfaces only at certain data shapes).
  • A slow rule timing out when the group’s evaluation exceeds its interval.

How to Reproduce the Error

Write a syntactically valid rule whose division has duplicate series on the right-hand side:

# /etc/prometheus/rules/app.yml
groups:
  - name: app.rules
    interval: 30s
    rules:
      # http_requests_total is per-instance; dividing by it without
      # aggregation leaves multiple instances per job -> duplicate match group
      - record: job:http_errors:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (job)
          / rate(http_requests_total[5m]) > 0

promtool check rules passes (the PromQL parses), but at evaluation time the right-hand side has one series per instance for each job, so the match group {job="api"} is not unique and the rule logs found duplicate series for the match group ... many-to-many matching not allowed. A recording rule that aggregates away instance while another rule writes the same metric name reproduces vector cannot contain metrics with the same labelset.

Diagnostic Commands

All of these are read-only. First, list every rule currently in error with its lastError:

curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[].rules[] | select(.health=="err") | {name, lastError, type}'
{
  "name": "job:http_errors:rate5m",
  "lastError": "found duplicate series for the match group {job=\"api\"} ...",
  "type": "recording"
}

Copy that rule’s expr and run it directly to see the real PromQL error the engine returns:

curl -s --data-urlencode 'query=sum(rate(http_requests_total{code=~"5.."}[5m])) by (job) / rate(http_requests_total[5m])' \
  http://localhost:9090/api/v1/query | jq '.status, .error'
"error"
"found duplicate series for the match group {job=\"api\"} on the right hand-side ..."

Pull the rule manager’s own log lines for the failures:

journalctl -u prometheus -n 80 --no-pager | grep -i 'Evaluating rule failed'

Spot slow groups (a rule timing out shows up as a high evaluationTime near or above the group interval):

curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | {name, evaluationTime}'

You can also watch the failure counter trend over time:

rate(prometheus_rule_evaluation_failures_total[5m])

Step-by-Step Resolution

1. List the failing rules. Use the /api/v1/rules filter above to get every rule with health=="err" plus its lastError and whether it is a recording or alerting rule.

2. Run the expr directly. Paste the rule’s expr into /api/v1/query. The instant query returns the same error the rule manager hit, but interactively — this is the fastest way to see the real PromQL problem.

3. Fix vector matching. For found duplicate series for the match group, make the match group unique. Aggregate the right-hand side to the same labels as the left, or add explicit matching:

sum(rate(http_requests_total{code=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)

For a legitimate many-to-one (e.g. dividing per-instance by a per-job total), use group_left / group_right:

rate(http_requests_total{code=~"5.."}[5m])
/ on(job) group_left
sum(rate(http_requests_total[5m])) by (job)

Where two instances should be one, dedupe with max by(...) (or sum/avg) to collapse the extra series.

4. Make recording-rule output unique. For vector cannot contain metrics with the same labelset, add the aggregation that keeps series distinct, or stop dropping the differentiating label. Check that the rule’s labels: block does not add a label that duplicates one already on the series and merges them:

- record: job:request:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)   # keep `job`, don't collapse to one series

5. Address resource limits. For query processing would load too many samples, tighten the matcher, shorten the range, or split a heavy rule into smaller ones; only raise --query.max-samples if the cost is acceptable. For timeouts, lengthen the group interval or break the group up.

6. Fix type mismatches. Ensure histogram_quantile runs on _bucket series with le, and that rate/increase are applied to counters, not gauges.

7. Reload after editing the rule file. An expr change in a rule file is read by re-loading the file — you do need to reload after editing it so Prometheus picks up the new expr:

promtool check rules /etc/prometheus/rules/app.yml \
  && curl -s -XPOST http://localhost:9090/-/reload \
  && curl -s http://localhost:9090/api/v1/rules \
       | jq '.data.groups[].rules[] | select(.health=="err") | .name'

After the next evaluation interval, the rule’s health should flip back to ok and the LastError clears.

Prevention and Best Practices

  • Always aggregate both sides of a binary operation to the same label set, or be explicit with on() / group_left / group_right. Unscoped divisions are the top cause of found duplicate series.
  • Follow the level:metric:operations recording-rule naming convention and keep each record: metric written by exactly one rule, so labelset collisions are obvious.
  • Add promtool test rules unit tests with realistic multi-instance input — syntax checks pass, but only sample data reproduces matching and labelset errors.
  • Alert on prometheus_rule_evaluation_failures_total and on prometheus_rule_group_last_duration_seconds approaching the group interval so silent failures and slow groups surface.
  • Watch prometheus_rule_group_iterations_missed_total; missed iterations mean a group is too heavy and rules are being skipped.
  • Keep an eye on --query.max-samples; size heavy rules to fit, rather than raising the limit globally.
  • invalid rule group / found error when loading rules — the config-time counterpart. That one blocks the reload and the rule never loads; this one loads fine and fails only at runtime against live data.
  • query timed out / too many samples — the same max_samples and timeout limits that break heavy ad-hoc queries also break heavy recording rules during evaluation.
  • vector cannot contain metrics with the same labelset — the specific evaluation error for a recording rule whose output series collide.

Frequently Asked Questions

Is this the same as invalid rule group? No. invalid rule group is a config-load failure that rejects the whole reload before any rule runs. Evaluating rule failed happens at runtime — the rule loaded fine and only breaks when its PromQL executes against live data. Different cause, different fix, different place to look (logs and /rules, not the reload response).

Do I lose data when a rule fails to evaluate? Yes, for that interval. The rule’s result is dropped: a recording rule records nothing and an alerting rule does not update its state (so it cannot fire). Prometheus retries on the next interval and recovers automatically once the data condition clears.

Why does promtool check rules pass but the rule still errors? check rules only validates that the PromQL parses and durations/templates are well-formed. It does not run the query against your data, so data-dependent errors (duplicate series, labelset collisions, sample limits) only appear at evaluation time.

How do I find which rules are currently failing? Query GET /api/v1/rules and filter for health=="err"; each failing rule carries a lastError you can read directly, then re-run its expr against /api/v1/query to debug interactively.

Do I need to reload after fixing the expr? Yes. The fix lives in the rule file, and Prometheus only re-reads rule files on reload. Run promtool check rules, then POST /-/reload, then confirm the rule’s health returns to ok on the next evaluation.

Free download · 368-page PDF

Download the Free 500-Prompt DevOps AI Toolkit

500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.

  • 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
  • Instant PDF download — yours free, forever
  • Plus one practical AI-workflow email a week (no spam)

Single opt-in · unsubscribe anytime · no spam.