Unit Testing Prometheus Alert Rules With Promtool and AI

Most teams I’ve worked with deploy alert rules with zero automated tests. The rule looks right, it merges, and the first real test is the incident it was supposed to catch. That’s an absurd way to ship code that pages humans, and promtool test rules has existed for years to fix it — but writing the test fixtures by hand is tedious enough that almost nobody does. This is exactly the kind of tedious-but-deterministic work AI is good at, so I started generating my alert tests with a model. The catch, as always, is that the AI can write a test that passes without proving anything useful. Here’s how I generate alert rule tests that actually mean something.

What promtool tests actually check

A promtool rule test feeds synthetic time series into your real alert expression and asserts which alerts fire and when. You define an input series with values over time, then assert the expected alert state at a given evaluation point. It’s a genuine unit test for a piece of unsupervised production logic, and AI can write the YAML scaffolding instantly:

rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{job="api", code="500"}'
        values: '0+10x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: warning
              job: api

The 0+10x10 notation means “start at 0, then add 10 at each of the next 10 steps,” which expands to 11 samples (0, 10, ..., 100) and models a constant rate of errors. AI knows this syntax well, which is most of why generating these is fast.
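For reference, here is a short sketch of how the expanding notation unrolls. The metric name example_metric and its case labels are placeholders, not part of any real rule; the expansions follow promtool’s documented a+bxn behavior, where n counts the samples added after the initial value.

```yaml
# Placeholder series that only illustrate the values notation.
input_series:
  - series: 'example_metric{case="ramp"}'
    values: '0+10x3'      # expands to: 0 10 20 30  (initial value plus 3 more samples)
  - series: 'example_metric{case="flat"}'
    values: '1x3'         # expands to: 1 1 1 1
  - series: 'example_metric{case="falling"}'
    values: '10-2x3'      # expands to: 10 8 6 4
  - series: 'example_metric{case="gap"}'
    values: '1 _x2 4'     # expands to: 1 _ _ 4     (_ is a missing sample)
  - series: 'example_metric{case="stale"}'
    values: '1 1 stale'   # stale writes an explicit staleness marker
```

The off-by-one is the part worth remembering: '1x3' yields four samples, while '_x3' yields exactly three gaps.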

Make the AI write the failing case first

A test that only proves the alert fires when it should is half a test. The dangerous bug is the alert that fires when it shouldn’t, and that’s the case the model skips unless you demand it. For every rule I ask for three tests: the alert fires on a real problem, the alert stays silent on a non-problem, and the alert respects its for: duration. The second one is where the value is:

# The negative test: a brief blip must NOT page
- interval: 1m
  input_series:
    - series: 'http_requests_total{job="api", code="500"}'
      values: '0 0 50 50 50 50 50'   # one burst of 50 errors, then flat (a counter dropping back to 0 would be a reset)
  alert_rule_test:
    - eval_time: 6m
      alertname: HighErrorRate
      exp_alerts: []   # assert NOTHING fired

If a one-minute spike pages on-call, the for: is too short, and this test proves it before production does. The AI will happily skip this case because the positive test feels complete. It isn’t.

Test the for: boundary precisely

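The single most common alert bug is a wrong for: duration, and it’s invisible without a test. I have the model write a fixture that holds the bad condition for exactly one minute less than the for: window and asserts the alert is still pending, then another at exactly the threshold asserting it’s firing. The pending clock starts at the first evaluation where the expression returns a result, not at the start of the series, so count both eval_time values from there; a rate()-based expression needs at least two samples before it returns anything. That boundary is where the real behavior lives:

alert_rule_test:
  - eval_time: 4m    # for: 5m, expression true since 0m, so held 4m
    alertname: HighErrorRate
    exp_alerts: []    # still pending; promtool lists only firing alerts
  - eval_time: 5m    # held for the full 5m, so now firing
    alertname: HighErrorRate
    exp_alerts:
      - exp_labels: { severity: warning, job: api }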
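The boundary is easiest to reason about on a rule whose expression is true from the very first sample. The sketch below assumes a hypothetical InstanceDown rule on up with for: 5m; the file names, group name, and test names are illustrative, and the two YAML documents stand for two separate files. The second test covers a related case: a one-minute recovery discards the pending alert, so the clock restarts.

```yaml
# instance_down.yml (hypothetical rule file)
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up{job="api"} == 0
        for: 5m
---
# instance_down_test.yml (hypothetical test file)
rule_files:
  - instance_down.yml
evaluation_interval: 1m
tests:
  - name: fires exactly at the for boundary
    interval: 1m
    input_series:
      - series: 'up{job="api", instance="a"}'
        values: '0x10'          # down from 0m onward
    alert_rule_test:
      - eval_time: 4m           # pending since 0m, held 4m
        alertname: InstanceDown
        exp_alerts: []
      - eval_time: 5m           # held 5m, now firing
        alertname: InstanceDown
        exp_alerts:
          - exp_labels: { job: api, instance: a }
  - name: a recovery restarts the pending clock
    interval: 1m
    input_series:
      - series: 'up{job="api", instance="a"}'
        values: '0x3 1 0x10'    # down 0m-3m, up at 4m, down again from 5m
    alert_rule_test:
      - eval_time: 9m           # pending since 5m, held only 4m
        alertname: InstanceDown
        exp_alerts: []
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels: { job: api, instance: a }
```

For a rate()-based rule the same pattern applies, but both eval_time values shift later by however long the expression takes to first exceed its threshold.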

This is precisely the kind of detail a human gets wrong by hand and the model gets right when asked — but only when asked.

Pro Tip: Before accepting an AI-generated test, change the threshold in the alert rule itself by a small amount and rerun promtool test rules. If the test still passes, it isn’t actually pinning the behavior you care about — a good test should break when the rule changes. AI tests that pass no matter what are theater, and this catches them in one command.

Counter resets and staleness are easy to forget

Real metrics restart, go stale, and reset. A test built only on smooth synthetic data never exercises those paths, which are exactly where alerts fail in production. I ask the model to add a fixture where the counter resets mid-window — promtool’s series notation supports gaps and resets — and assert the rate()-based alert handles it correctly:

input_series:
  - series: 'http_requests_total{job="api", code="500"}'
    values: '0+10x5 0+10x5'   # each half expands to 6 samples (0..50), so the counter resets at the 7th sample

If the alert drops out or misfires across the reset, the expression is probably using delta() or a raw subtraction where it should use rate() or increase(), both of which compensate for counter resets, and the test catches a real resilience bug. This is the same failure mode that bites AI-generated rules in production, so testing for it explicitly closes the loop.
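Gaps and staleness can be sketched the same way as resets. The example below is an assumption-laden illustration rather than a drop-in test: it presumes a hypothetical instance_down.yml defining InstanceDown as up{job="api"} == 0 with for: 5m and no extra labels, chosen because its timing is easier to follow than a rate()-based rule. It relies on Prometheus’s default five-minute lookback, under which a missed scrape still sees the previous sample, while an explicit staleness marker makes the series disappear immediately.

```yaml
rule_files:
  - instance_down.yml           # hypothetical; see the assumptions above
evaluation_interval: 1m
tests:
  - name: missed scrapes do not reset the pending clock
    interval: 1m
    input_series:
      - series: 'up{job="api", instance="a"}'
        values: '0 _x2 0x7'     # 0 at 0m, no samples at 1m and 2m, then 0 from 3m to 10m
    alert_rule_test:
      - eval_time: 5m           # lookback bridges the gap, so pending since 0m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels: { job: api, instance: a }
  - name: a series that goes stale stops alerting
    interval: 1m
    input_series:
      - series: 'up{job="api", instance="a"}'
        values: '0x3 stale'     # 0 from 0m to 3m, staleness marker at 4m
    alert_rule_test:
      - eval_time: 6m           # the marker removed the series before for: elapsed
        alertname: InstanceDown
        exp_alerts: []
```

Whether a vanished target should page at all is a separate decision; if it should, that usually calls for an absent()-style rule with its own test.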

Run it in CI, not just locally

The whole point collapses if the tests only run on my laptop. I wire promtool test rules tests/*.yml into CI so no alert rule change merges without its tests passing. Because the rules and their tests live together in Git, our code review dashboard flags any rule change that arrives without a corresponding test update, which is the social mechanism that keeps the discipline alive after the initial enthusiasm fades.
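As one possible wiring, here is a minimal sketch of a CI job. It assumes GitHub Actions and the public prom/prometheus container image (which ships promtool); the workflow path, job name, and image tag are illustrative choices, not something the setup described above prescribes.

```yaml
# .github/workflows/alert-rule-tests.yml (hypothetical path)
name: alert-rule-tests
on:
  pull_request:
jobs:
  promtool:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run promtool rule tests
        # Assumption: promtool is taken from the prom/prometheus image.
        # Pin a specific version tag instead of latest in real use.
        run: |
          docker run --rm \
            -v "$PWD:/work" -w /work \
            --entrypoint promtool \
            prom/prometheus:latest \
            test rules tests/*.yml
```

promtool exits non-zero when any test fails, so marking this job as a required check is enough to block the merge.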

Test the multi-series and label-matching cases

Most AI-generated tests use a single input series, which never exercises the part of an alert that bites hardest in production: how it behaves when many series match. An alert with sum by (instance) or a max() aggregation behaves completely differently with ten matching series than with one, and a single-series test can pass while the real aggregation is wrong. I have the model write a fixture with several instances where only one is unhealthy, and assert exactly which alerts fire and with which labels:

input_series:
  - series: 'up{job="api", instance="a"}'
    values: '1x10'
  - series: 'up{job="api", instance="b"}'
    values: '1x10'
  - series: 'up{job="api", instance="c"}'
    values: '0x10'   # only c is down
alert_rule_test:
  - eval_time: 5m
    alertname: InstanceDown
    exp_alerts:
      - exp_labels: { job: api, instance: c }   # exactly one, the right one

This catches the aggregation bugs — a sum hiding a single failure behind a healthy total, or a missing by (instance) that collapses ten instances into one ambiguous alert. The model writes the multi-series fixture fast once asked; it just won’t think to ask itself.
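To make the “healthy total hides a failure” case concrete, here is a sketch that puts a per-instance rule next to an aggregated one. Both rule definitions are assumptions for illustration (ApiDown in particular is a hypothetical name), and the two YAML documents stand for two separate files. With one of three instances down, the per-instance rule fires for c while the summed rule stays silent.

```yaml
# availability.yml (hypothetical rule file)
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up{job="api"} == 0                  # one alert per down instance
        for: 5m
      - alert: ApiDown
        expr: sum by (job) (up{job="api"}) == 0   # fires only when every instance is down
        for: 5m
---
# availability_test.yml (hypothetical test file)
rule_files:
  - availability.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="a"}'
        values: '1x10'
      - series: 'up{job="api", instance="b"}'
        values: '1x10'
      - series: 'up{job="api", instance="c"}'
        values: '0x10'                            # only c is down
    alert_rule_test:
      - eval_time: 5m
        alertname: ApiDown
        exp_alerts: []                            # the sum is 2, so the aggregate sees nothing wrong
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels: { job: api, instance: c }
```

One detail to keep in mind when reviewing generated tests: exp_labels has to list every label on the alert except alertname, so a rule that adds severity needs it in the expectation too.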

Where the AI is genuinely fast vs. where you decide

The honest division of labor: the model writes the fixture YAML, the series notation, and the assertion scaffolding faster than I ever could — that’s the fast-junior-engineer part. What it can’t decide is what the test should prove. Should a 90-second error spike page? Is a counter reset expected here? Those are judgments about your service and your on-call tolerance, and the model has no basis for them. So I specify the scenarios in plain language, let it generate the YAML, and then verify each assertion expresses something I actually believe. The test has to be explainable — “this proves a brief blip doesn’t page” — before it earns a place in CI. I keep my standard scenario prompts in the prompt workspace so generating a full test suite for a new rule is a two-minute job.

Conclusion

Unit testing alert rules is the cheapest way to stop shipping pagers that fire wrong, and AI removes the tedium that kept teams from doing it. But generate the negative tests, pin the for: boundary, exercise counter resets, and verify each test breaks when the rule changes — otherwise you’ve automated reassurance instead of correctness. Pair this with the review habits in the monitoring guides, and draft the rules themselves with the Alert Rule Generator so they arrive testable. Reusable test-generation prompts are in the prompts library.