Designing Alert Rules That Don't Page You Falsely

The fastest way to destroy an on-call rotation is a noisy pager. Once people learn that two-thirds of pages are nothing, they stop reading them carefully, and the one page that mattered gets acknowledged and ignored at 3am. Alert fatigue isn’t a personality flaw — it’s a design failure. After a couple decades of carrying pagers, here’s how I write Prometheus rules that earn their interruptions.

Page on symptoms, not causes

The first principle: page a human only when a user is being hurt. High CPU is not a user-facing problem. Elevated error rate, latency past your SLO, a queue that won’t drain — those are.

This flips a lot of people’s instincts. They want an alert for every disk at 80%, every pod restart, every CPU spike. Most of those should be dashboards or tickets, not pages. If the system is degraded but users are fine, it can wait until morning.

# GOOD: pages on user-facing symptom
- alert: HighErrorRate
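  # Aggregate by service so {{ $labels.service }} in the summary is populated;
  # a bare sum() would drop every label.
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by (service) (rate(http_requests_total[5m])) > 0.05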
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Error rate above 5% for {{ $labels.service }}"

The `for` clause is your best friend

The single highest-leverage field in an alert rule is `for`. It says “the condition must hold continuously for this long before firing.” It’s what separates a transient blip from a real problem.

Without it, every momentary spike pages you. With `for: 10m`, the alert only fires if the problem persists. Almost every flappy alert I’ve ever fixed was fixed by adding or lengthening `for`.

Pick the duration from how long you’d actually wait before caring. A checkout outage: `for: 2m`. A slowly filling disk: `for: 30m`. Match the urgency.
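
As a rough sketch of matching the hold time to urgency, the two rules below pair a short wait for checkout with a long one for disk. The `checkout_requests_total` metric, the alert names, and every threshold are assumptions for illustration, not values from a real system.

```yaml
# Hypothetical rules; metric names and thresholds are placeholders.
- alert: CheckoutErrorsHigh
  expr: |
    sum(rate(checkout_requests_total{status=~"5.."}[5m]))
    / sum(rate(checkout_requests_total[5m])) > 0.05
  for: 2m            # revenue path: page after a short hold
  labels:
    severity: page

- alert: DiskSpaceLow
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15
  for: 30m           # slow burn: only act if it persists
  labels:
    severity: ticket
```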

Rate over windows, never instant values

Alerting on a single sample is how you page on noise. A network hiccup drops one scrape, a value momentarily looks insane, you wake up.

Always alert on a rate or an average over a window long enough to smooth out single-sample weirdness:

# fragile: one bad sample pages you
expr: node_filesystem_avail_bytes < 1e9

# robust: sustained low free space, trend-aware
expr: |
  predict_linear(node_filesystem_avail_bytes[6h], 4*3600) < 0
  and node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
for: 15m

The disk example pages when it will run out within four hours and is already under 10% free — not the instant it dips below an arbitrary byte count.
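
For gauges that are not counters, one way to get similar smoothing is a `*_over_time` function. A minimal sketch, assuming a hypothetical `queue_depth` gauge and a made-up threshold of 1000:

```yaml
# fragile: a single high sample satisfies the condition
- alert: QueueBacklogInstant
  expr: queue_depth > 1000

# sturdier: even the lowest sample in the last 15 minutes is above the
# threshold, meaning the queue never drained during the window
- alert: QueueBacklogSustained
  expr: min_over_time(queue_depth[15m]) > 1000
  for: 5m
```

Using `min_over_time` here is a design choice rather than the only option; `avg_over_time` is more forgiving if brief dips below the threshold should not reset the alert.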

Make absence detectable

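A subtle failure mode: your alert query returns no data and therefore never fires. When an exporter dies, its metrics vanish, so rules built on them go silent instead of firing. Prometheus records `up` for every scrape target on its own, which makes `up == 0` your safety net.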

- alert: TargetDown
  expr: up{job="my-service"} == 0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.instance }} has been down for 5 minutes"

Pair symptom alerts with an `up`-based liveness alert so a dead scrape target can’t hide a real outage behind silence.
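
One caveat worth sketching: `up == 0` only matches targets Prometheus still knows about. If service discovery drops the target entirely, the `up` series disappears too and the comparison returns nothing. A hedged complement is `absent()`, shown here with the same `my-service` job label as the rule above; the alert name and the `for` duration are assumptions.

```yaml
# Fires when no up series exists at all for the job,
# e.g. the target was removed from service discovery.
- alert: TargetMissing
  expr: absent(up{job="my-service"})
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "No up series found for job my-service"
```

Because `absent()` can only carry over labels from equality matchers in the selector, the summary names the job directly instead of templating an instance.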

Three severities, three destinations

I keep alert severity to three buckets, and routing follows from the label:

page — wake a human now. User impact, right now.
ticket — needs attention this week. Degraded but not urgent.
info — record it, don’t notify. Dashboards and audit.

labels:
  severity: page    # or ticket / info

Your Alertmanager config then routes page to PagerDuty, ticket to a Jira webhook or Slack, and drops info into a log channel. This keeps the pager sacred: if it buzzes, it’s real.
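
The routing described above might look roughly like the following Alertmanager fragment. Receiver names, the webhook URL, the Slack channel, and the key placeholders are all hypothetical; substitute your own integrations.

```yaml
route:
  receiver: info-log              # default for anything not matched below
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity = "page"
      receiver: pagerduty-oncall
    - matchers:
        - severity = "ticket"
      receiver: ticket-webhook

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<your-pagerduty-integration-key>"   # placeholder
  - name: ticket-webhook
    webhook_configs:
      - url: "https://example.internal/alert-to-ticket"   # hypothetical endpoint
  - name: info-log
    slack_configs:
      - channel: "#alerts-log"                            # hypothetical channel
        api_url: "https://hooks.slack.com/services/<placeholder>"
```

Making the quiet receiver the default means an alert with a missing or misspelled severity label lands in the log channel rather than on the pager. Whether that is the right failure direction for your team is a judgment call.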

Write annotations for your 3am self

The alert that fires is read by someone half-asleep. Give them a head start. Every alert should answer “what broke, where, and where do I look next” in the annotations.

annotations:
  summary: "p99 latency {{ $value | humanizeDuration }} on {{ $labels.service }}"
  description: "Above SLO for 10m. Runbook: https://wiki/runbooks/latency"
  dashboard: "https://grafana/d/abc/service-overview"

A link to the runbook and the dashboard in the alert itself saves five minutes of fumbling per incident, and five minutes at 3am feels like an hour.
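
To show where those annotations sit, here is a sketch of a complete latency rule around them. The histogram name `http_request_duration_seconds_bucket`, the 0.5 second threshold, and the alert name are assumptions; the annotations are the ones above.

```yaml
- alert: HighP99Latency
  # Keep the service label (and le, which histogram_quantile needs)
  # so the summary template has something to print.
  expr: |
    histogram_quantile(
      0.99,
      sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
    ) > 0.5
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "p99 latency {{ $value | humanizeDuration }} on {{ $labels.service }}"
    description: "Above SLO for 10m. Runbook: https://wiki/runbooks/latency"
    dashboard: "https://grafana/d/abc/service-overview"
```

`humanizeDuration` expects seconds, so this only reads correctly if the histogram is recorded in seconds.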

Test before you trust

Prometheus ships promtool, which runs unit tests against your rules. Write a test that feeds in a synthetic time series and asserts the alert fires (or doesn’t). This catches the classic bug where a label typo means the alert can never match.

promtool test rules alert_tests.yaml

I won’t merge a new alert without a test that proves it fires on the bad case and stays silent on the good one. It’s the cheapest insurance in monitoring.
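
A minimal sketch of what `alert_tests.yaml` could contain for the TargetDown rule shown earlier. It assumes the rule lives in a file named `alerts.yaml`, wrapped in the usual `groups:` / `rules:` structure; the file name and the instance names are made up for the test.

```yaml
rule_files:
  - alerts.yaml                 # assumed file containing the TargetDown rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # bad case: healthy for 3 samples, then down from t=3m onward
      - series: 'up{job="my-service", instance="app-1:9100"}'
        values: '1 1 1 0 0 0 0 0 0 0 0 0'
      # good case: healthy the whole time
      - series: 'up{job="my-service", instance="app-2:9100"}'
        values: '1+0x11'
    alert_rule_test:
      # 2 minutes into the outage: still pending, nothing should fire
      - eval_time: 5m
        alertname: TargetDown
        exp_alerts: []
      # 7 minutes into the outage: past the 5m hold, only app-1 fires
      - eval_time: 10m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              severity: page
              job: my-service
              instance: "app-1:9100"
            exp_annotations:
              summary: "app-1:9100 has been down for 5 minutes"
```

The healthy `app-2` series is what proves the rule stays silent on the good case: if the selector or comparison were wrong, a second alert would show up and the expectation at 10m would fail.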

Where AI helps

Drafting the shape of a good rule (symptom-based expression, sensible `for`, useful annotations) is repetitive, and that’s exactly what AI is good at. I describe the symptom in plain language (“page me when the payment service’s error rate stays above 5% for ten minutes”) and let it produce the YAML with the rate-based expression and a `for` clause already in place.

It won’t know your real thresholds or metric names, so you always tune and verify. But starting from a structurally correct draft beats hand-typing YAML at midnight. Our Alert Rule Generator bakes these conventions in, and we keep a library of monitoring prompts for the cases it doesn’t cover.

The test of a good alerting system

Here’s how you know your alerts are good: people trust the pager. When it fires, they assume something is genuinely wrong, because experience has taught them it usually is. That trust is the whole point. Every false page chips away at it, and every well-designed rule (symptom-based, windowed, with a sensible `for` and a runbook link) builds it back.

Aim for a pager that’s quiet most of the time and right when it isn’t. That’s not luck. It’s design.

Generated alert rules are assistive, not authoritative. Always tune thresholds to your own system and test rules with promtool before deploying.