Cutting Alert Noise: Designing Alerts Engineers Actually Trust

The fastest way to break an on-call rotation isn’t a big outage. It’s a steady drip of alerts that page someone at 3am, resolve themselves by the time they open the laptop, and turn out to mean nothing. Do that for a few weeks and you’ve trained your best engineers to silence the pager and roll over. The next time it’s real, nobody believes it.

I’ve inherited rotations where the on-call engineer got 40 pages a night. None of them were actionable. The fix was never “tune one threshold” — it was rethinking what an alert is for.

An alert is a request for a human to do something now

That’s the entire test. If the alert fires and the right response is “acknowledge and go back to sleep,” it should not have paged. Every page you send is a withdrawal from a finite trust account. Spend it only when a human genuinely needs to act in the next few minutes.

This gives you a clean three-way split for any monitoring signal:

Page — wake a human, something is broken for users right now.
Ticket — needs attention this week, not tonight (disk at 70%, cert expiring in 20 days).
Dashboard — useful context, no notification at all.

Most teams page on all three. That’s the root cause of alert fatigue.
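The three-way split can be written down as a tiny routing function. This is a minimal sketch, not a real alerting API: the `Signal` fields and the example signal names are hypothetical, chosen only to mirror the two questions asked above (is it broken for users right now, and is there anything for a human to do).

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD = "dashboard"


@dataclass(frozen=True)
class Signal:
    # Hypothetical fields for illustration only.
    name: str
    user_impact_now: bool     # is something broken for users right now?
    needs_human_action: bool  # is there anything for a person to do at all?


def route(signal: Signal) -> Route:
    """Apply the page / ticket / dashboard split to one monitoring signal."""
    if signal.user_impact_now and signal.needs_human_action:
        return Route.PAGE
    if signal.needs_human_action:
        return Route.TICKET
    return Route.DASHBOARD


if __name__ == "__main__":
    for s in (
        Signal("checkout_error_rate_high", True, True),
        Signal("disk_at_70_percent", False, True),
        Signal("cpu_at_85_percent", False, False),
    ):
        print(s.name, "->", route(s).value)
```

The point of the ordering is that `PAGE` is the hardest branch to reach: a signal has to pass both questions before it is allowed to wake anyone.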

Alert on symptoms, not causes

The single highest-leverage change is to move your paging alerts up the stack to where users feel pain.

Cause-based alerts — “CPU above 80%,” “a pod restarted,” “queue depth above 500” — fire constantly and correlate weakly with actual user impact. CPU at 85% might be perfectly healthy. A pod restart might be a routine rollout.

Symptom-based alerts fire on the things your SLOs are written against:

Error rate on the checkout endpoint above 2% for 5 minutes
p99 latency above 800ms for 5 minutes
Successful-request rate below 99% for 5 minutes

When a symptom alert fires, something a user cares about is actually wrong. The cause-based metrics still exist — but as dashboard context you consult during triage, not as pages.

A good rule: every paging alert should map to a customer-visible SLO. If you can’t name the SLO it protects, it probably shouldn’t page.
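To make "error rate above 2% for 5 minutes" concrete, here is a rough sketch of the check in plain Python. It assumes you already have per-minute `(errors, requests)` counts for the endpoint. That bucket format and the function names are my own simplification, and a real monitoring system evaluates this differently (rates over a sliding window), so treat it as an illustration of the logic rather than a rule you can deploy.

```python
from typing import Sequence, Tuple

# (error_count, request_count) for each one-minute bucket, oldest first.
# This bucket shape is an assumption made for the example.
Bucket = Tuple[int, int]


def error_rate(bucket: Bucket) -> float:
    errors, requests = bucket
    # With no traffic there are no observed failures; return 0.0
    # instead of dividing by zero.
    return errors / requests if requests else 0.0


def should_page(
    buckets: Sequence[Bucket],
    threshold: float = 0.02,  # 2% error rate
    minutes: int = 5,         # condition must hold this long
) -> bool:
    """True only if every one of the last `minutes` buckets exceeds the threshold."""
    if len(buckets) < minutes:
        return False
    return all(error_rate(b) > threshold for b in buckets[-minutes:])


if __name__ == "__main__":
    one_bad_minute = [(1, 1000)] * 4 + [(50, 1000)]
    sustained = [(30, 1000)] * 5
    print("single spike pages:", should_page(one_bad_minute))
    print("sustained errors page:", should_page(sustained))
```

A single bad minute does not page here; five consecutive bad minutes do. That is the whole difference between a symptom alert people trust and one they mute.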

The four properties of a trustworthy alert

When I audit an alert, I check four things:

Actionable — there’s a clear human action. If the response is always “wait and see,” it’s a dashboard.
Urgent — it genuinely can’t wait until morning. If it can, it’s a ticket.
Linked to a runbook — the alert payload includes a link to “what do I do about this.” An alert with no runbook is a puzzle handed to someone half-asleep.
Tuned for duration, not instants — alert on “above threshold for N minutes,” never on a single scrape. Transient spikes are noise.
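The four checks are mechanical enough to run as a lint over your paging rules. The sketch below assumes a hypothetical `PagingAlert` record that you would fill in by hand or from your own rule files; none of these field names come from a real tool, and the runbook URL is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PagingAlert:
    # Hypothetical record describing one paging rule.
    name: str
    action: Optional[str]          # what the responder is expected to do
    can_wait_until_morning: bool   # True means it is not urgent
    runbook_url: Optional[str]     # link included in the alert payload
    for_minutes: int               # how long the condition must hold; 0 = single scrape


def audit(alert: PagingAlert) -> List[str]:
    """Return the list of trust problems for one alert; empty means it passes."""
    problems = []
    if not alert.action:
        problems.append("not actionable: demote to dashboard")
    if alert.can_wait_until_morning:
        problems.append("not urgent: demote to ticket")
    if not alert.runbook_url:
        problems.append("no runbook link in payload")
    if alert.for_minutes <= 0:
        problems.append("fires on a single scrape: add a duration")
    return problems


if __name__ == "__main__":
    good = PagingAlert(
        "CheckoutErrorRate", "roll back the last deploy", False,
        "https://runbooks.example.com/checkout-errors",  # placeholder URL
        5,
    )
    bad = PagingAlert("HighCPU", None, True, None, 0)
    for alert in (good, bad):
        print(alert.name, audit(alert) or "ok")
```

An alert that returns an empty list has earned the right to page. Anything else gets fixed or demoted before it goes back into rotation.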

A practical alert audit

You don’t need a project to start cleaning this up. Pull the last 30 days of paging alerts and put them in a table:

Alert	Times fired	Times actionable	Action taken	Verdict
HighCPU	84	0	none	demote to dashboard
CheckoutErrorRate	3	3	rollback	keep
PodRestarted	51	1	none	demote to ticket
DiskWarning	12	12	cleanup next day	demote to ticket

Anything that fired more than a handful of times with zero actions is noise. Demote it. This single exercise typically removes more than half the page volume, and it’s the most credibility you’ll ever build with your on-call team in an afternoon.
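If the firing history is already in a spreadsheet or export, the first pass of the audit is a few lines of code. This sketch uses the numbers from the table above. The `AlertHistory` type and the choice of 5 as the cutoff for "a handful" are my assumptions, so adjust the cutoff to whatever your team considers tolerable.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertHistory:
    name: str
    times_fired: int
    times_actionable: int


def actionable_ratio(h: AlertHistory) -> float:
    """Fraction of firings that led to a human doing something."""
    return h.times_actionable / h.times_fired if h.times_fired else 0.0


def noise_candidates(history: List[AlertHistory], handful: int = 5) -> List[str]:
    """Alerts that fired more than `handful` times with zero actions."""
    return [
        h.name
        for h in history
        if h.times_fired > handful and h.times_actionable == 0
    ]


# Rows copied from the audit table above.
HISTORY = [
    AlertHistory("HighCPU", 84, 0),
    AlertHistory("CheckoutErrorRate", 3, 3),
    AlertHistory("PodRestarted", 51, 1),
    AlertHistory("DiskWarning", 12, 12),
]

if __name__ == "__main__":
    print("pure noise:", noise_candidates(HISTORY))
    # Worst offenders first: lowest actionable ratio at the top.
    for h in sorted(HISTORY, key=actionable_ratio):
        print(f"{h.name}: {h.times_actionable}/{h.times_fired} actionable")
```

The zero-action filter only catches the obvious cases. Sorting by ratio surfaces the near misses, such as an alert that was actionable once in dozens of firings, which still need a human verdict on whether they belong in a ticket queue or on a dashboard.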

Where AI helps with alert design

This is a great use of AI because it’s pure reasoning over text — no production access required. Paste your alerting rules and a month of firing history and ask:

“Here are my Prometheus alerting rules and the firing history for the last 30 days. For each alert, classify it as page / ticket / dashboard based on how often it fired and whether it was actionable. Flag any cause-based alerts that should be replaced with a symptom-based equivalent, and draft the replacement rule using these metric names: [your real metric names].”

The model is good at spotting that you’re paging on five different causes that all manifest as the same symptom — and suggesting you collapse them into one symptom alert. Give it your real metric names so it doesn’t invent PromQL. We keep a set of alert-tuning prompts for this kind of audit.

You can also have it draft the runbook stub for each surviving alert, so every page links to “here’s what this means and the first read-only command to run.”

De-duplication and grouping

Once your alerts are symptom-based, the last layer of noise is the storm: one root cause lighting up twenty downstream alerts at once. Handle this at the routing layer, not by deleting alerts:

Group related alerts into a single notification (by service, by cluster).
Inhibit downstream alerts when an upstream one is firing — if the database is down, don’t also page for every service that depends on it.
Throttle re-notification so a single ongoing issue doesn’t re-page every 30 seconds.

A well-configured Alertmanager (or equivalent) turns “23 pages in two minutes” into “one page that says the database is down.”
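Alertmanager expresses grouping and inhibition as routing configuration rather than code, but the underlying logic is simple enough to sketch. The example below is a toy model, not Alertmanager's implementation: `FiringAlert`, its `depends_on` field, and the service names are all hypothetical, and re-notification throttling is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FiringAlert:
    # Hypothetical shape for a currently firing alert.
    name: str
    service: str
    depends_on: Optional[str] = None  # upstream service, if any


def inhibit(alerts: List[FiringAlert]) -> List[FiringAlert]:
    """Drop alerts whose upstream dependency is itself firing."""
    firing_services = {a.service for a in alerts}
    return [a for a in alerts if a.depends_on not in firing_services]


def group_by_service(alerts: List[FiringAlert]) -> Dict[str, List[str]]:
    """Collapse the surviving alerts into one notification per service."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for a in alerts:
        groups[a.service].append(a.name)
    return dict(groups)


if __name__ == "__main__":
    # One database outage plus 22 dependent services erroring: 23 alerts.
    storm = [FiringAlert("DatabaseDown", "database")] + [
        FiringAlert(f"Service{i}Errors", f"service-{i}", depends_on="database")
        for i in range(22)
    ]
    notifications = group_by_service(inhibit(storm))
    print(len(storm), "alerts ->", len(notifications), "notification(s):", notifications)
```

Note that nothing is deleted: if the database alert clears while a dependent service is still erroring, that service's alert is no longer inhibited and pages on its own.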

What good looks like

After a real cleanup, a healthy week on-call looks like a handful of pages, every one of which was worth waking up for. The engineer trusts that when the pager goes off, it’s real — so they respond fast instead of groggily debating whether to bother.

That trust is the actual deliverable. You’re not optimizing a metric; you’re protecting the reflex that makes incident response work at all. If you want the structured version — paste your rules and firing history, get a page/ticket/dashboard classification and symptom-based rewrites — that’s part of what we built the AI Incident Response Assistant for.

Generated alert classifications and rules are assistive, not authoritative. Always validate tuning against your own SLOs and traffic before rolling it out.
