Reducing Alert Fatigue With the USE and RED Methods

Every burned-out on-call I’ve worked with had the same root cause: hundreds of alerts wired to causes — a CPU spike, a queue depth, a single pod restart — most of which never affected a user. The cure isn’t a better notification tool. It’s two methods that shrink your alerting surface to a small, durable set of signals worth waking someone for: RED for services and USE for resources. Here’s how I apply them in Prometheus.

Why cause-based alerting fails

A CPU at 95% is not a problem. A queue at 10,000 is not a problem. A pod restart is not a problem. They become problems only when they cause user-visible harm — and most of the time they don’t. Alerting on every cause means alerting on a hundred things that are usually fine, which trains your team to ignore the pager. The whole game is to alert on symptoms (is the service hurting users?) and use causes for diagnosis, not paging.

RED: alert on services like a user feels them

RED covers anything that serves requests — APIs, web services, gRPC backends:

Rate — requests per second
Errors — failed requests per second
Duration — latency distribution

These three map directly to “is the service working?” Here’s the PromQL for each:

# Rate
sum by (service) (rate(http_requests_total[5m]))

# Errors (as a ratio — alert on this)
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

# Duration (p99)
histogram_quantile(0.99,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

You page on error ratio and p99 latency crossing thresholds tied to user pain — not on rate. Rate is context for the other two, not an alert on its own. A traffic spike isn’t an incident; a traffic spike that breaks the error ratio is.
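As a rough sketch, a p99 latency page could look like the rule below. The group name, alert name, 0.5-second threshold, and 5m hold are placeholders I've assumed for illustration, not recommendations. An error-ratio page has the same shape, with the ratio query above as its expression.

```yaml
groups:
  - name: red-symptoms            # hypothetical group name
    rules:
      - alert: HighLatencyP99     # hypothetical alert name
        # 0.5 (seconds) stands in for whatever latency your users actually notice.
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        # Require the condition to hold for 5 minutes so one slow scrape doesn't page.
        for: 5m
        labels: { severity: page }
```

Because the query groups by service, each service gets its own alert instance, so the page says which service is slow.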

USE: diagnose resources, rarely page on them

USE covers every resource — CPU, memory, disk, network, connection pools:

Utilization — fraction of time the resource is busy
Saturation — how much work is queued/waiting for it
Errors — error events from the resource

# Utilization (CPU)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

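# Saturation (run-queue pressure via load, normalized per CPU core)
# on (instance) is needed because node_load5 carries labels (such as job) that the count result drops
node_load5 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"})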

# Errors (NIC receive errors shown as the example; disk errors follow the same rate() pattern)
rate(node_network_receive_errs_total[5m])

The key insight: USE metrics are mostly for the dashboard you open after a RED alert fires, not for paging. Saturation is the exception worth paging on — a saturated resource is actively causing queuing right now. Utilization alone almost never deserves a page.
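A saturation alert might be sketched like this, assuming node_exporter metrics. The alert name, the threshold of 2 runnable tasks per core, and the 15m hold are assumptions to tune against your own hosts. Whether it pages or opens a ticket depends on whether the saturation is hurting users.

```yaml
groups:
  - name: use-saturation            # hypothetical group name
    rules:
      - alert: NodeCpuSaturated     # hypothetical alert name
        # Load per core above 2 means roughly twice as many runnable tasks as CPUs.
        # on (instance) matches the two sides on the instance label only.
        expr: |
          node_load5
            / on (instance)
          count by (instance) (node_cpu_seconds_total{mode="idle"})
          > 2
        # A long hold filters out short bursts such as cron jobs or deploys.
        for: 15m
        labels: { severity: page }
```

On Linux the load average also counts tasks in uninterruptible I/O wait, so a high value can point at disk pressure rather than CPU. Treat it as a starting signal for diagnosis.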

Turn RED into burn-rate alerts, not threshold alerts

A static “error ratio > 1% for 5m” still pages too often and too late. The SLO-native upgrade is burn-rate alerting: page based on how fast you’re consuming your error budget, with a fast-burn and a slow-burn rule:

groups:
  - name: red-burn-rate
    rules:
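      # Fast burn: page now, budget evaporating.
      # 0.001 is the error budget (1 - SLO target, assumed here to be 99.9%); 14.4 is the burn-rate multiplier.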
      - alert: HighErrorBurnFast
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
           / sum(rate(http_requests_total[5m]))) > (14.4 * 0.001)
        for: 2m
        labels: { severity: page }
      # Slow burn: ticket, a real but gradual leak
      - alert: HighErrorBurnSlow
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[1h]))
           / sum(rate(http_requests_total[1h]))) > (3 * 0.001)
        for: 1h
        labels: { severity: ticket }

Fast burn pages a human; slow burn opens a ticket. This one change cut my page volume more than any dashboard ever did, because it ignores brief blips that self-heal and only wakes someone when the budget is genuinely at risk. The deeper SLO mechanics are covered in our SLOs and error budgets guide.
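A commonly cited refinement, described in Google's SRE Workbook, pairs a long window with a short one so the alert fires only when the burn is both sustained and still happening. The sketch below assumes a 99.9% SLO over 30 days and reuses the metric names from the rules above. The alert name is hypothetical.

```yaml
groups:
  - name: red-burn-rate-multiwindow
    rules:
      # 14.4 = 2% of a 30-day budget spent in one hour (0.02 * 720h / 1h).
      # 0.001 = error budget of the assumed 99.9% SLO.
      - alert: HighErrorBurnFastMultiWindow   # hypothetical alert name
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels: { severity: page }
```

The 1h window shows that a meaningful slice of budget is already gone. The 5m window lets the alert resolve soon after the errors stop, so nobody is paged for an incident that has already healed.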

Route by severity so the right alerts wake people

Methods give you good signals; routing keeps the noisy ones off the pager. Split paging from ticketing in Alertmanager:

route:
  receiver: tickets
  routes:
    - matchers: [ severity="page" ]
      receiver: pagerduty
    - matchers: [ severity="ticket" ]
      receiver: slack-warnings

RED burn-rate fast → page. USE saturation → page if it’s actively harming. Everything else → ticket or a Slack channel nobody gets woken by. Wire this through a deliberate monitoring alert pipeline and the pager goes quiet without going blind.
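The route tree above refers to three receivers by name, and Alertmanager rejects a config whose routes point at undefined receivers. A minimal sketch of the matching receivers section follows. The key, webhook URLs, and channel name are placeholders I've made up, and routing tickets through a webhook is only one possible choice.

```yaml
receivers:
  - name: pagerduty
    pagerduty_configs:
      # Placeholder: the integration key from your PagerDuty service.
      - routing_key: REPLACE_WITH_PAGERDUTY_KEY
  - name: slack-warnings
    slack_configs:
      # Placeholder Slack webhook URL and channel.
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#alerts-warnings'
  - name: tickets
    webhook_configs:
      # Hypothetical internal bridge that turns alerts into tickets.
      - url: http://ticket-bridge.example.internal/alerts
```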

Inhibition: stop the alert storm

When a node dies, you don’t want fifty pod alerts plus a node alert. Alertmanager inhibition suppresses the symptoms once the root cause fires:

inhibit_rules:
  - source_matchers: [ alertname="NodeDown" ]
    target_matchers: [ severity="page" ]
    equal: [ instance ]

One node-down page instead of a screen full of consequences. This alone fixes a huge chunk of “alert storm during an outage” fatigue.
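The inhibit rule only works if an alert named NodeDown exists and carries the same instance label as the alerts it should silence. A minimal sketch of such a source alert, assuming node_exporter is scraped under a job called node (a hypothetical job name):

```yaml
groups:
  - name: node-health             # hypothetical group name
    rules:
      - alert: NodeDown
        # up is 0 when Prometheus fails to scrape the target.
        # job="node" is an assumption; use your own scrape job name.
        expr: up{job="node"} == 0
        # Wait 5 minutes so a single missed scrape doesn't count as a dead node.
        for: 5m
        labels: { severity: page }
```

Check that the target alerts really share the instance value. Pod-level alerts often identify their host through a different label (for example node), in which case the equal list has to name that label or the labels have to be aligned through relabeling before inhibition can match.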

An audit you can run this week

To shrink an existing noisy setup:

List every alert that paged in the last 30 days. For each, ask: did a user feel this, or was it a cause?
Delete or downgrade every cause-based pager to a ticket. CPU, memory-pressure, queue-depth, single-pod-restart — tickets, not pages.
Keep only the RED symptoms (error ratio, latency) and USE saturation that is actively hurting users as pages.
Convert RED pages to burn-rate with fast/slow tiers.
Add inhibition so root causes suppress symptoms.

A team that does this routinely drops from dozens of nightly pages to a handful of meaningful ones. The goal isn’t fewer alerts for their own sake — it’s that every page means a human is needed, which is the only way on-call stays sustainable.

Thresholds and burn-rate multipliers depend on your SLO targets and traffic. Validate every rule against your own data before paging on it.