Cutting Time-to-Acknowledge With AI Alert Enrichment

It’s 3:14 a.m. and my phone is buzzing with HighErrorRate on api-gateway. I’m not awake enough to know if this is a real incident or the same flaky probe that’s paged me twice this month. So I do what every tired on-call does: I squint at the alert, decide it’s probably nothing, and snooze it for ten minutes to see if it clears. That hesitation — the gap between the page firing and me actually committing to work the incident — is time-to-acknowledge, and it’s a slice of MTTR almost nobody measures.

The reason TTA balloons is that raw alerts are context-free. A bare threshold breach forces the human to go gather context before they’ll trust it. Enrich the alert with that context up front and acknowledgment becomes a reflex instead of a deliberation.

A bare alert makes you do the work

Here’s what a typical Alertmanager payload looks like when it lands:

[FIRING:1] HighErrorRate (api-gateway production)
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05

That tells me a number crossed a line. It tells me nothing about whether I should care right now. To decide, I have to open Grafana, check if the error rate is climbing or already recovering, look at whether a deploy just went out, and confirm it’s hitting real users and not a single bad pod. That’s three to five minutes of investigation just to decide to press “ack.” Multiply that across every page and you’ve found a quiet, recurring drain on your MTTR funnel.

Put the context on the page

The fix is to attach a short enrichment block to the notification itself, generated the moment the alert fires. You already have the data sources — recent deploys, the error-rate trend, the affected service’s owner, related firing alerts. The job is correlating them fast, and that’s exactly the boring reading-and-summarizing work AI is good at.

I run a webhook receiver that sits between Alertmanager and the pager. On each firing alert it pulls a few facts and asks a model to summarize them. The query side is plain PromQL:

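# Is the error rate climbing or recovering? (positive = worsening)
# -G with --data-urlencode stops curl from globbing the {} and [] in the PromQL;
# sum() collapses the per-pod series so result[0] is the whole service.
curl -sG "http://prom:9090/api/v1/query" \
  --data-urlencode "query=deriv(sum(rate(http_requests_total{job='api-gateway',status=~'5..'}[5m]))[10m:])" \
  | jq -r '.data.result[0].value[1]'

# Any deploy in the last 15 minutes? rollout history lists revisions without
# timestamps, so also check the age of the newest ReplicaSets.
kubectl rollout history deploy/api-gateway -n production | tail -3
kubectl get rs -n production --sort-by=.metadata.creationTimestamp | tail -3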
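Inside the receiver, the number that jq prints still has to become a word the prompt can use. Here is a minimal sketch of that step, assuming the standard Prometheus instant-query response shape. The function name and the `flat_band` threshold are my own placeholders, not part of any library, and you would tune the band to your traffic.

```python
import json
from typing import Optional, Tuple

def trend_from_response(body: str, flat_band: float = 1e-4) -> Tuple[str, Optional[float]]:
    """Turn the deriv() query response into 'worsening', 'recovering', 'flat', or 'unknown'."""
    data = json.loads(body)
    results = data.get("data", {}).get("result", [])
    if data.get("status") != "success" or not results:
        # No series matched or the query failed: say so instead of guessing.
        return ("unknown", None)
    # Instant-vector samples arrive as [unix_timestamp, "value-as-string"].
    slope = float(results[0]["value"][1])
    if slope != slope:  # NaN never equals itself
        return ("unknown", None)
    if slope > flat_band:   # flat_band is a hypothetical noise threshold
        return ("worsening", slope)
    if slope < -flat_band:
        return ("recovering", slope)
    return ("flat", slope)
```

Returning "unknown" rather than defaulting to "flat" matters here: an empty result usually means the query matched nothing, which is itself worth telling the on-call.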

Feed those facts, plus the alert labels, into a tightly scoped prompt. The instruction matters: I want a verdict and the evidence behind it, never an order to act.

You are enriching a production alert for an on-call engineer. Given the alert labels, the error-rate trend (positive = worsening), recent deploys, and any co-firing alerts, write a 3-line enrichment: (1) one-sentence plain-English summary, (2) a “likely real / likely noise” lean with the single fact that drove it, (3) the most relevant recent change. Do not recommend actions. Do not state anything not supported by the data provided.
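The instruction above is only half of the request; the other half is the block of facts the model is allowed to draw on. A rough sketch of how that block might be assembled is below. `Fact`, `build_user_message`, and the cap of five are illustrative names and defaults I chose, not anything a model API requires.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Fact:
    label: str   # e.g. "error-rate trend"
    value: str   # what was observed
    source: str  # the query or command that produced it, so it can be re-run

def build_user_message(labels: Dict[str, str], facts: List[Fact], max_facts: int = 5) -> str:
    """Render alert labels plus a capped list of sourced facts as plain text."""
    lines = ["ALERT LABELS:"]
    for key in sorted(labels):
        lines.append(f"  {key}={labels[key]}")
    lines.append("FACTS (use only these):")
    # Keep only the first max_facts entries; callers should order by relevance.
    for i, fact in enumerate(facts[:max_facts], start=1):
        lines.append(f"  {i}. {fact.label}: {fact.value} [source: {fact.source}]")
    if not facts:
        lines.append("  (none gathered; say so rather than guessing)")
    return "\n".join(lines)
```

Carrying the source next to each fact means the summary can cite it, and an empty fact list is stated explicitly so the model has no gap to fill with invention.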

The output that lands in Slack looks like this:

Summary: 5xx rate on api-gateway is at 7% and rising over the last 8 minutes, affecting the /checkout path.
Lean: Likely real. Error rate is climbing, not flat, and a deploy landed 6 minutes ago.
Recent change: api-gateway rolled out v2.41.0 at 03:08 (6 min before this alert).

I read that in five seconds and I’m acking with confidence. The enrichment didn’t decide anything for me — it just did the gathering I would have done by hand, so my judgment has something to work with immediately.

Keep the human as the decision-maker

The temptation once you have AI summarizing alerts is to let it auto-resolve the ones it calls “noise.” Don’t. The enrichment’s job is to make acknowledgment fast, not to make the acknowledge decision for you. A model that’s 95% right at classifying noise will, on the 5%, silently swallow the one page that mattered. Surface the lean, show the evidence, and let the on-call press the button.

What you can automate safely is the gathering and the routing of the summary. Wire it into your Alertmanager config so the enrichment travels with the page:

receivers:
  - name: enriched-pager
    webhook_configs:
      - url: http://alert-enricher.internal:8080/enrich
        send_resolved: true

The enricher does its correlating, posts the summary to the incident channel, and only then forwards to PagerDuty. The human never sees a naked threshold breach. The send_resolved: true matters too — when the alert clears, the enrichment that travels with the resolution tells the on-call whether it self-healed or whether someone’s mitigation took effect, which saves a second round of “is it actually over?” investigation.
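Because send_resolved is on, the same endpoint receives both firing and resolved notifications, and one webhook body can carry several alerts. A small sketch of the first thing the handler might do, assuming the standard Alertmanager webhook JSON where each entry in `alerts` has a `status` of "firing" or "resolved" (the function name is my own):

```python
from typing import Any, Dict, List, Tuple

def split_alerts(payload: Dict[str, Any]) -> Tuple[List[dict], List[dict]]:
    """Separate a webhook payload into (firing, resolved) alert lists."""
    firing: List[dict] = []
    resolved: List[dict] = []
    for alert in payload.get("alerts", []):
        if alert.get("status") == "firing":
            firing.append(alert)
        else:
            resolved.append(alert)
    return firing, resolved
```

Firing alerts go down the enrich-then-page path; resolved ones only need the shorter "how did it clear" summary posted to the channel.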

Measure the slice you’re attacking

If you’re going to spend effort here, instrument it. Record two timestamps per incident: when the alert fired and when a human acknowledged. The delta is TTA. Before enrichment my median TTA was dominated by “is this real?” deliberation; after, that deliberation shrinks to the five seconds it takes to read the summary. Watch the whole distribution, not just the median: enrichment helps most on the ambiguous middle-of-the-night pages where you’d otherwise snooze and hope.
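Computing this takes very little code once the two timestamps are exported from your paging tool. A minimal sketch, assuming you already have (fired, acknowledged) pairs as datetimes; the function names are mine, and the choice of median, p95, and max is just one reasonable way to look at the tail.

```python
from datetime import datetime
from statistics import median, quantiles
from typing import Dict, List, Tuple

def tta_seconds(incidents: List[Tuple[datetime, datetime]]) -> List[float]:
    """Time-to-acknowledge in seconds for each (fired, acknowledged) pair."""
    return [(acked - fired).total_seconds() for fired, acked in incidents]

def summarize_tta(ttas: List[float]) -> Dict[str, float]:
    """Median, 95th percentile, and worst case of a list of TTA values."""
    if len(ttas) < 2:
        raise ValueError("need at least two acknowledged incidents")
    ordered = sorted(ttas)
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    cuts = quantiles(ordered, n=20, method="inclusive")
    return {"median": median(ordered), "p95": cuts[-1], "max": ordered[-1]}
```

Run it on the weeks before and after you turn enrichment on and compare all three numbers; a median that barely moves alongside a p95 that drops is the pattern to look for.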

A few things that keep enrichment honest:

Cap the context window. Pull the five most relevant facts, not everything. A wall of metrics is just a different flavor of context-free, and a page that takes longer to read than to investigate has defeated its own purpose.
Always show the source. Every claim in the summary should map to a query the on-call can re-run. If they can’t verify it, they won’t trust it, and TTA creeps back up.
Fail open. If the enricher times out, the raw page still goes through. Never let the nice-to-have block the must-have.
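The fail-open rule is the one most worth pinning down in code. Below is a sketch of the shape I have in mind: `enrich` and `forward` are hypothetical callables standing in for your model call and your pager integration, and the five-second budget is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional

def page_with_enrichment(
    alert: Dict[str, Any],
    enrich: Callable[[Dict[str, Any]], str],
    forward: Callable[[Dict[str, Any], Optional[str]], None],
    timeout_s: float = 5.0,
) -> bool:
    """Forward the alert with a summary if one arrives in time, else forward it raw.

    Returns True when the page went out enriched.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(enrich, alert)
        try:
            summary: Optional[str] = future.result(timeout=timeout_s)
        except Exception:
            # Timeout or any enricher failure: fall back to the raw page.
            summary = None
    finally:
        # Do not wait for a slow enricher; the page must not be held up.
        pool.shutdown(wait=False)
    forward(alert, summary)
    return summary is not None
```

The forward call sits outside every try block on purpose, so no enrichment failure can skip it. The boolean return is there so you can count how often pages go out bare.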

You can prototype the whole loop against our free incident assistant before you build your own receiver — paste an alert and the facts you’d gather, and see what a good enrichment reads like. And if you want a starting point for the summarizer prompt, the prompt library has versions tuned for exactly this verify-first framing.

Time-to-acknowledge is the cheapest slice of MTTR to win because the data already exists — you’re just moving the work of correlating it from a half-asleep human at 3 a.m. to a model that did it the instant the alert fired. The human still decides. They just decide faster, with their eyes open.
