Building an AI Alert Triage Bot That Routes to the Right Slack Channel

At 3:14 one morning I watched our #alerts channel scroll past forty messages in under a minute. A noisy disk-space warning, a genuine database failover, three flapping health checks, and somewhere in the middle, buried, the one alert that actually mattered. Nobody was paged correctly because everything dumped into the same firehose, and the on-call engineer — me — was left grepping a Slack channel with bleary eyes trying to figure out which of these was the real fire. That morning is why I built an alert triage bot. Not to handle incidents, but to do the one boring, high-value thing a human is bad at while half-asleep: read each alert, decide how bad it is and who owns it, and route it to the channel where the right people are watching.

The interesting part is that the classification step is a genuinely good fit for an LLM, and the routing step is genuinely dangerous to fully automate. So the design is half model, half guardrail.

The shape of the system

The flow is simple on paper:

Your monitoring system (Prometheus Alertmanager, Datadog, whatever) fires a webhook into a Bolt app.
The app verifies the webhook signature before doing anything else.
It extracts the alert payload and asks an LLM to classify it: severity, owning service, likely owner team.
It posts the alert into the routed channel, formatted, with the classification attached — and a human confirms before anything escalates to a page.

The model is the classifier. It never touches your infrastructure, never holds a credential, never decides on its own to page someone at 3 AM. Think of it as the fast junior engineer triaging the queue: quick reads, decent judgment, and a human reviews the call before it becomes an action with consequences.
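The four steps above can be sketched as one function with each stage passed in as a callable. Every name here (`handle_webhook`, `verify`, `classify`, `post`, and the `message` field in the payload) is a hypothetical stand-in, not part of Bolt or any monitoring product. The point is only the ordering, and that nothing on this path pages anyone.

```python
import json
from typing import Callable, Optional


def handle_webhook(
    raw_body: bytes,
    headers: dict,
    verify: Callable[[bytes, dict], bool],
    classify: Callable[[str], dict],
    post: Callable[[dict, str], None],
) -> Optional[dict]:
    """Run one alert through verify -> extract -> classify -> post.

    Returns the classification, or None if the request was rejected.
    """
    # Step 1: authenticate before parsing or classifying anything.
    if not verify(raw_body, headers):
        return None

    # Step 2: extract the alert text. The "message" field is an assumed
    # payload shape; real monitoring payloads differ.
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return None
    if not isinstance(payload, dict):
        return None
    alert_text = str(payload.get("message", ""))

    # Step 3: the model only classifies.
    classification = classify(alert_text)

    # Step 4: post for human review. No paging happens on this path.
    post(classification, alert_text)
    return classification
```

Passing the stages in as callables keeps the ordering testable without a live Slack workspace or model call.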

Verify the signature first — always

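Before any of the AI machinery runs, the inbound request has to be proven authentic. If your bot acts on unverified webhooks, anyone who learns your endpoint URL can inject fake alerts or, worse, prompt-inject your classifier. Slack signs every request it sends with an HMAC-SHA256 signature over the string v0:{timestamp}:{raw body}, keyed with your signing secret, and the check below follows that scheme. For the monitoring webhook, apply the same pattern with whatever shared secret or signature scheme your monitoring system supports. You verify it yourself: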

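import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, request_body, timestamp, slack_signature):
    # Reject missing or malformed headers instead of raising
    if not timestamp or not slack_signature:
        return False
    try:
        request_time = int(timestamp)
    except (TypeError, ValueError):
        return False

    # Reject anything older than 5 minutes: replay protection
    if abs(time.time() - request_time) > 60 * 5:
        return False

    # Sign the raw bytes exactly as received. Interpolating a bytes body
    # into an f-string would sign "b'...'" and never match.
    if isinstance(request_body, str):
        request_body = request_body.encode("utf-8")
    basestring = b"v0:" + str(timestamp).encode("utf-8") + b":" + request_body
    computed = "v0=" + hmac.new(
        signing_secret.encode("utf-8"),
        basestring,
        hashlib.sha256,
    ).hexdigest()

    # Constant-time comparison; never use ==
    return hmac.compare_digest(
        computed.encode("utf-8"), slack_signature.encode("utf-8")
    )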

The pieces that matter: the v0: prefix, the x-slack-request-timestamp header (which you bound against replay), the raw unparsed body, and hmac.compare_digest so you don’t leak timing information. Slack’s own Bolt SDK does this for you when requests come from Slack, but if your monitoring system posts directly to your service, you own the verification for that inbound path. Never skip it, and never hand-roll the comparison with ==.
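To see why the raw, unparsed body matters, here is a small self-contained check. It signs a body using the v0 scheme described above, then shows that parsing and re-serializing the JSON produces different bytes and therefore a different signature. The secret, timestamp, and payload are made-up test values, and `sign` is a hypothetical test helper, not something Slack or Bolt provides.

```python
import hashlib
import hmac
import json


def sign(secret: str, timestamp: str, body: bytes) -> str:
    """Compute a v0 signature over 'v0:<timestamp>:<raw body>'."""
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    return "v0=" + hmac.new(secret.encode("utf-8"), base, hashlib.sha256).hexdigest()


secret = "test-signing-secret"  # made-up value for the example
timestamp = "1700000000"        # made-up value for the example

# The sender's bytes, including its own spacing choices.
raw_body = b'{"alert":"disk full","host":"db-1"}'
signature = sign(secret, timestamp, raw_body)

# Parsing and re-serializing changes the bytes (json.dumps adds a space
# after ':' and ','), so the signature no longer matches.
reserialized = json.dumps(json.loads(raw_body)).encode("utf-8")

raw_matches = hmac.compare_digest(sign(secret, timestamp, raw_body), signature)
reserialized_matches = hmac.compare_digest(
    sign(secret, timestamp, reserialized), signature
)
```

Here `raw_matches` is true and `reserialized_matches` is false, which is why the verification has to run on the bytes as they arrived, before any framework parses them.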

If you’d rather not run your own webhook receiver at all, a managed incident-response service can sit in front of this and hand you pre-verified, structured alerts.

Classifying the alert with an LLM

Here’s the core: a Bolt app that receives an alert and calls Claude to classify it. The key discipline is in what you send the model — the alert text and metadata, and nothing else. No tokens, no internal hostnames you wouldn’t want logged, no production credentials.

import os
import json
from anthropic import Anthropic
from slack_bolt import App

anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

ROUTING = {
    "checkout-svc": "C0CHECKOUT",
    "payments": "C0PAYMENTS",
    "data-platform": "C0DATAPLAT",
}
FALLBACK_CHANNEL = "C0TRIAGE"

CLASSIFY_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"type": "string", "enum": ["SEV-1", "SEV-2", "SEV-3", "noise"]},
        "service": {"type": "string"},
        "owner_team": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["severity", "service", "owner_team", "summary"],
    "additionalProperties": False,
}


def classify_alert(alert_text):
    response = anthropic.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024,
        thinking={"type": "adaptive"},
        system=(
            "You triage monitoring alerts. Classify severity, the owning "
            "service, and the likely owner team. Be conservative: when unsure, "
            "lower the severity and pick the fallback service. You never take "
            "action — a human reviews your classification."
        ),
        output_config={"format": {"type": "json_schema", "schema": CLASSIFY_SCHEMA}},
        messages=[{"role": "user", "content": alert_text}],
    )
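    # With thinking enabled, the first content block may be a thinking
    # block, so pick the text block explicitly instead of indexing [0].
    text = next(block.text for block in response.content if block.type == "text")
    return json.loads(text)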

A few things worth calling out. I’m using a structured-output schema so the model can’t drift into prose — I get back a typed object I can route on. I keep max_tokens modest because this is a classification task, not an essay. And the system prompt explicitly tells the model to be conservative and reminds it that it doesn’t take action. That last line isn’t decoration; framing the model’s role keeps its outputs calibrated.
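One way to enforce the "alert text and metadata, and nothing else" rule is to build the model input from an allowlist of fields and redact anything that looks like a secret. This is a minimal sketch: the field names (`alertname`, `severity`, `service`, `description`), the helper names, and the regex patterns are all assumptions for illustration. Pattern-based redaction will miss things, so treat it as a backstop and not a guarantee.

```python
import re

# Hypothetical allowlist: only these keys are ever forwarded to the model.
ALLOWED_FIELDS = ("alertname", "severity", "service", "description")

# Illustrative patterns only; tune them to the secrets your systems emit.
SECRET_PATTERNS = [
    re.compile(r"\bbearer\s+[A-Za-z0-9._\-]+", re.IGNORECASE),
    re.compile(
        r"\b(?:token|password|secret|api[_-]?key)\s*[=:]\s*\S+", re.IGNORECASE
    ),
]


def redact(text: str) -> str:
    """Replace anything matching a secret-like pattern with a marker."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def build_model_input(alert: dict, max_chars: int = 2000) -> str:
    """Build the text sent to the model from allowlisted fields only."""
    lines = []
    for field in ALLOWED_FIELDS:
        if field in alert:
            lines.append(f"{field}: {redact(str(alert[field]))}")
    # Cap the size so one oversized alert cannot balloon the prompt.
    return "\n".join(lines)[:max_chars]
```

Anything not named in the allowlist, such as an internal IP or a raw request header, never reaches the prompt, regardless of what the monitoring system includes in the payload.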

If you want to evaluate which model gives you the best triage accuracy for your alert mix, Claude and ChatGPT both handle this well — test on a sample of your real (sanitized) alerts before committing.
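A minimal sketch of that comparison, assuming you have a list of sanitized alerts with human-assigned severities. `evaluate` and the sample format are hypothetical; the classifier is passed in as a callable so the same harness can wrap any model. It reports recall per severity as well as overall accuracy, since a good overall number can hide misses on the rare, serious class.

```python
from collections import defaultdict
from typing import Callable, Iterable, Tuple


def evaluate(
    samples: Iterable[Tuple[str, str]],
    classify_severity: Callable[[str], str],
) -> dict:
    """Score a classifier on (alert_text, expected_severity) pairs."""
    total = 0
    correct = 0
    seen = defaultdict(int)  # expected severity -> number of samples
    hit = defaultdict(int)   # expected severity -> number classified correctly

    for alert_text, expected in samples:
        predicted = classify_severity(alert_text)
        total += 1
        seen[expected] += 1
        if predicted == expected:
            correct += 1
            hit[expected] += 1

    return {
        "accuracy": correct / total if total else 0.0,
        "recall_by_severity": {sev: hit[sev] / seen[sev] for sev in seen},
    }


# Toy data and a stub classifier, purely to show the call shape.
samples = [
    ("disk at 81% on build-3", "SEV-3"),
    ("disk at 83% on build-4", "SEV-3"),
    ("health check flapping on web-2", "noise"),
    ("primary database failover in progress", "SEV-1"),
]
report = evaluate(samples, lambda text: "SEV-3")
```

With the stub that always answers SEV-3, overall accuracy on this toy set is 0.5 while SEV-1 recall is 0.0, which is the kind of gap the per-severity breakdown is there to expose.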

Routing, with a human in the loop

Now the dangerous half. Classification is cheap to get wrong: a misrouted SEV-3 is annoying, nothing more. Acting on a classification is where you need a person. So the bot posts the routed message with the classification visible and interactive buttons that let the on-call engineer either confirm and page, or reclassify.

def route_alert(alert_text):
    result = classify_alert(alert_text)
    channel = ROUTING.get(result["service"], FALLBACK_CHANNEL)

    app.client.chat_postMessage(
        channel=channel,
        text=f"[{result['severity']}] {result['summary']}",  # fallback text
        blocks=[
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f"*{result['severity']}* — {result['summary']}\n"
                        f"*Service:* {result['service']}  "
                        f"*Owner:* {result['owner_team']}"
                    ),
                },
            },
            {
                "type": "actions",
                "elements": [
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Confirm & Page"},
                        "style": "danger",
                        "action_id": "confirm_page",
                    },
                    {
                        "type": "button",
                        "text": {"type": "plain_text", "text": "Reclassify"},
                        "action_id": "reclassify",
                    },
                ],
            },
        ],
    )

The Confirm & Page button is the boundary. The model suggests a SEV-1 page; a human presses the button that actually wakes someone up. This is the whole philosophy in one interaction: the AI does the fast, tireless triage work, and the irreversible action stays behind a human decision. The on-call engineer can reclassify in two seconds if the model got it wrong, and that correction is far cheaper than an erroneous 3 AM page.
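The decision behind those two buttons can be kept in a small pure function, which makes the boundary easy to test. This is a sketch under assumptions: `handle_button`, its status strings, and the `page` callable are hypothetical, and the paging call itself is left to whatever on-call tool you use. In a Bolt app it would be called from the handlers registered for the `confirm_page` and `reclassify` action IDs.

```python
from typing import Callable, Optional


def handle_button(
    action_id: str,
    alert_id: str,
    user_id: Optional[str],
    page: Callable[[str, str], None],
) -> str:
    """Decide what a button click does. Returns a short status string."""
    if action_id == "reclassify":
        # Reclassifying never pages; the alert goes back for review.
        return "needs_reclassification"

    if action_id == "confirm_page":
        # A page requires an identifiable human behind the click.
        if not user_id:
            return "rejected_no_user"
        page(alert_id, user_id)
        return "paged"

    # Unknown action IDs are ignored and never trigger a page.
    return "ignored_unknown_action"
```

The only path that reaches `page` is a `confirm_page` click with a user attached, so the model's classification alone can never wake anyone up.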

Pro Tip: Log every model classification alongside the human’s final decision. After a couple of weeks you’ll have a labeled dataset of where the model and your team disagree. That’s gold for tuning the system prompt, because it shows exactly which alert types the model still gets wrong.
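A minimal sketch of such a log, assuming an append-only JSON Lines file is enough to start with. The function names, the record fields, and the file location are all hypothetical; a database table would work just as well.

```python
import json
import time
from pathlib import Path


def log_decision(path, alert_id, model_severity, human_severity):
    """Append one model-vs-human record as a JSON line."""
    record = {
        "ts": time.time(),
        "alert_id": alert_id,
        "model_severity": model_severity,
        "human_severity": human_severity,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def disagreements(path):
    """Return the logged records where the human overrode the model."""
    overridden = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if record["model_severity"] != record["human_severity"]:
                overridden.append(record)
    return overridden
```

Calling `log_decision` from the confirm and reclassify paths, then reviewing `disagreements(...)` periodically, gives you the list of alerts to read when revising the prompt.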

Events API vs Socket Mode

Two ways to wire the Slack side. The Events API has Slack POST to a public HTTPS endpoint you host — you verify the x-slack-signature and x-slack-request-timestamp on every request, which is the verification we did above. Socket Mode opens an outbound WebSocket from your app to Slack, so you don’t expose a public endpoint at all; Slack handles transport auth via an app-level token.

For a triage bot that also receives webhooks from your monitoring system, I lean toward a hosted endpoint anyway (you need it for the monitoring webhooks), so the Events API keeps things consistent. If your bot is purely Slack-internal and you’d rather not run public infrastructure, Socket Mode is cleaner. Either way, the signature discipline on the monitoring webhook is yours to own.

Why the human-in-the-loop matters more than it seems

It’s tempting, once the classifier is hitting 90%+ accuracy on your alerts, to just let it page directly. Resist. The 10% it gets wrong isn’t random — it’s the novel, ambiguous, genuinely-bad incidents that don’t match training patterns, which is precisely the category where a wrong call costs the most. The model is your fast junior engineer; you don’t give the junior the power to wake the whole team without a senior glancing at the ticket first. Pair this with your broader monitoring and alerting setup so the human review step is a natural part of the flow, not a bottleneck.

Wrapping Up

An AI alert triage bot earns its keep by doing the boring, repetitive reading that humans do badly under pressure: parse each alert, judge severity and ownership, and route it where the right eyes are. Verify the webhook signature before you trust a single byte, keep production tokens and secrets entirely out of the model’s context, and put a human between the classification and any action with consequences. Treat the LLM as the quick, capable junior it is — fast and useful, reviewed before it ships — and your #alerts firehose turns into something an on-call engineer can actually act on at 3 AM.
