AI-Assisted Runbook Selection: Routing Alerts to the Right

The first time I watched a junior engineer freeze during an incident, it wasn’t because they couldn’t fix the problem. It was because they couldn’t find the fix. We had 214 runbooks in a wiki, half of them stale, and the alert that fired at 2:47 AM said KubePodCrashLooping with a namespace they’d never heard of. The runbook existed. They just had no path from the alert to the page. That gap — between “an alert fired” and “here’s the procedure” — is where most of your incident latency actually lives, and it’s exactly the kind of fast, boring matching work that AI is genuinely good at. Not fixing the problem. Pointing at the right page.

The problem isn’t writing runbooks, it’s retrieving them

Teams obsess over authoring runbooks and then dump them into a flat wiki with titles like “DB stuff” and “Network fixes (old).” Retrieval is keyword search and tribal memory. When you have twenty runbooks, a human remembers them. At two hundred, nobody does, and the on-call engineer ends up grepping Confluence at 3 AM.

The reframe that helped me: treat runbook selection as a routing problem. An alert payload is structured-ish data. A runbook has metadata. The job is to match one to the other with a confidence score, surface the top candidates with reasoning, and let a human confirm before anything happens. This is the same shape of work I described in identifying and eliminating toil with AI — the AI does the tedious lookup, the human keeps the decision.

Give every runbook structured metadata

You can’t route to runbooks the model can’t see. Step one is a small, consistent metadata header on each runbook. I keep an index file separate from the prose so it’s cheap to load and embed.

# runbooks/index.yaml
- id: rb-pod-crashloop
  title: "Pod CrashLoopBackOff remediation"
  summary: "Diagnose and recover pods stuck restarting due to OOM, bad image, or failed probes."
  signals: ["KubePodCrashLooping", "CrashLoopBackOff", "container restart count high"]
  scope: ["kubernetes", "workloads"]
  services: ["any"]
  severity_hint: "sev3"
  auto_safe: false        # may this ever be auto-executed?
  path: "runbooks/k8s/pod-crashloop.md"

- id: rb-pg-replica-lag
  title: "Postgres replica lag recovery"
  summary: "Replica falling behind primary; check WAL shipping, slot retention, disk."
  signals: ["PostgresReplicationLag", "pg_replication_lag_seconds"]
  scope: ["database", "postgres"]
  services: ["orders-db", "billing-db"]
  severity_hint: "sev2"
  auto_safe: false
  path: "runbooks/db/pg-replica-lag.md"

The signals field matters most — it’s the bridge between alert names and human language. Write it the way alerts actually fire, not the way you wish they did.
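A header only helps if every runbook actually fills it in, so a small lint step is worth having. The sketch below is one way to check entries after the index has been loaded into a list of dicts; the function names `validate_entry` and `validate_index` and the required-field table are my own assumptions, derived from the two example entries above rather than from any fixed schema.

```python
# Field names and types are inferred from the example entries; adjust to your own header.
REQUIRED_FIELDS = {
    "id": str, "title": str, "summary": str, "signals": list,
    "scope": list, "services": list, "severity_hint": str,
    "auto_safe": bool, "path": str,
}

def validate_entry(rb: dict) -> list[str]:
    """Return a list of problems for one index entry (empty list means valid)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in rb:
            problems.append(f"missing field: {field}")
        elif not isinstance(rb[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if isinstance(rb.get("signals"), list) and not rb["signals"]:
        problems.append("signals is empty; the router has nothing to match on")
    return problems

def validate_index(index: list[dict]) -> dict[str, list[str]]:
    """Map runbook id -> problems, flagging duplicate ids as well."""
    report: dict[str, list[str]] = {}
    seen = set()
    for i, rb in enumerate(index):
        key = str(rb.get("id", f"<entry {i}>"))
        problems = validate_entry(rb)
        if key in seen:
            problems.append("duplicate id")
        seen.add(key)
        if problems:
            report.setdefault(key, []).extend(problems)
    return report
```

Running this in CI next to the re-embedding step would catch a quoted `"false"` in `auto_safe` or an empty `signals` list before it reaches the router.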

Embed the metadata, match on the alert payload

Semantic search beats keyword search here because alert names drift and runbooks get written in prose. I embed a concatenation of each runbook’s title + summary + signals, store the vectors, and at alert time embed the alert payload and pull the nearest neighbors.

import numpy as np
from anthropic import Anthropic  # chat client for the classifier step; embeddings come from a separate provider

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Precompute once, persist to disk / a vector store
def runbook_text(rb: dict) -> str:
    return f"{rb['title']}. {rb['summary']} Signals: {', '.join(rb['signals'])}"

def cosine(a, b) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0  # guard against zero vectors

def top_candidates(alert: dict, index: list[dict], vectors: dict, k=5) -> list[tuple[dict, float]]:
    """Return the k nearest runbooks as (runbook, similarity) pairs, best first."""
    query = f"{alert['alertname']} {alert.get('summary','')} " \
            f"namespace={alert.get('namespace','')} service={alert.get('service','')}"
    qv = embed(query)  # your embedding call; must use the same model that produced `vectors`
    scored = [(rb, cosine(qv, vectors[rb['id']])) for rb in index]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]

Embeddings give you recall — they’ll surface the right runbook even when the wording differs. What they won’t give you is judgment about which of the top five actually fits, or whether none of them do. That’s the next layer.

Pro Tip: Re-embed your index in CI whenever a runbook’s metadata changes, and stamp the embedding model version into the stored vectors. Mixing vectors from two model versions silently wrecks your similarity scores.
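One way to enforce that stamp is to store the model name next to the vectors and refuse to load a store built by a different model. This is a sketch under loud assumptions: `embed` here is a toy hashed bag-of-words stand-in so the example runs offline (it is not semantic and not any provider's API), and `EMBED_MODEL`, `build_store`, and `load_vectors` are hypothetical names.

```python
import hashlib
import math

EMBED_MODEL = "toy-hash-v1"  # hypothetical label; use your provider's model name and version
DIM = 64

def embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding call: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_store(index: list[dict]) -> dict:
    """Embed every runbook and stamp the store with the model that produced it."""
    vectors = {
        rb["id"]: embed(f"{rb['title']}. {rb['summary']} Signals: {', '.join(rb['signals'])}")
        for rb in index
    }
    return {"model": EMBED_MODEL, "dim": DIM, "vectors": vectors}

def load_vectors(store: dict, expected_model: str = EMBED_MODEL) -> dict:
    """Refuse to serve vectors produced by a different embedding model."""
    if store.get("model") != expected_model:
        raise ValueError(
            f"store built with {store.get('model')!r}, expected {expected_model!r}; re-embed"
        )
    return store["vectors"]
```

Failing loudly at load time is the point: a version mismatch becomes a CI error instead of a slow drift in match quality.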

Let an LLM classifier rank and explain — not decide

Embeddings hand you candidates. The LLM’s job is to read the full alert context plus the candidate summaries and produce a ranked shortlist with reasoning, so the human reviewer can audit the match in five seconds instead of reading five runbooks. Think of it as a fast junior engineer triaging: it proposes, it explains, it never executes.

import json

def classify(alert: dict, candidates: list[dict]) -> dict:
    # candidates are runbook dicts, e.g. [rb for rb, _ in top_candidates(...)]
    cand_block = "\n".join(
        f"- {c['id']}: {c['title']} - {c['summary']} (signals: {c['signals']})"
        for c in candidates
    )
    prompt = f"""You are triaging an alert. Pick the runbooks that apply.
Alert:
{json.dumps(alert, indent=2, default=str)}

Candidate runbooks:
{cand_block}

Return strict JSON:
{{"ranked": [{{"id": "...", "confidence": 0.0-1.0, "reason": "one sentence"}}],
  "no_match": false, "ambiguous": false, "notes": "..."}}
Order ranked from most to least confident.
Set no_match=true if none clearly apply. Set ambiguous=true if two are equally plausible."""
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=600,
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_json(resp.content[0].text)  # parse_json: your JSON-parsing helper

Notice the model returns confidence, no_match, and ambiguous as first-class fields. You want the classifier to be able to say “I don’t know” loudly, because the failure mode you fear is a confident wrong route.
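The classifier snippet leans on a `parse_json` helper that is not shown. Here is a minimal sketch of what it could look like, assuming you want any malformed output to collapse into a no-match rather than raise. The `known_ids` parameter and the `_fail_closed` helper are my additions, not part of the original design.

```python
import json
import re
from typing import Optional

def _fail_closed(note: str) -> dict:
    """A fresh no-match result, used whenever the output cannot be trusted."""
    return {"ranked": [], "no_match": True, "ambiguous": False, "notes": note}

def parse_json(text: str, known_ids: Optional[set] = None) -> dict:
    """Parse classifier output; anything malformed becomes a no-match."""
    # Grab the outermost {...} so code fences or chatter around the JSON are tolerated.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return _fail_closed("no JSON object in classifier output")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return _fail_closed("classifier output was not valid JSON")

    raw = data.get("ranked")
    ranked = []
    for item in raw if isinstance(raw, list) else []:
        if not isinstance(item, dict) or "id" not in item:
            continue
        if known_ids is not None and item["id"] not in known_ids:
            continue  # drop runbook ids the model invented
        try:
            conf = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            continue
        ranked.append({"id": item["id"],
                       "confidence": min(max(conf, 0.0), 1.0),  # clamp to [0, 1]
                       "reason": str(item.get("reason", ""))})
    ranked.sort(key=lambda r: r["confidence"], reverse=True)
    return {"ranked": ranked,
            "no_match": bool(data.get("no_match")) or not ranked,
            "ambiguous": bool(data.get("ambiguous")),
            "notes": str(data.get("notes", ""))}
```

Passing the candidate ids as `known_ids` means a hallucinated runbook id can never reach the on-call channel; the worst case is an empty shortlist, which reads as a no-match.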

Gate on confidence and ambiguity before you act

This is where most of the safety lives. The classifier’s output is a proposal, and a proposal needs a gate. I split into three lanes: confident single match, ambiguous, and no-match. Only the first lane is even eligible to surface a “one-click confirm” — and even then, a human owns the click. This is the same philosophy behind confidence-gated auto-remediation: the score decides how much friction to add, not whether to skip the human.

def route_decision(result: dict) -> dict:
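    # sort defensively: the gap check below assumes ranked[0] is the most confident
    ranked = sorted(result.get("ranked", []), key=lambda r: r["confidence"], reverse=True)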
    if result.get("no_match") or not ranked:
        return {"action": "page_human", "reason": "no runbook matched", "candidates": ranked}

    top = ranked[0]
    second = ranked[1]["confidence"] if len(ranked) > 1 else 0.0

    # ambiguous if two candidates are close
    if result.get("ambiguous") or (top["confidence"] - second) < 0.15:
        return {"action": "present_choices", "candidates": ranked[:3]}

    if top["confidence"] >= 0.80:
        # high confidence: surface ONE runbook, pre-filled, awaiting human confirm
        return {"action": "suggest_runbook", "runbook": top["id"],
                "confidence": top["confidence"], "require_confirm": True}

    return {"action": "present_choices", "candidates": ranked[:3]}

The decision boundaries are deliberately conservative. A 0.78-confidence match falls below the 0.80 bar, so it lands in the on-call channel as one of up to three choices rather than as a single suggestion. Even a 0.95 match doesn’t auto-anything; it shows up as “I think it’s rb-pod-crashloop. Confirm?” and waits. You don’t let a junior execute prod procedures unsupervised. You let them say “I think it’s this one” and then a human nods.
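To make the three lanes concrete, here is a rough sketch of how a decision could be rendered for the on-call channel. `format_decision` is a hypothetical helper; it assumes the decision dicts have the shape returned by `route_decision` above, and the message wording is only illustrative.

```python
def format_decision(alert: dict, decision: dict) -> str:
    """Render a route_decision result as on-call channel text."""
    name = alert.get("alertname", "unknown alert")
    action = decision.get("action")
    if action == "suggest_runbook":
        # single confident match: still phrased as a question for a human
        return (f"{name}: I think it's {decision['runbook']} "
                f"({decision['confidence']:.2f}). Confirm?")
    if action == "present_choices":
        lines = [f"{name}: no clear winner. Pick one:"]
        for i, c in enumerate(decision.get("candidates", []), start=1):
            lines.append(f"  {i}. {c['id']} ({c['confidence']:.2f}) {c.get('reason', '')}".rstrip())
        return "\n".join(lines)
    # page_human and anything unrecognized both fall through to a human
    return f"{name}: no runbook matched. Paging on-call."
```

Note that the unknown-action branch also pages a human. If the router ever emits something this formatter does not recognize, the safe default is more friction, not less.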

Handle the boring-but-deadly cases: ambiguous and no-match

The cases that hurt aren’t the clean matches; they’re the messy ones. An alert that maps to two runbooks because your services overlap. An alert for a brand-new system that has no runbook at all. If your router silently picks one, you’ve automated a wrong turn.

For ambiguity, present the top three with the model’s reasons inline and let the human pick — that pick is also your best training signal. For no-match, fail toward a human and capture it as a gap:

def on_no_match(alert, candidates):
    # candidates may be empty when the classifier returned nothing at all
    closest = ", ".join(f"{c['id']} ({c['confidence']:.2f})" for c in candidates[:3]) or "none"
    notify_oncall(
        channel="#incident",
        text=f"No confident runbook for {alert['alertname']} "
             f"in {alert.get('namespace', 'unknown namespace')}. Closest: {closest}",
    )
    log_runbook_gap(alert)   # feeds a backlog of runbooks to write

Every no-match is a TODO for a runbook that should exist. That gap log quietly becomes one of the most useful documents you own.

Close the loop: routing that gets better

A router that never learns will decay as fast as your wiki did. The fix is cheap: record every routing decision and its human outcome — confirmed, corrected, or rejected — and feed corrections back into the metadata.

def record_feedback(alert_id, proposed_id, chosen_id, action):
    # action is the human outcome: "confirmed", "corrected", or "rejected"
    store({"alert_id": alert_id, "proposed": proposed_id,
           "chosen": chosen_id, "correct": proposed_id == chosen_id,
           "action": action, "ts": now()})

When an engineer corrects a route, the alert’s signal phrasing is gold — append it to the chosen runbook’s signals list and re-embed. Over a few weeks your signals fields start to read like the alerts your team actually gets, and accuracy climbs without anyone writing an ML pipeline. If you want a head start on the prompts, the prompt workspace and our prompt library have classifier templates tuned for this, and the broader automation category covers where this fits in a self-healing stack.
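The append-and-re-embed step can be sketched as a small batch job over the feedback records. This assumes records shaped like the ones `record_feedback` stores, plus a hypothetical `alert_names` mapping from alert id to alert name; `apply_corrections` is my own name for it, and a rejected route is assumed to be stored with `chosen` set to `None`.

```python
def apply_corrections(index: list[dict], feedback: list[dict], alert_names: dict) -> set:
    """Append corrected alerts' names to the chosen runbook's signals.

    Returns the ids of runbooks whose metadata changed and need re-embedding.
    """
    by_id = {rb["id"]: rb for rb in index}
    changed = set()
    for fb in feedback:
        if fb["correct"] or fb["chosen"] is None:
            continue  # confirmed routes and rejections teach no new signal
        rb = by_id.get(fb["chosen"])
        signal = alert_names.get(fb["alert_id"])
        if rb is None or not signal:
            continue
        # case-insensitive dedupe so the signals list does not bloat
        if signal.lower() not in {s.lower() for s in rb["signals"]}:
            rb["signals"].append(signal)
            changed.add(rb["id"])
    return changed
```

The returned set tells you exactly which vectors to recompute, so a correction costs one embedding call rather than a full index rebuild.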

One caution: never wire the router straight into something holding production credentials. The router suggests a runbook; the runbook may contain automation; that automation runs under its own scoped, least-privilege identity behind its own approval. Keep those layers separate.

Conclusion

Runbook selection is the unglamorous bottleneck that quietly costs you minutes on every incident, and it’s the rare AI use case with low downside: the worst a good router does is suggest a runbook a human declines. Build the metadata index, embed it, let a cheap model rank-and-explain, gate hard on confidence and ambiguity, and feed corrections back in. Do that and your 200 runbooks stop being a graveyard and start being a map.