DevOps AI ToolKit
AI for Prometheus & Monitoring · By James Joyner IV · 9 min read

Using AI to Build a Runbook Annotation Library for Your Alerts

Every alert should link to a runbook, but most don't because writing them is tedious. Here is how I use AI to draft alert annotations and runbooks that are useful at 3am.

  • #prometheus
  • #alerting
  • #runbooks
  • #ai
  • #incident-response

Here’s an uncomfortable truth about most alerting setups: the alerts fire correctly and the runbooks they link to either don’t exist or say something useless like “investigate the issue.” I’ve been the bleary-eyed on-call who clicked a runbook_url at 3am and landed on a 404, and it is a special kind of despair. The reason runbooks are bad isn’t that engineers don’t care — it’s that writing them is tedious, repetitive, and always lower priority than the next feature. That tedium is precisely what AI is for. I’ve used it to turn a directory of bare alert rules into a real annotation-and-runbook library, and the trick is letting the model draft while keeping the operational truth human-owned.

What a useful alert annotation contains

Before generating anything, it’s worth being clear about what good looks like, because AI will pad annotations with fluff if you don’t constrain it. A useful alert annotation answers four questions fast: what is broken, how bad is it, what’s the immediate first check, and where’s the deeper runbook. In a Prometheus rule, summary and description answer the first three (with the severity label backing up “how bad”), and runbook_url points to the fourth:

- alert: "ServiceLatencyP95High"
  expr: 'service:request_latency:p95_5m > 0.5'
  for: 10m
  labels:
    severity: warning
    team: payments
  annotations:
    summary: "p95 latency for {{ $labels.service }} above 500ms for 10m"
    description: "Sustained p95 over SLO target. First check: dependency latency dashboard and recent deploys for {{ $labels.service }}."
    runbook_url: "https://runbooks.internal/payments/latency"

The summary and description templating with {{ $labels.service }} is exactly the kind of mechanical, repetitive work AI does well across dozens of rules at once.
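One way to hold a rule file to that four-question standard is a small lint pass. The sketch below is an illustration, not part of Prometheus: it assumes the rules have already been parsed into Python dicts (for example with a YAML loader), and the names lint_rule and REQUIRED are hypothetical.

```python
# Hypothetical lint pass: every alert must carry what the on-call needs.
REQUIRED = ("summary", "description", "runbook_url")


def lint_rule(rule):
    """Return human-readable problems for one parsed alerting rule."""
    name = rule.get("alert", "<unnamed>")
    annotations = rule.get("annotations") or {}
    labels = rule.get("labels") or {}
    problems = []
    for key in REQUIRED:
        # Treat a missing key and a blank value the same way.
        if not str(annotations.get(key, "")).strip():
            problems.append(f"{name}: missing annotation '{key}'")
    if "severity" not in labels:
        problems.append(f"{name}: missing label 'severity'")
    return problems


# Example: a bare rule with only an expression fails all four checks.
bare = {"alert": "ServiceLatencyP95High", "expr": "service:request_latency:p95_5m > 0.5"}
for problem in lint_rule(bare):
    print(problem)
```

Run over every rule in a file, an empty result means the structure is in place; it says nothing about whether the content is right, which stays a human call.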

Generate annotations across the whole rule file

My first move is bulk. I paste an entire alert rule file with bare or missing annotations and prompt:

For each alert rule below, write a summary and description annotation. The summary must name the affected entity using label templating. The description must state severity context and the single most useful first diagnostic step. Do not invent runbook content — leave runbook_url as a placeholder I’ll fill.

That last instruction matters enormously. Left unconstrained, the model will write confident-sounding remediation content that is wrong for your architecture, and even the single first diagnostic step it is allowed to write can be wrong. Scoping it to one step keeps the review small enough that I actually check each one, which keeps the operational claims truthful. The model is a fast junior engineer drafting; I’m the senior who knows whether “check the dependency dashboard” is actually the right first move for this service.
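The mechanical parts of the prompt's contract can be checked before the human review starts. This is a minimal sketch under stated assumptions: the model's output has been parsed into a dict of annotations, and the prompt told it to leave the literal string TODO in runbook_url. Both the TODO marker and the review_draft name are hypothetical.

```python
import re

# Matches a Go-template expression that references some label,
# e.g. {{ $labels.service }} or {{ $labels.service | toUpper }}.
LABEL_REF = re.compile(r"\{\{[^}]*\$labels\.\w+[^}]*\}\}")
# Hypothetical placeholder the prompt asks the model to leave untouched.
PLACEHOLDER = "TODO"


def review_draft(alert, annotations):
    """Return the ways a drafted annotation block breaks the prompt's rules."""
    problems = []
    if not LABEL_REF.search(annotations.get("summary", "")):
        problems.append(alert + ": summary does not use label templating")
    if not annotations.get("description", "").strip():
        problems.append(alert + ": description is empty")
    if annotations.get("runbook_url", PLACEHOLDER) != PLACEHOLDER:
        problems.append(alert + ": runbook_url was filled in by the model")
    return problems
```

A clean result only means the draft has the right shape. Whether the first diagnostic step is the right one for the service still needs someone who knows the service.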

Pro Tip: Never let AI write the runbook’s actual remediation steps from scratch — it will produce plausible, generic, and occasionally dangerous advice. Instead, feed it YOUR real incident notes or a past postmortem and ask it to structure them into a runbook. Drafting from your truth beats generating from its imagination.

Turn postmortems into runbooks

The genuinely high-value workflow is converting incident history into runbooks. After an incident, the Slack thread and postmortem contain the real remediation knowledge — it’s just unstructured. I feed that raw material to the model:

Here’s the postmortem for last week’s checkout latency incident. Convert the remediation section into a runbook with: symptoms, likely causes ranked by probability, diagnostic queries to run, and remediation steps. Keep only what’s in the source — flag anything you’re inferring.

Because the input is real operational experience, the output is grounded. The “flag anything you’re inferring” instruction surfaces exactly where the model is guessing so I can verify or delete those parts. A runbook built from a real incident and reviewed by someone who was there is worth ten generated from first principles.
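If the prompt also asks the model to tag every inferred line with a fixed marker, the guesses can be pulled out for review automatically. The sketch below assumes a marker of [INFERRED]; that tag, and the split_inferred name, are hypothetical conventions rather than anything the model produces by default.

```python
# Hypothetical tag the prompt asks the model to put on any line it inferred.
MARKER = "[INFERRED]"


def split_inferred(runbook_text):
    """Separate runbook lines into (grounded, inferred) lists, skipping blanks."""
    grounded, inferred = [], []
    for line in runbook_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if MARKER in stripped:
            inferred.append(stripped)
        else:
            grounded.append(stripped)
    return grounded, inferred


draft = """Symptoms: checkout p95 above 500ms
Likely cause: connection pool exhaustion [INFERRED]
Remediation: roll back the last deploy"""
grounded, inferred = split_inferred(draft)
print(len(grounded), "grounded,", len(inferred), "to verify or delete")
```

The inferred list is the review queue. An untagged line is not proof of grounding, so the grounded lines still get read against the postmortem.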

Keep diagnostic queries copy-pasteable

The best runbook section is a list of PromQL queries the on-call can paste straight into the expression browser. I have the model extract these from the alert and related metrics:

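# Is it all instances or one?
# (Only answerable if this recording rule keeps an instance label; a service: prefix usually means it is aggregated per service.)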
service:request_latency:p95_5m{service="checkout"}

# Did a deploy correlate with the spike?
changes(kube_deployment_status_observed_generation{deployment="checkout"}[1h])

# Upstream dependency latency?
histogram_quantile(0.95, sum by (le) (rate(db_query_duration_seconds_bucket{service="checkout"}[5m])))

I verify each query returns data in our actual cluster before it goes in the runbook, because a diagnostic query that errors at 3am is worse than no query. This is the same metric-existence discipline I apply to alert rules themselves.
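That verification step can be scripted against the Prometheus HTTP API's instant-query endpoint, /api/v1/query. This is a minimal sketch: the helper names are hypothetical, and the base URL in the usage note is an assumption about where your Prometheus listens.

```python
import json
import urllib.parse
import urllib.request


def instant_query(base_url, promql):
    """Call /api/v1/query on a Prometheus server and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def returns_data(response):
    """True when the query succeeded and produced at least one result."""
    if response.get("status") != "success":
        return False
    return len(response.get("data", {}).get("result", [])) > 0


def check_queries(queries, fetch):
    """Map each PromQL string to True/False; fetch(promql) returns the API response."""
    results = {}
    for promql in queries:
        try:
            results[promql] = returns_data(fetch(promql))
        except Exception:
            # Network failures and HTTP errors (e.g. a parse error) count as failing.
            results[promql] = False
    return results
```

A call might look like check_queries(queries, lambda q: instant_query("http://localhost:9090", q)). Passing fetch in as a parameter keeps the check testable without a live server. Note that a healthy system can legitimately return no data for some queries, so an empty result is a prompt to look, not an automatic rejection.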

Keep annotations honest with templating that actually resolves

A runbook link is useless if it 404s, and an annotation is misleading if its templating silently fails to resolve. AI loves to write expressive Go-template annotations ({{ $value | humanizePercentage }}, {{ range $labels }}, conditional severity text), and a surprising fraction of them break because the label they reference doesn’t exist on the alert’s series, or the function name is slightly off. The two failures behave differently. A misspelled function name is the loud one: the template fails to parse, which promtool check rules reports before the rule is ever loaded. A missing label is the quiet one: it renders as an empty string, and now your on-call sees “p95 latency for above” with a hole where the service name should be.

So I treat annotation templates as code that must be exercised, not just written. I ask the model to list every label and $value reference its annotation depends on, then I confirm each one is actually present on the alert’s expression output:

annotations:
  summary: "{{ $labels.service }} p95 at {{ $value | humanize }}s (target 0.5s)"
  description: "Burning toward SLO breach. Severity escalates if sustained 30m."

If the alert expression aggregates away the service label, that {{ $labels.service }} renders empty — a classic mismatch the model won’t catch unless I make it cross-check the annotation’s references against the expression’s surviving labels. This is the same explainability discipline as everywhere: the annotation has to be verifiably correct, not just plausibly worded, before a tired human relies on it.
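The cross-check itself is a set difference. This is a sketch with hypothetical helper names, assuming the surviving label names have already been collected, for example from the metric keys of an instant-query result for the alert expression.

```python
import re

# Finds dotted label references such as $labels.service inside a template.
# It does not catch the index form, e.g. {{ index $labels "service" }}.
LABEL_REF = re.compile(r"\$labels\.([A-Za-z_][A-Za-z0-9_]*)")


def referenced_labels(annotations):
    """Collect every label name the annotation templates depend on."""
    refs = set()
    for text in annotations.values():
        refs.update(LABEL_REF.findall(str(text)))
    return refs


def missing_labels(annotations, series_labels):
    """Labels the templates reference that the expression output does not carry."""
    return referenced_labels(annotations) - set(series_labels)


annotations = {
    "summary": "{{ $labels.service }} p95 at {{ $value | humanize }}s (target 0.5s)",
    "description": "Burning toward SLO breach. Severity escalates if sustained 30m.",
}
# Suppose the expression aggregated everything away except the job label.
print(missing_labels(annotations, {"job"}))
```

Any name in the result is a label that will render empty in the fired alert, so either the expression needs to keep it or the annotation needs to stop referencing it.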

Wire it together and keep it reviewable

Annotations live in the rule files; runbooks live in a docs repo; the runbook_url connects them. Because both are version-controlled, they go through review — our code review dashboard catches dead links and annotations that reference labels the alert doesn’t actually have. When I’m authoring new alerts, the free Alert Rule Generator scaffolds the annotation block and runbook placeholder so the structure is there from the start and never gets skipped. During a live incident, the incident response dashboard is where this annotation work pays off, because the on-call actually has somewhere to go.
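A dead-link check of the kind described can be approximated with a HEAD request per runbook_url. The sketch below is illustrative only and says nothing about how the review tooling mentioned above is implemented; the function names are hypothetical, and it assumes rules parsed into dicts and runbook pages that answer HEAD requests without authentication.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a missing runbook page
    except (urllib.error.URLError, ValueError, OSError):
        return None  # DNS failure, refused connection, malformed URL


def dead_runbook_links(rules, status_of=http_status):
    """List (alert, url, status) for rules whose runbook link is missing or broken."""
    dead = []
    for rule in rules:
        url = (rule.get("annotations") or {}).get("runbook_url", "")
        status = status_of(url) if url else None
        if status is None or status >= 400:
            dead.append((rule.get("alert", "<unnamed>"), url, status))
    return dead
```

Run in CI on the rules repo, a non-empty result can fail the build, which is how the 3am 404 gets caught at review time instead.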

I draft annotations in bulk with Claude and convert postmortems with ChatGPT; a self-hosted model like Gemma is appealing here too when the postmortems contain sensitive internal detail you’d rather not send to a hosted API. For consistency across hundreds of alerts I keep the “write summary and description for each rule, leave runbook_url as a placeholder” prompt saved in a prompt workspace, so every batch of new alerts comes out with the same annotation shape and the same hard requirement that a human fills the operational content. The library compounds: once you’ve converted a dozen postmortems into structured runbooks, you can include them as examples in the prompt, so the model gets better at matching new alerts to the patterns you’ve already established, and the on-call experience improves with every incident you feed back in.

Conclusion

Runbooks are bad because they’re boring to write, and “boring to write” is the sweet spot for AI. The model can draft annotations across hundreds of rules and structure your real postmortems into usable runbooks in minutes. What it cannot do is supply the operational truth — that has to come from your incidents and your engineers, with a human verifying every remediation claim before an exhausted on-call follows it. Draft with AI, ground in reality, review before it ships. More in alertmanager routing without losing your mind and the monitoring guides.
