Skip to content
CloudOps
Newsletter
All guides
Reduce MTTR with AI By James Joyner IV · · 10 min read

Surfacing the Right Runbook and the Exact Next Command With AI

Knowing the cause but hunting for the runbook wastes MTTR. Use AI to surface the right runbook and the exact next command, verify-first, so mitigation starts fast.

  • #reduce-mttr
  • #mttr
  • #ai
  • #runbooks
  • #on-call

We knew exactly what was wrong eleven minutes in: the Kafka consumer group had stalled and lag was climbing toward the point where we’d start dropping events. And then we lost another eight minutes — not diagnosing, not deciding, but hunting. Which wiki page had the consumer-reset procedure? Was it the one with the --reset-offsets flag or the older one that said never to use that flag in production? The cause was solved; the response wasn’t, because the right runbook and the exact command were buried in a wiki nobody had touched in months.

This is one of the most maddening slices of MTTR: you know what to do in the abstract but can’t find the precise, current, copy-pasteable steps. AI closes that gap by retrieving the right procedure and surfacing the exact next command — as a proposal you verify, never a button it presses.

The gap between “what” and “exactly how”

Diagnosis ends with a sentence: “the consumer group is stalled.” Mitigation needs a command: the specific kafka-consumer-groups.sh invocation with the right group, topic, and reset strategy for your cluster. Between those two lives a swamp of stale wiki pages, three conflicting runbooks, and a Slack thread from last year where someone figured it out and never wrote it down. Crossing that swamp by hand, under pressure, is where the minutes go. The whole MTTR funnel has these handoff gaps, and this one is pure retrieval — perfect for AI.

Make your runbooks retrievable, then let AI rank them

AI can’t surface a good runbook from a bad corpus. The prerequisite is that your procedures exist as searchable text with clear titles and a “when to use this” line at the top. Given that, retrieval is the easy part. I keep runbooks as Markdown in a repo and let the assistant search across them:

# Find candidate runbooks for a symptom
grep -ril "consumer.*lag\|consumer.*stall\|offset.*reset" runbooks/
# Pull the "when to use" headers so you can compare candidates fast
grep -A2 "^## When to use" runbooks/kafka-*.md

Then I hand the candidates and the incident context to the model and ask it to pick and justify:

You are surfacing the correct runbook for an active incident. Given the confirmed cause, the affected component, and these candidate runbooks (titles + “when to use” sections), rank the candidates by fit. For the top match, extract the exact next command(s) with the incident-specific values filled in, and flag any prerequisite or destructive step. If two runbooks conflict, say so and explain which applies and why. Do not run anything. Mark any value you inferred rather than read from the runbook.

The output:

Best match: runbooks/kafka-consumer-reset.md — its “when to use” covers a stalled group with climbing lag, which matches the confirmed cause. The older kafka-offset-fix.md is for corrupted offsets and explicitly warns against the reset flag; it does not apply here. Next command (verify before running):

kafka-consumer-groups.sh --bootstrap-server $BROKER \
  --group events-processor --topic user-events \
  --reset-offsets --to-latest --dry-run

Prerequisite: Consumer must be stopped first or the reset is rejected. Destructive flag note: --to-latest skips un-processed events; confirm that’s acceptable for this topic before dropping --dry-run.

That’s the thing I spent eight minutes hunting for, delivered with the conflicting runbook resolved and the dangerous flag flagged.

Verify-first is non-negotiable here

This phase is where AI is most tempting to over-trust and most dangerous if you do. A surfaced command runs against production. The model filled in --group events-processor — did it read that from the runbook or guess it from context? That’s why the prompt forces it to mark inferred values, and why every command comes with --dry-run first.

My rule: the AI proposes the command, a human reads it, and a human runs it. Always. The two checks I run before pressing enter on anything AI surfaced:

# 1. Confirm the group name the command targets actually exists and is the stalled one
kafka-consumer-groups.sh --bootstrap-server $BROKER --list | grep events-processor

# 2. Run the dry-run and read what it WOULD do
kafka-consumer-groups.sh --bootstrap-server $BROKER \
  --group events-processor --topic user-events \
  --reset-offsets --to-latest --dry-run

The dry-run output shows the exact offset deltas. If they look wrong — say it would skip a million events when I expected a handful — I stop. AI surfacing the command saved me the retrieval time; it did not earn the right to skip my verification.

What “surfacing the next command” should never be

It should never be auto-execution. The line is bright: AI reads, ranks, retrieves, and fills in values as a draft. It does not touch prod. A model that auto-runs the surfaced command will eventually surface a confidently-wrong one against the wrong cluster, and there’s no undo on a production offset reset. Keep the human as the hand on the keyboard.

A few habits that make this reliable:

  • Always demand the conflict callout. Most runbook swamps have contradictory pages. Forcing the model to name conflicts is what turns retrieval from a coin-flip into a decision.
  • Insist on dry-run-first commands. If a procedure has no safe preview mode, that’s a gap to fix in the runbook, not a reason to skip verification.
  • Feed back the command you actually ran. After the incident, the real command — corrected for whatever the model got wrong — goes back into the runbook. That’s how the corpus gets better and the next surface is more accurate.

You can try the retrieval-and-surface loop on the free incident assistant: paste a confirmed cause and a couple of candidate runbooks and watch it rank them and extract the next step. The prompt library has the surfacing prompt with the inferred-value and dry-run framing already in place.

The minutes between knowing the cause and starting the fix are some of the most frustrating in any incident, because the answer exists — you just can’t find it fast enough. AI is excellent at the finding. It is not allowed to do the doing. Keep that line and you’ll cut this slice of MTTR without ever handing prod to a model.

Free download · 368-page PDF

Download the Free 500-Prompt DevOps AI Toolkit

500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.

  • 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
  • Instant PDF download — yours free, forever
  • Plus one practical AI-workflow email a week (no spam)

Single opt-in · unsubscribe anytime · no spam.