AI-Assisted Incident Response: What Actually Helps at 3 AM

At 3 AM, the coffee is cold, the dashboard is red, and you have about ninety seconds before the on-call rotation starts asking “what’s the status?” This is not the moment for generic AI advice.

Most engineers I’ve watched use AI during real incidents end up frustrated. The model wants to explain things; the on-call wants to do things. After enough rounds of this, you start to learn what AI is actually good at during an incident — and what to skip.

Where AI genuinely helps during an incident

Reading unfamiliar log formats

You inherit on-call for a team you don’t normally cover. Their app dumps Java stack traces with custom annotation prefixes, and you can’t tell what’s signal versus framework noise. Paste 200 lines into Claude with “what is this error pattern, and which lines are the actual cause?” — you get a usable answer in seconds.

This works because the model has seen millions of stack traces. It’s not pattern-matching on your specific service; it’s pattern-matching on Java stack trace shape, then narrowing.
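If the trace is too long to paste whole, you can trim it first. Here is a minimal sketch, assuming that frames from the JDK and Spring count as framework noise. The prefix list and the function name `trim_stack_trace` are my own illustration, not something any logging library provides, so adjust them for the stack you are covering.

```python
import re

# Assumption: frames under these package prefixes are framework noise.
# Adjust the tuple for the service you are actually on call for.
NOISE_PREFIXES = ("java.", "javax.", "jdk.", "sun.", "org.springframework.")

# Matches frame lines such as "\tat com.example.Foo.bar(Foo.java:42)".
FRAME_RE = re.compile(r"^\s*at\s+([\w.$/<>]+)\(")


def trim_stack_trace(text, noise_prefixes=NOISE_PREFIXES):
    """Keep exception headers, 'Caused by' lines, and non-framework frames."""
    kept = []
    for line in text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            # Not a frame: an exception message, "Caused by: ...", or "... 12 more".
            stripped = line.strip()
            if stripped and not stripped.startswith("..."):
                kept.append(line)
            continue
        if not match.group(1).startswith(noise_prefixes):
            kept.append(line)
    return "\n".join(kept)
```

The trimmed output is a starting point for the prompt, not a diagnosis. If the answer looks off, paste the untrimmed trace, because the filter may have dropped the frame that mattered.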

Generating one-off `jq` / `awk` / `kubectl` invocations

You need to find “all pods on node X that have restarted more than 3 times in the last hour, grouped by namespace, sorted by restart count.” You could write it yourself in five minutes. Or you could ask Claude and have it in fifteen seconds — and copy-paste with confidence because the command is short enough to read.

The sweet spot: commands you’d write yourself but slowly, where the model gets you 80% of the way and you verify the rest at a glance.
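To show what "short enough to read" looks like, here is a sketch of the restart query as a small Python function over the parsed output of `kubectl get pods --all-namespaces -o json`. The function name and the threshold default are my own. One caveat to verify yourself: `restartCount` is cumulative over the pod's lifetime, so this sketch does not implement the "in the last hour" part of the request.

```python
from collections import defaultdict


def restarts_by_namespace(pod_list, node, min_restarts=3):
    """Group pods on `node` with more than `min_restarts` restarts by namespace.

    `pod_list` is the parsed JSON from `kubectl get pods --all-namespaces -o json`.
    Note: restartCount is cumulative, not limited to a recent time window.
    """
    grouped = defaultdict(list)
    for pod in pod_list.get("items", []):
        if pod.get("spec", {}).get("nodeName") != node:
            continue
        statuses = pod.get("status", {}).get("containerStatuses", [])
        # Sum restarts across all containers in the pod.
        restarts = sum(s.get("restartCount", 0) for s in statuses)
        if restarts > min_restarts:
            meta = pod["metadata"]
            grouped[meta["namespace"]].append((meta["name"], restarts))
    # Within each namespace, list the most-restarted pods first.
    return {
        ns: sorted(pods, key=lambda p: p[1], reverse=True)
        for ns, pods in sorted(grouped.items())
    }
```

Every field it touches (`spec.nodeName`, `status.containerStatuses`, `restartCount`) can be checked against one real pod's JSON in a few seconds, which is the bar any generated command should clear.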

Drafting Slack updates for stakeholders

This is the most underrated AI-in-incidents use case. You’ve been debugging for thirty minutes; the VP of Engineering DMs you. You don’t have the bandwidth to write a clear status update and keep debugging. Drop your scratch notes into Claude with “turn this into a 4-line Slack update — no jargon, no apologies, status + next action.”

The model writes the update; you read, fix one word, send. Total cost: 20 seconds. The alternative (you writing it from scratch) is two minutes of context-switching that you don’t have.

Postmortem first draft

Once the incident is over, you’re tired and you don’t want to write. Dump your timeline of Slack messages, terminal commands, and observations into a prompt; ask for a structured postmortem in your team’s format. Iterate. This is genuinely faster than the blank page.
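The "dump your timeline" step goes better if the pieces are already in order. Here is a rough sketch that merges entries from several sources into one chronological block to paste into the prompt. The `(timestamp, source, text)` tuple shape and the function name `build_timeline` are assumptions for illustration. Your Slack export and shell history will need their own parsing first.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (iso_timestamp, source, text) entries into one chronological block.

    Assumes timestamps are ISO 8601 strings without a timezone suffix,
    e.g. "2024-05-01T03:04:00", all in the same timezone.
    """
    entries = [entry for source in sources for entry in source]
    entries.sort(key=lambda e: datetime.fromisoformat(e[0]))
    return "\n".join(
        f"{datetime.fromisoformat(ts):%H:%M} [{src}] {text}"
        for ts, src, text in entries
    )
```

A merged, ordered timeline gives the model less room to invent a sequence of events, and it is easier for you to check the draft against.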

Where AI actively makes things worse

Diagnosing the actual root cause

Models do not have your cluster’s state. They will confidently suggest “check if the database connection pool is exhausted” when the real cause is a misconfigured retry loop in a sidecar. The advice is technically reasonable and completely irrelevant.

This is the most expensive AI failure mode in incidents: time spent investigating plausible-but-wrong hypotheses generated by a model that doesn’t know what’s actually broken.

Rule of thumb: if the AI suggests a hypothesis, give it the same weight as a suggestion from a peer who hasn’t logged in yet. Worth considering. Not authoritative.

Generating commands you can’t read at a glance

A 60-character kubectl invocation? Trustworthy. A 12-line jq expression with three nested selects? You will not catch the typo at 3 AM. The model will sometimes hallucinate a field name that looks plausible.

If the command is too long to read, write it yourself.
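For the hallucinated-field problem specifically, a cheap check is to confirm that every field path in the generated command exists in one real object before you trust it. Here is a minimal sketch, assuming you have a sample object as parsed JSON and the paths written in dotted form. The helper name `missing_paths` is hypothetical, and for lists it only inspects the first element.

```python
def missing_paths(sample, paths):
    """Return the dotted paths that do not resolve in `sample`.

    Lists are handled by descending into their first element, which is a
    shortcut: a field present only on later elements will be reported missing.
    """
    missing = []
    for path in paths:
        node = sample
        for key in path.split("."):
            if isinstance(node, list):
                node = node[0] if node else None
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing
```

An empty result does not prove the command is right. It only shows that the fields it names exist on the object you sampled.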

Anything involving production credentials

You will be tempted to paste production state, secrets, or credentials into the chat to get faster context. Don’t. The compliance review will find it.

For sensitive production debugging, use the AI for the shape of the problem (“what’s the typical cause of X in service Y?”) and keep the actual data in your terminal.
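If you do paste log excerpts, scrub them first. The sketch below shows the idea with three illustrative regex patterns. They are assumptions about what secrets commonly look like, not a complete scanner, and they will miss plenty (quoted JSON values, for one). Treat it as a seatbelt, not as permission to paste production data.

```python
import re

# Illustrative patterns only: an assumption about common secret shapes,
# not a substitute for a real secret scanner or your compliance policy.
REDACTIONS = [
    # Authorization headers: "Bearer <token>"
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    # key=value or key: value, including names like DB_PASSWORD
    (
        re.compile(r"(?i)(password|passwd|secret|token|api[_-]?key)(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    # Credentials embedded in URLs: scheme://user:password@host
    (re.compile(r"://([^/\s:@]+):([^/\s@]+)@"), r"://\1:[REDACTED]@"),
]


def redact(text):
    """Apply each redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Read the scrubbed text before you paste it. The patterns catch the obvious shapes, and you are still the last check.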

A workflow that works

When I’m paged, I now run something like this:

1. First 90 seconds: triage the dashboard manually. No AI. Get my own bearings.
2. First Slack update: write a one-liner myself (“Investigating elevated 5xx on payments-api, no customer impact confirmed yet, will update in 10 min”). No AI. This needs to be in my voice.
3. Investigation: open a Claude window. Use it for log parsing, command generation, and pattern recognition. Read every command before running.
4. Stakeholder update: paste my scratch notes, ask for a 4-line update, edit one word, send.
5. Resolution: do the actual fix manually. Verify it worked manually.
6. Postmortem prep: dump the timeline into AI, get a structured first draft.

Notice that the model only comes in at step 3, after I have my own bearings and have sent the first update myself, and that it never replaces my judgment on the actual fix.

What about agentic AI (“just let it run kubectl”)?

The current generation of agentic tools — Claude Code, Cursor’s agent mode, etc. — can theoretically run kubectl commands and iterate on a hypothesis. I’ve tried this during a couple of low-severity incidents on a sandbox cluster.

The results: it works, slowly. The agent will run five commands when you’d have run two. It will sometimes go down a rabbit hole that you’d have aborted in fifteen seconds.

For low-severity, no-time-pressure debugging, this can be useful. For an active incident, you’re faster doing it yourself. The agent’s strength — being thorough — is exactly wrong for the moment.

The honest bottom line

AI in incidents is a productivity tool for the tasks adjacent to the actual debugging: log parsing, command generation, stakeholder communication, postmortem prep. It is not a debugger. It is not an SRE. Treating it as either will waste time you don’t have.

If you’re new to using AI during on-call, start by using it only for Slack updates. That’s the highest-leverage, lowest-risk entry point. Once you’ve internalized that the AI is your stenographer and not your engineer, the rest follows.

For the AI-driven prompts we use during real incidents, see the Incident Response prompts — including the incident postmortem drafter and Kubernetes pod crash diagnosis prompts.