Reduce MTTR with AI · By James Joyner IV · 10 min read

Have We Seen This Before? Matching Symptoms to Past Fixes With AI

Re-solving a known incident from scratch wrecks MTTR. Use AI to match live symptoms to past fixes quickly, then verify the match before reusing the fix, so you recall the answer instead of rediscovering it.

  • #reduce-mttr
  • #mttr
  • #ai
  • #knowledge-base
  • #sre

Forty minutes into diagnosing a gnarly intermittent timeout, someone finally scrolled far enough back in Slack and said the words that make your stomach drop: “wait, didn’t this exact thing happen in March?” It had. We’d diagnosed it, fixed it, and written a perfectly good postmortem — which nobody read, because nobody connected the symptom in front of them to the memory of last quarter. We spent forty minutes rediscovering a fix we already owned. Re-solving known incidents is one of the most demoralizing ways to burn MTTR, and it happens constantly because human recall doesn’t search well under pressure.

The thing humans are bad at, fuzzy-matching a live symptom against a pile of old incidents, is the thing AI is genuinely good at. Call this phase recall: used right, it turns “let’s debug this” into “we’ve seen this, here’s what worked.”

Why we keep rediscovering fixes

Past incidents pile up in three forms: postmortem docs, resolved tickets, and chat threads. They’re full of the answers, but they’re indexed by date and title, not by symptom. When you’re staring at “intermittent 504s on the checkout path,” the relevant March postmortem is titled “Payments degradation” and the symptom is buried in paragraph four. Keyword search misses it because you searched “timeout” and the doc said “latency.” So you start from scratch. This is a recurring leak across the MTTR funnel: the knowledge exists, but it’s not reachable from the symptom you actually have.

Fuzzy semantic matching is exactly where AI beats keyword search. It can connect “intermittent 504s on checkout” to “payments latency spikes under connection-pool exhaustion” even when they share almost no literal words.
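To make the keyword miss concrete, here is a rough sketch that counts the literal words two descriptions share. The helper name shared_words is made up for illustration. It is not a real search tool, just a stand-in for what keyword matching has to work with.

```shell
# shared_words A B: count the distinct lowercase words that appear in both strings.
# Hypothetical helper for illustration only.
shared_words() {
  a=$(printf '%s\n' "$1" | tr 'A-Z' 'a-z' | tr -cs 'a-z0-9' '\n' | sort -u)
  b=$(printf '%s\n' "$2" | tr 'A-Z' 'a-z' | tr -cs 'a-z0-9' '\n' | sort -u)
  # Each list is already de-duplicated, so a word printed by uniq -d is in both.
  printf '%s\n%s\n' "$a" "$b" | sed '/^$/d' | sort | uniq -d | wc -l | tr -d ' '
}

shared_words "intermittent 504s on checkout" \
  "payments latency spikes under connection-pool exhaustion"   # prints 0
```

Zero shared words means a keyword search has nothing to grab, even though an engineer who lived through March would call these the same incident.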

Match the symptom, surface the past fix

The move is to take your current symptom set and ask the model to find the closest past incidents and explain the match, so you can judge whether it’s really the same thing.

You are matching a live incident to past incidents. Given the current symptoms (alert, error pattern, affected component, timing) and these past incident records (title + symptom summary + root cause + fix), return the top 3 closest matches ranked by similarity. For each: the match confidence, which specific symptoms align and which differ, the root cause and fix from that incident, and the single check that would confirm this is the same failure mode. Do not assume it’s a match — surface the differences prominently. Do not recommend applying the old fix; produce the candidates for a human to verify.
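One way to assemble that prompt from a symptom string and a corpus file is sketched below. The function name build_match_prompt and the section labels are my own assumptions, and the instructions are abbreviated from the prompt above; how you send the result to a model depends on your tooling and is left out.

```shell
# build_match_prompt SYMPTOMS CORPUS_FILE
# Prints the matcher instructions, then the live symptoms, then the past records.
# Function name and section labels are illustrative assumptions.
build_match_prompt() {
  symptoms=$1
  corpus_file=$2
  cat <<'EOF'
You are matching a live incident to past incidents. Return the top 3 closest
matches ranked by similarity, with confidence, aligned and differing symptoms,
root cause, fix, and the single check that would confirm the same failure mode.
Surface the differences prominently. Do not recommend applying the old fix.
EOF
  printf '\nCURRENT SYMPTOMS:\n%s\n' "$symptoms"
  printf '\nPAST INCIDENTS:\n'
  cat "$corpus_file"
}

# Example (assumes the corpus file already exists):
# build_match_prompt "intermittent 504s, checkout path, load-correlated" /tmp/incident-corpus.txt
```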

What comes back is a recall list with its reasoning exposed:

  1. “Payments degradation” (March 14) — High confidence. Aligns: intermittent 504s, checkout path, load-correlated. Differs: March was us-east only; today spans two regions. Cause: connection-pool exhaustion under traffic spike. Fix: raised pool size + added circuit breaker. Confirm check: pg_stat_activity count vs max_connections.
  2. “Checkout timeouts” (Jan 9) — Medium. Aligns: 504s on checkout. Differs: Jan was a single bad deploy, cleared by rollback; no deploy correlation today. Confirm check: kubectl rollout history.
  3. “CDN origin timeouts” (last week) — Low. Aligns: timeouts. Differs: CDN-layer, not app; checkout not specifically affected. Probably unrelated.

That list is the forty minutes I should have saved. It doesn’t tell me it is the March incident — it tells me which past incident is worth checking first, and exactly which check confirms it.
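If you want the list as a literal work queue, a small sketch like this orders candidates so the highest-confidence check runs first. The pipe-delimited input format and the name order_candidates are assumptions for illustration, not the assistant's actual output format.

```shell
# order_candidates: read "confidence|title|confirm check" lines on stdin and
# print them High first, then Medium, then Low. Unknown labels sort last.
# Input order is preserved within the same confidence level.
order_candidates() {
  awk -F'|' '
    BEGIN { rank["High"] = 1; rank["Medium"] = 2; rank["Low"] = 3 }
    { r = ($1 in rank) ? rank[$1] : 4; print r "|" NR "|" $0 }
  ' | sort -t'|' -k1,1n -k2,2n | cut -d'|' -f3-
}

printf '%s\n' \
  'Low|CDN origin timeouts|none' \
  'High|Payments degradation|pg_stat_activity count vs max_connections' \
  'Medium|Checkout timeouts|kubectl rollout history' \
  | order_candidates
```

The ordering only decides what to check first. It says nothing about which candidate is actually right.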

Verify the match before you reuse the fix

The dangerous shortcut is “the AI says it’s March, apply the March fix.” Don’t. A symptom match is a lead, not a diagnosis. The March incident differs from today in region scope, and that difference might matter — maybe today’s cause is related but not identical. So I run the confirm check the match handed me:

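# Confirm the March failure mode: is the connection pool actually exhausted again?
# Prints the current connection count next to max_connections.
# (Assumes psql inside the payments container can reach the database with its default environment.)
kubectl exec -n payments deploy/payments -- psql -tc \
  "select count(*), (select setting::int from pg_settings where name='max_connections') \
   from pg_stat_activity;"
# Cross-check the difference the match flagged: is it really multi-region this time?
# -G with --data-urlencode lets curl encode the PromQL, so the [5m] brackets
# are not treated as a URL glob range.
curl -sG "http://prom:9090/api/v1/query" \
  --data-urlencode 'query=sum by (region) (rate(payment_errors_total[5m]))' \
  | jq -r '.data.result[] | "\(.metric.region): \(.value[1])"'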

If the pool is exhausted and I understand why it’s now multi-region, I have a verified match and the March fix is a strong candidate. If the pool is fine, the match is wrong despite high confidence, and I’ve lost thirty seconds instead of anchoring on a false lead. Verify-first is what makes fast recall safe recall.
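The decision itself can be written down so nobody eyeballs it at 3 a.m. This is a minimal sketch: pool_verdict is a hypothetical helper, and the 90 percent default is an assumed cutoff, not a number from the March postmortem.

```shell
# pool_verdict ACTIVE MAX [THRESHOLD_PCT]
# Prints "exhausted" when ACTIVE/MAX is at or above the threshold, else "ok".
# The 90 percent default is an assumption; tune it to your own pool behavior.
pool_verdict() {
  active=$1
  max=$2
  threshold=${3:-90}
  if [ "$max" -le 0 ]; then
    echo "unknown"
    return 1
  fi
  # Integer math only: compare active*100 against max*threshold.
  if [ $((active * 100)) -ge $((max * threshold)) ]; then
    echo "exhausted"
  else
    echo "ok"
  fi
}

pool_verdict 195 200   # prints "exhausted"
pool_verdict 40 200    # prints "ok"
```

An "ok" here is the signal to drop the March candidate and move to the next match on the list.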

Good matching requires you to feed it good history

AI can only match against what you give it, so this phase rewards teams that keep incident records in a consistent, symptom-forward shape. A little structure pays off enormously:

# Pull past incidents into a matchable corpus: symptom + cause + fix, one record each
# (assumes every postmortem has a "## Symptoms" heading with the record in the four lines after it)
grep -A4 "^## Symptoms" postmortems/*.md > /tmp/incident-corpus.txt

# A postmortem header worth matching against
incident: payments-degradation-2026-03-14
symptoms: ["intermittent 504s", "checkout path", "load-correlated"]
root_cause: "connection-pool exhaustion under traffic spike"
fix: "raised pool size to 200; added circuit breaker on payments-gateway"

When every postmortem leads with explicit symptoms and the fix that worked, the match quality jumps. A pile of free-text docs matches poorly; a corpus of symptom/cause/fix records matches well. Closing this MTTR leak is partly an AI problem and partly a hygiene problem.
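A cheap way to enforce the hygiene half is a check that flags postmortems missing any of the four header fields. This is a sketch that assumes the key names from the header above; missing_fields is a hypothetical helper, not part of any existing tool.

```shell
# missing_fields FILE: print each required key that has no "key:" line in FILE.
# Prints nothing when the record is complete.
missing_fields() {
  for key in incident symptoms root_cause fix; do
    grep -q "^${key}:" "$1" || echo "$key"
  done
}

# Example (hypothetical path):
# missing_fields postmortems/payments-degradation.md
```

Run over the whole postmortems directory in CI or a pre-merge hook, it keeps unmatchable records from entering the corpus.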

A few rules that keep recall trustworthy:

  • Demand the differences, not just the similarities. A match that only lists what aligns is how you get burned applying a fix to a subtly different failure. The differences are where verification focuses.
  • Treat confidence as search order, not truth. High confidence means “check this first,” never “this is it.”
  • Feed today’s incident back into the corpus. Every resolved incident makes the next match better — if you record it in the matchable shape.
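For the last rule, feeding an incident back can be as small as appending one record in the same shape. This is a sketch under assumptions: append_incident is a made-up name, the "---" separator is my own choice, and values containing double quotes would need escaping that this version does not do.

```shell
# append_incident CORPUS ID SYMPTOMS ROOT_CAUSE FIX
# Appends one symptom/cause/fix record to the corpus file.
# SYMPTOMS is passed through as-is, so give it in list form: '["a", "b"]'.
append_incident() {
  corpus=$1
  {
    printf 'incident: %s\n' "$2"
    printf 'symptoms: %s\n' "$3"
    printf 'root_cause: "%s"\n' "$4"
    printf 'fix: "%s"\n' "$5"
    printf '%s\n' '---'
  } >> "$corpus"
}

# Example (all values are placeholders):
# append_incident /tmp/incident-corpus.txt "<incident-id>" '["<symptom>"]' "<root cause>" "<fix>"
```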

You can try symptom matching on the free incident assistant: paste live symptoms and a few past incident summaries and watch it rank the matches with their differences called out. The prompt library has the matcher prompt with the difference-surfacing and confirm-check framing built in.

The fix to your current incident may already exist, fully solved, in a doc nobody can recall under pressure. Human memory doesn’t search; AI does. Let it surface the candidate, then verify the match with one check before you reuse the answer. That’s how you stop rediscovering fixes and start recalling them — and recall is a lot faster than rediscovery.
