What Changed? AI Deploy Correlation for Faster MTTR

There’s an old SRE reflex that’s right more often than it has any business being: when something breaks, ask what changed. Most incidents are self-inflicted by a recent deploy, config push, or flag flip — but answering “what changed?” mid-incident is a manual slog. You’re tab-switching between the deploy dashboard, the feature-flag console, and the infra change log, eyeballing timestamps against the onset, trying to spot the one change that lines up. That cross-referencing is minutes of recoverable MTTR, and it’s exactly the kind of correlation a model can do in seconds.

The trap is obvious and worth naming up front: the change closest to the alert is the prime suspect and is frequently innocent. So the goal isn’t “find the most recent change” — it’s “rank the changes that could plausibly cause this specific symptom,” which is a sharper question.

Recency is a lead, not a verdict

Temporal correlation is seductive. A deploy landed ninety seconds before the alert, so it must be the cause — except it was a docs change, or a frontend tweak that can’t possibly spike database CPU. Anchoring on the nearest change wastes time and, worse, triggers a rollback of a blameless deploy that prolongs the incident while you wait for it to take effect. The fix is to score changes on mechanism as well as timing, the same verify-first instinct that runs through the whole MTTR category.

A model handed your onset timestamp and your change log can time-align and mechanism-score every change at once, which is precisely the cross-referencing you’d otherwise do by hand under pressure.
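To make "time-align and mechanism-score" concrete, here is a minimal Python sketch of the same ranking done by hand. Everything in it is an assumption for illustration: the `Change` record, the `PLAUSIBILITY` weights, the two-hour window, the placeholder date, and the hypothetical post-onset change. The plausibility label is the judgment you (or the model) supply; the code only shows that mechanism sorts first and recency breaks ties.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical weights: mechanism plausibility outranks recency.
PLAUSIBILITY = {"HIGH": 3, "MED": 2, "LOW": 1}


@dataclass
class Change:
    name: str
    owner: str
    at: datetime
    plausibility: str  # "HIGH", "MED" or "LOW", judged against the symptom


def rank_suspects(changes, onset, window=timedelta(hours=2)):
    """Drop changes after onset or older than the window, then rank the rest."""
    survivors = [c for c in changes if onset - window <= c.at <= onset]
    # Sort by plausibility (descending); time before onset only breaks ties.
    return sorted(
        survivors,
        key=lambda c: (-PLAUSIBILITY[c.plausibility], onset - c.at),
    )


# Example data mirroring the orders incident; the date is a placeholder.
day = datetime(2024, 1, 1)
onset = day.replace(hour=14, minute=7)
changes = [
    Change("flag new-checkout-ui 50%", "team-web", day.replace(hour=14, minute=6), "LOW"),
    Change("orders-api v4.2 deploy", "team-orders", day.replace(hour=14, minute=5), "HIGH"),
    Change("nightly index rebuild", "cron", day.replace(hour=13, minute=27), "MED"),
    # Hypothetical change that landed after onset, so it cannot be the cause.
    Change("docs site deploy", "team-docs", day.replace(hour=14, minute=10), "LOW"),
]

ranked = rank_suspects(changes, onset)
for c in ranked:
    print(c.name, c.plausibility, onset - c.at)
```

The most recent change ends up last because its plausibility is lowest, and the post-onset change is dropped before ranking.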

Ask for ranked suspects with a confirming check

The framing of the prompt keeps the model honest about correlation versus causation:

You are correlating an incident with recent changes, not declaring a cause. Onset was 14:07; symptom is elevated database CPU on the orders service. Here is the change log in the window with timestamps and owners. Time-align changes to onset (drop anything clearly after, say why). For each surviving change, score how plausibly it could produce this symptom in this scope, tied to the mechanism. Rank by timing + plausibility. Give the single read-only check that confirms or clears each top suspect. Name any change category my log probably doesn’t include. Do not recommend a rollback.

The output ranks suspects by more than recency:

| Change | Owner | vs onset | Mechanism plausibility | Confirm/clear |
| --- | --- | --- | --- | --- |
| orders-api v4.2 deploy | team-orders | -2 min | HIGH: adds a per-row query in hot path | diff the query, check pg_stat_statements for new heavy query |
| nightly index rebuild | cron | -40 min | MED: could lock tables | check pg_locks for blocked queries |
| flag new-checkout-ui 50% | team-web | -1 min | LOW: frontend only, no DB path | check flag exposure; clears if no DB calls |

Investigate orders-api v4.2 first.

The flag flipped most recently, but it’s ranked last because it can’t touch the database — which is exactly the blameless-but-recent change a naive correlation would have chased.
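If you run this often, the prompt can be assembled from structured data instead of retyped mid-incident. The sketch below is one way to do that in Python; `build_prompt` and the `(timestamp, owner, description)` tuple shape are hypothetical, and the wording simply reuses the prompt from this section.

```python
def build_prompt(onset, symptom, change_log):
    """Assemble the correlation prompt.

    change_log is a list of (timestamp, owner, description) tuples.
    """
    log_lines = "\n".join(
        f"- {ts} | {owner} | {desc}" for ts, owner, desc in change_log
    )
    return (
        "You are correlating an incident with recent changes, not declaring a cause.\n"
        f"Onset was {onset}; symptom is {symptom}.\n"
        "Change log in the window (timestamp | owner | change):\n"
        f"{log_lines}\n"
        "Time-align changes to onset (drop anything clearly after, say why).\n"
        "For each surviving change, score how plausibly it could produce this "
        "symptom in this scope, tied to the mechanism.\n"
        "Rank by timing + plausibility.\n"
        "Give the single read-only check that confirms or clears each top suspect.\n"
        "Name any change category my log probably doesn't include.\n"
        "Do not recommend a rollback."
    )


# Timestamps derived from the offsets in the example table (onset 14:07).
prompt = build_prompt(
    onset="14:07",
    symptom="elevated database CPU on the orders service",
    change_log=[
        ("14:05", "team-orders", "orders-api v4.2 deploy"),
        ("14:06", "team-web", "flag new-checkout-ui 50%"),
        ("13:27", "cron", "nightly index rebuild"),
    ],
)
print(prompt)
```

Keeping the guardrail lines ("not declaring a cause", "Do not recommend a rollback") hard-coded means they can't be dropped by accident when someone edits the change log in a hurry.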

Confirm the link before you act

A ranked suspect is still a hypothesis. Tie the change to the symptom with a read-only check:

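# Does the suspect deploy correlate with a new heavy query? (read-only)
# total_exec_time is the PostgreSQL 13+ column name; older versions call it total_time.
kubectl exec -n orders deploy/pg-primary -- psql -tc \
  "select query, calls, total_exec_time from pg_stat_statements \
   order by total_exec_time desc limit 5;"

# Tie symptom onset to the rollout time. rollout history lists revisions and
# change-cause only, so read the timestamps from the ReplicaSets.
kubectl rollout history deploy/orders-api -n orders | tail -4
kubectl get rs -n orders --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp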
In the orders incident, pg_stat_statements showed a brand-new query consuming 70% of execution time, appearing right at the v4.2 rollout — a confirmed link, not a coincidence, which made the rollback decision evidence-based rather than a guess.
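A figure like "70% of execution time" is just each query's share of the summed `total_exec_time`. As a rough sketch, the Python below parses `psql -t` style output and computes those shares. The sample rows are made up for illustration, `exec_time_shares` is a hypothetical helper, and it assumes each query prints on a single line. Note that with `limit 5` the share is relative to the rows returned, not to everything in `pg_stat_statements`.

```python
def exec_time_shares(psql_output):
    """Return (query, calls, share_of_total) for each row of psql -t output."""
    rows = []
    for line in psql_output.splitlines():
        if not line.strip():
            continue
        # Split from the right so a "|" inside the query text does not break parsing.
        query, calls, total_ms = (part.strip() for part in line.rsplit("|", 2))
        rows.append((query, int(calls), float(total_ms)))
    total = sum(ms for _, _, ms in rows)
    if total == 0:
        return []
    return [(query, calls, ms / total) for query, calls, ms in rows]


# Made-up sample in the shape psql -t prints: query | calls | total_exec_time
sample = """
 select * from order_items where order_id = $1 | 48210 | 700.0
 select id from orders where status = $1       |   912 | 200.0
 update carts set updated_at = now()           |   301 | 100.0
"""

shares = exec_time_shares(sample)
for query, calls, share in shares:
    print(f"{share:.0%}  calls={calls}  {query}")
```

A single query dominating the share is only half the link; it still has to be a query that first appears at the rollout time.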

Correlation is not causation, and your log isn’t complete

Two disciplines keep this from misleading you. First, the model must never present correlation as causation: every suspect ships with a confirming check, and a rollback waits until that check ties the change to the symptom. Second, your change log is almost never complete: a dependency team’s deploy, infra autoscaling, or a vendor-side change won’t appear in what you paste, so an empty or weak suspect list means “you’re probably missing changes,” not “nothing changed.”

Rules I hold to:

- Never roll back on timing alone. Confirm the mechanism link first. A blameless rollback costs you the rollback window and you’re no closer to the cause.
- Ask the model what categories you’re missing. The culprit is sometimes a change you didn’t think to log.
- Re-rank when a check clears the top suspect. Don’t bolt new evidence onto a stale ranking; regenerate it.
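These rules amount to a small gate in front of the rollback decision. The Python sketch below is one way to express it; the `Check` states and `next_action` are hypothetical names, and dropping cleared suspects here is a simplification that stands in for regenerating the ranking with the new evidence.

```python
from enum import Enum


class Check(Enum):
    PENDING = "pending"      # read-only check not run yet
    CONFIRMED = "confirmed"  # check ties the change to the symptom
    CLEARED = "cleared"      # check rules the change out


def next_action(ranked, checks):
    """ranked: suspect names, best first. checks: name -> Check result."""
    remaining = [
        name for name in ranked
        if checks.get(name, Check.PENDING) is not Check.CLEARED
    ]
    if not remaining:
        # Every suspect cleared: the log is probably missing changes.
        return "ask which change categories the log is missing"
    top = remaining[0]
    if checks.get(top, Check.PENDING) is Check.CONFIRMED:
        return f"rollback of {top} is evidence-based"
    # Timing alone never reaches the rollback branch.
    return f"run the read-only check for {top}"


ranked = ["orders-api v4.2 deploy", "nightly index rebuild", "flag new-checkout-ui 50%"]

print(next_action(ranked, {}))
print(next_action(ranked, {"orders-api v4.2 deploy": Check.CONFIRMED}))
print(next_action(ranked, {"orders-api v4.2 deploy": Check.CLEARED}))
```

The only path to a rollback runs through a `CONFIRMED` check, and an exhausted suspect list turns into a question about missing change categories, never into "nothing changed."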

You can practice this on the free incident assistant — paste an onset time and a change log and ask for the ranked, mechanism-scored suspects, then notice how the recent-but-implausible change drops down the list. The prompt library has a hardened deploy-correlation prompt with the correlation-isn’t-causation guardrail built in.

“What changed?” is the highest-yield question in incident response, and the time spent answering it by hand is pure MTTR you can recover. AI cross-references onset against your change log in seconds and ranks suspects by whether they could actually cause this symptom — and as long as every rollback waits for a confirming check, you get the speed without rolling back the wrong thing.
