Post Mortems with AI · By James Joyner IV · 11 min read

Multi-Team Incident Postmortems: Untangling Contributing Factors With AI

Cross-team outages produce finger-pointing postmortems. Here's how to untangle contributing factors across service boundaries with AI—and keep the review blameless.

  • #postmortems
  • #postmortem
  • #ai
  • #contributing-factors
  • #sre

The worst incident review I ever sat through had four teams in the room, and each had quietly written its own version of events before the meeting. The payments team’s draft said “the database failed over without notifying us.” The data team’s draft said “payments deployed unbounded retries that overwhelmed a degraded replica.” The platform team’s said “neither team respected the documented dependency contract.” All three of those drafts were correct. None of them was the whole story, and the meeting turned into ninety minutes of each team defending its boundary instead of understanding the system. We left without a shared timeline, which meant we left without a shared lesson.

Cross-team incidents are where postmortems most often fail, because the failure lives in the seams—the handoffs, the assumed contracts, the alert that one team thought the other owned. Each team can see its own slice clearly and the seam not at all. This is exactly the synthesis problem AI is good at: you have four partial accounts and you need one coherent timeline that respects all of them without letting any team’s framing win.

Why multi-team postmortems go sideways

A single-team incident has one narrator. A multi-team incident has several, each narrating from inside its own boundary, which produces three predictable pathologies.

Boundary blindness. Each team’s account is detailed inside its services and vague at the edges. The database team knows exactly when failover started; they have no idea what payments was doing with the connection pool during it.

Implicit-contract violations. In multi-team incidents, the most common contributing factor isn’t a bug; it’s a mismatch in assumptions about a contract that was never written down. “We assumed you’d retry with backoff.” “We assumed you’d tell us before failing over.” Nobody violated a rule because there was no rule, just two reasonable assumptions that didn’t match.

Defensive framing. Each team writes the version in which it looks reasonable. None of these versions is dishonest; each is just self-centered in the literal sense. Merge them naively and you get contradictions; merge them well and the contradictions are the finding.
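
An assumption like “we assumed you’d retry with backoff” stays implicit partly because nobody has said what “backoff” means. As an illustration only, here is a minimal sketch of one common reading: capped exponential backoff with full jitter, plus a bound on attempts. The function names and default values are hypothetical and are not taken from any team’s actual code.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Return a sleep time in seconds for a 0-based retry attempt.

    Capped exponential backoff with full jitter: the upper bound doubles
    with each attempt until it reaches `cap`, and the delay is drawn
    uniformly from [0, upper bound]. Defaults are illustrative assumptions.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    # Clamp the exponent so very large attempt numbers cannot overflow.
    upper = min(cap, base * (2 ** min(attempt, 30)))
    return random.uniform(0.0, upper)


def should_retry(attempt: int, max_attempts: int = 5) -> bool:
    """Bound the number of tries so retries cannot grow without limit."""
    return attempt < max_attempts
```

Writing the expectation down at this level of detail (a cap, jitter, a maximum attempt count) is what turns an assumption into a contract two teams can check against.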

A prompt that merges accounts and surfaces seams

I collect each team’s raw notes or channel exports and have the model build one unified, neutral timeline that explicitly tags ownership and flags the seams where accounts disagree or where a handoff happened.

You are merging incident accounts from multiple teams into ONE
neutral, blameless timeline. Inputs are separate accounts from:
<list teams>.

Produce a single chronological timeline. For each event:
- Timestamp
- What happened, stated neutrally (no team is "at fault")
- Which team's system the event occurred in (tag the owner)
- Source: which team's account this came from

Then add two analysis sections:
1. SEAMS — every point where two teams' systems interacted: a
   handoff, a dependency call, a failover, an alert one team
   expected the other to own. For each, note the ASSUMPTION each
   team appears to have held, and whether those assumptions matched.
2. CONTRADICTIONS — events the accounts describe differently. Quote
   both versions. Do not resolve them; flag for the group to reconcile.

Rules: contributing-factors framing, not single root cause. Frame
every human action as reasonable given that team's information.
Never invent events to fill gaps between accounts — mark gaps as
"NO ACCOUNT: <time range>, <which boundary>".

The SEAMS section is the one that changes the meeting. Instead of four teams defending boundaries, you get a list that reads “Payments assumed the replica was healthy; Data assumed Payments would back off on errors—these assumptions did not match.” That’s not anyone’s fault. That’s a missing contract, and now it’s a visible, fixable thing.
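
The merge prompt above has a `<list teams>` placeholder and expects each team’s account as a separate input. A minimal sketch of assembling that input, assuming the accounts are already exported as plain text; the function name and the `=== ACCOUNT FROM ===` delimiters are hypothetical choices, not a required format:

```python
def build_merge_input(instructions: str, accounts: dict[str, str]) -> str:
    """Fill the <list teams> placeholder and append each team's account.

    `instructions` is the merge prompt text; `accounts` maps a team name
    to that team's raw notes. Each account is wrapped in a labeled
    delimiter (an assumed format) so the model can cite its source.
    """
    if len(accounts) < 2:
        raise ValueError("need accounts from at least two teams to merge")
    # Alphabetical order keeps the layout deterministic, so the ordering
    # of accounts is not an editorial choice.
    teams = sorted(accounts)
    prompt = instructions.replace("<list teams>", ", ".join(teams))
    sections = [
        f"=== ACCOUNT FROM: {team} ===\n{accounts[team].strip()}\n=== END {team} ==="
        for team in teams
    ]
    return prompt.rstrip() + "\n\n" + "\n\n".join(sections) + "\n"
```

Keeping each account intact and labeled, rather than pre-merging notes by hand, is what lets the model fill in the “Source” field and quote both versions of a contradiction.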

What the merged view delivers

Seams

  • 02:14 — Database failover (Data ↔ Payments). Data initiated a manual failover. Data’s assumption: failovers are routine and self-healing; notification is best-effort. Payments’ assumption: they’d be paged before any failover affecting their primary. Assumptions did not match. No notification fired.
  • 02:21 — Retry amplification (Payments ↔ Platform). Payments’ retry logic had no jitter or cap. Payments’ assumption: the platform rate limiter would protect downstreams. Platform’s assumption: clients implement backoff; the limiter is a backstop, not a primary control. Both true, neither sufficient.

Contradictions

  • Recovery time: Data’s account says “replica healthy by 02:40”; Payments’ says “errors continued until 02:55.” Likely both correct—replica health ≠ pool recovery. Reconcile in review.

NO ACCOUNT: 02:30–02:38, platform-network boundary. No team narrated this window.

That contradiction about recovery time is gold: it’s not a lie, it’s two teams measuring different things, and reconciling it surfaces a real insight—a healthy replica didn’t mean a recovered connection pool. The “NO ACCOUNT” gap is honest about what nobody saw, which is far better than a model smoothing over the hole with invention.
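
Because the risk is a model quietly filling a hole, it can help to cross-check the “NO ACCOUNT” markers against the raw timestamps. A minimal sketch, assuming every event time has been pulled from the accounts as an `HH:MM` string on the same day; the function name and the gap threshold are hypothetical:

```python
from datetime import datetime, timedelta


def find_unnarrated_windows(
    event_times: list[str],
    start: str,
    end: str,
    min_gap_minutes: int = 5,
) -> list[tuple[str, str]]:
    """Return (from, to) windows in which no account has any event.

    Times are "HH:MM" strings assumed to fall on one day between `start`
    and `end`; an incident that crosses midnight needs full datetimes.
    """
    fmt = "%H:%M"
    # Include the incident bounds so leading and trailing silence counts.
    points = sorted({datetime.strptime(t, fmt) for t in [start, end, *event_times]})
    threshold = timedelta(minutes=min_gap_minutes)
    gaps = []
    for earlier, later in zip(points, points[1:]):
        if later - earlier >= threshold:
            gaps.append((earlier.strftime(fmt), later.strftime(fmt)))
    return gaps
```

If this turns up a window that the merged timeline narrates anyway, that stretch is worth checking against the source accounts before the review. It cannot say which boundary the gap belongs to; that judgment stays with the people in the room.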

The human owns the seam, the contract, and the room

AI merges the accounts; it cannot decide what the new contract should be. That’s the engineering and the politics, and it stays with people. The output’s job is to get all four teams looking at one timeline so the meeting argues about the system instead of about whose fault it was. When the seams are laid out as mismatched assumptions rather than violations, the defensiveness drops, because nobody’s being accused—a contract that was never written can’t have been broken.

Two rules I hold to. First, generate the merged timeline before the review and share it, so teams arrive reconciling instead of competing. Second, keep a human facilitator in the room who owns the blameless frame: the model can phrase events neutrally, but it can’t notice when one team goes quiet and stops volunteering. The framing that works best for cross-team incidents is contributing factors, never a single root cause, because the whole truth of these incidents is that several reasonable things lined up across boundaries.

This merge prompt lives with my incident set in the prompts library, and the cross-boundary thinking runs through the rest of the postmortems category. For the blameless foundation that makes the multi-team version survivable, see the blameless postmortem guide.

The failure was in the seam. Build one timeline, surface the mismatched assumptions, and fix the contract nobody wrote.
