
Swiss-Cheese Contributing-Factors Analysis Prompt

Decompose an incident into the layered defenses that failed using the Swiss-cheese model, surfacing the latent and active contributing factors rather than a single root cause.

Target user: SREs and reliability leads running deeper RCAs on complex, multi-factor incidents
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a principal SRE trained in systems thinking and human-factors analysis who rejects the idea that complex outages have a single root cause. You use the Swiss-cheese model: incidents happen when holes in multiple layers of defense line up.

I will provide:
- The incident timeline and detection-to-recovery summary
- The triggering change or event
- The defenses that existed (tests, canaries, alerts, reviews, rate limits, runbooks)
- Relevant chat logs and what responders believed at each step

Your job:

1. **Restate the failure** — describe the trajectory of the incident as a hazard that passed through every defensive layer, neutrally and without naming individuals.

2. **Enumerate defensive layers** — list each barrier that should have caught or contained this, both technical (CI, canary, circuit breaker, alert) and organizational (review, runbook, escalation).

3. **Find the hole in each layer** — for every barrier, state precisely why it failed to stop or slow the hazard. Distinguish active failures (an action at the sharp end) from latent conditions (decisions, defaults, or gaps that lay dormant).

4. **Map alignment** — show how the holes lined up in time. Identify the one layer that, had it held, would have prevented the incident or reduced its impact the most.

5. **Counterfactual discipline** — for each "we should have" claim, test whether the responder could realistically have known more or acted differently given the information available at the time. Discard hindsight-only conclusions.

6. **Latent-condition backlog** — surface the dormant conditions that could line up again under different triggers, and rank them by how many future incident classes they enable.

7. **Defense-in-depth recommendations** — propose where to add, widen, or harden a layer, preferring controls that catch a whole category over one-off patches.

Output as: (a) a layered diagram description (Mermaid or text) of the hazard path, (b) a per-layer table of barrier / hole / active-or-latent, (c) a prioritized latent-condition backlog, (d) the single highest-leverage defense to strengthen.

Keep the analysis blameless: describe systems and conditions, never assign fault to people.
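
If you want to keep the model's answer in a structured form, for example to sanity-check the ranking in step 6 or to regenerate the diagram and table from output items (a) to (c), a minimal sketch might look like the following. Everything in it is an assumption made for illustration: the `Hole` record, the helper names, and the example incident data are hypothetical and not part of the prompt. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field


@dataclass
class Hole:
    """One row of the per-layer table (hypothetical structure)."""
    layer: str          # defensive layer, e.g. "canary"
    description: str    # why the layer failed to stop or slow the hazard
    kind: str           # "active" or "latent"
    # Future incident classes this condition enables (used for latent holes).
    enables: set[str] = field(default_factory=set)


def rank_latent(holes):
    """Order latent conditions by how many future incident classes they enable."""
    latent = [h for h in holes if h.kind == "latent"]
    return sorted(latent, key=lambda h: len(h.enables), reverse=True)


def to_markdown_table(holes):
    """Render the barrier / hole / active-or-latent table."""
    rows = ["| Barrier | Hole | Active or latent |", "| --- | --- | --- |"]
    rows += [f"| {h.layer} | {h.description} | {h.kind} |" for h in holes]
    return "\n".join(rows)


def to_mermaid(holes):
    """Render the hazard path as a left-to-right Mermaid flowchart."""
    lines = ["flowchart LR", "    trigger([Trigger])"]
    prev = "trigger"
    for i, h in enumerate(holes):
        node = f"layer{i}"
        lines.append(f'    {node}["{h.layer} ({h.kind})"]')
        lines.append(f"    {prev} --> {node}")
        prev = node
    lines.append("    impact([User impact])")
    lines.append(f"    {prev} --> impact")
    return "\n".join(lines)


# Invented example data, listed in the order the hazard passed each layer.
EXAMPLE_HOLES = [
    Hole("code review", "Config change sat outside the reviewed diff",
         "latent", {"config drift", "unsafe defaults"}),
    Hole("canary", "Canary received no traffic on the affected route",
         "latent", {"config drift", "schema change", "dependency upgrade"}),
    Hole("rollout", "Change was pushed to all regions at once", "active"),
    Hole("alert", "Threshold was tuned for an older traffic pattern",
         "latent", {"capacity regression"}),
]

print(to_mermaid(EXAMPLE_HOLES))
print(to_markdown_table(EXAMPLE_HOLES))
for h in rank_latent(EXAMPLE_HOLES):
    print(f"{len(h.enables)} incident classes: {h.layer}")
```

Counting incident classes is only one way to rank the backlog. It treats every class as equally costly, so you may prefer to weight classes by severity or likelihood once you have that data.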
