Recurring Pattern Mining Across Postmortems Prompt
Analyze a corpus of past postmortems to surface systemic, recurring failure patterns — the same root cause wearing different hats — and recommend the few structural fixes that would prevent whole classes of incidents.
- Target user: Reliability leads and SRE managers running quarterly incident reviews
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a reliability analyst who reads dozens of postmortems and sees the shape behind them: the same systemic weakness that keeps producing differently-named incidents. Help me mine a corpus of postmortems for the patterns that matter.

I will paste/attach a set of postmortems (or their summaries, timelines, root causes, and action items).

Your job:

1. **Normalize first**: extract a structured record per incident: trigger, contributing factors, root cause category, detection source, time-to-detect, time-to-mitigate, and the action items (and whether they were completed).
2. **Cluster by true cause, not symptom**: group incidents by underlying mechanism (e.g., "unbounded retry storm", "missing backpressure", "config change with no canary", "single point of failure in auth"). Two incidents with different services but the same mechanism belong together.
3. **Quantify each cluster**: count, total downtime, customer impact, and trend over time (is this pattern getting worse?). Rank clusters by aggregate pain, not frequency alone.
4. **Find the meta-patterns**: recurring weaknesses across clusters: detection always coming from customers, action items that never shipped, the same service appearing repeatedly, deploys clustering before incidents.
5. **Audit action-item follow-through**: what fraction of prior action items were completed? Which incidents would have been prevented if a prior, never-shipped action item had landed? Name them.
6. **Recommend structural fixes**: for the top 3 clusters, propose the one architectural or process change that neutralizes the whole class, not a per-incident patch. Estimate the blast-radius reduction.
7. **Flag what you can't conclude**: if the corpus is too small or biased toward one team, say so rather than over-generalizing.

Output: (a) a normalized incident table, (b) a ranked cluster summary with counts and downtime, (c) a meta-pattern list, (d) an action-item follow-through scorecard, (e) the top 3 structural recommendations with expected impact.

Bias toward: causes over symptoms, structural fixes over patches, and honesty about sample size.
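
Outside the prompt itself, it can help to spot-check the model's arithmetic on the cluster ranking and the follow-through scorecard. The following is a minimal sketch, assuming the normalized incident table has been transcribed into simple records. The `Incident` and `ActionItem` types, their field names, and the two helper functions are hypothetical and not part of the prompt; total downtime is used here as a stand-in for "aggregate pain", which a real review would likely combine with customer impact.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str
    completed: bool


@dataclass
class Incident:
    # Hypothetical field names; adapt them to your own postmortem template.
    incident_id: str
    mechanism: str          # underlying cause, e.g. "unbounded retry storm"
    detection_source: str   # e.g. "alert" or "customer"
    downtime_minutes: int
    action_items: list[ActionItem] = field(default_factory=list)


def rank_clusters(incidents):
    """Group incidents by mechanism and rank clusters by total downtime."""
    totals = defaultdict(lambda: {"count": 0, "downtime_minutes": 0})
    for inc in incidents:
        totals[inc.mechanism]["count"] += 1
        totals[inc.mechanism]["downtime_minutes"] += inc.downtime_minutes
    # Sort by aggregate downtime (descending), not by incident count.
    return sorted(
        totals.items(),
        key=lambda kv: kv[1]["downtime_minutes"],
        reverse=True,
    )


def follow_through_rate(incidents):
    """Fraction of action items marked completed, or None if there are none."""
    items = [item for inc in incidents for item in inc.action_items]
    if not items:
        return None
    return sum(item.completed for item in items) / len(items)
```

Ranking on summed downtime rather than count mirrors step 3 of the prompt: a cluster with one long outage can outrank a cluster with several short ones.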