AI for Incident Response
Faster RCAs, postmortems, runbooks, and on-call workflows powered by AI.
Prompts
- Intermediate
Customer-Facing Incident Comms Writer Prompt
Draft honest, empathetic external incident communications — status-page posts and customer notices across the incident lifecycle — that acknowledge impact without over-promising, leaking internals, or admitting unverified fault, for a human and legal/comms to approve before publishing.
- Claude
- ChatGPT
Open prompt
- Intermediate
Firing Alert Severity & Escalation Decision Prompt
Given a firing alert and current impact signals, decide an appropriate severity level and whether to escalate or page additional responders, with explicit reasoning against your severity rubric — leaving the final call to a human.
- Claude
- ChatGPT
Open prompt
- Intermediate
First-Alert Triage & Hypothesis Ranking Prompt
Take a freshly fired alert plus a snapshot of metrics, logs, and recent changes, and produce a ranked list of failure hypotheses with the cheapest next diagnostic step for each — without taking any action on the system.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Status Update for Stakeholders Prompt
Turn the current state of an active incident into clear, honest internal status updates tailored to leadership, support, and engineering audiences, with a consistent cadence and no over-promising — drafts only, for a human to review and send.
- Claude
- ChatGPT
Open prompt
- Intermediate
Log-Driven Incident Timeline Builder Prompt
Reconstruct a precise, normalized incident timeline from scattered logs, alert timestamps, deploy events, and chat messages — reconciling time zones and ordering correlated-but-not-causal events without inventing entries.
- Claude
- ChatGPT
Open prompt
- Beginner
On-Call Shift Handoff Summary Builder Prompt
Compile a complete, skimmable on-call handoff from open incidents, recent alerts, ongoing mitigations, and watch items so the incoming engineer has full context — preserving every open thread and explicit owner without dropping risk.
- Claude
- ChatGPT
Open prompt
- Intermediate
Post-Incident Follow-Up Action Items Extractor Prompt
Convert a postmortem or RCA into a prioritized, deduplicated set of SMART follow-up action items — each tied to the contributing factor it addresses, with an owner role, effort estimate, and a guardrail against busywork that doesn't reduce recurrence risk.
- Claude
- ChatGPT
Open prompt
- Advanced
Structured RCA & Causal Chain Builder Prompt
Run a rigorous, blameless root-cause analysis from an incident timeline and evidence — distinguishing trigger, proximate, and systemic contributing factors, testing each causal link, and surfacing the conditions that let the failure reach production.
- Claude
- ChatGPT
Open prompt
- Advanced
Targeted Rollback Plan Generator Prompt
Produce a safe, ordered rollback plan for a suspect change during an incident — with preconditions, verification gates, data/migration risks, and an abort path — as a reviewable runbook a human executes, never auto-applied.
- Claude
- ChatGPT
Open prompt
- Advanced
Cache Stampede and Thundering-Herd Mitigation Prompt
Diagnose a live incident where a cache miss, flush, or restart is hammering the origin with a thundering herd, and pick the fastest safe mitigation to protect the backend without dropping all traffic.
- Claude
- ChatGPT
Open prompt
- Intermediate
Cloud API Quota and Throttling Incident Triage Prompt
Triage a live incident caused by hitting a cloud-provider API rate limit or service quota, and decide whether to back off, request a quota increase, or shed the work driving the throttling.
- Claude
- ChatGPT
Open prompt
- Advanced
Database Failover and Replication-Lag Decision Prompt
Decide during a live database incident whether to promote a replica, wait for the primary to recover, or hold — weighing replication lag, data-loss risk, and split-brain before you pull the trigger.
- Claude
- ChatGPT
Open prompt
- Intermediate
DNS Resolution Failure Live Diagnosis Prompt
Walk on-call through diagnosing a live DNS-related outage — resolver, authoritative, caching, and propagation layers — to find where name resolution is actually breaking before you start changing records.
- Claude
- ChatGPT
Open prompt
- Advanced
Emergency Load-Shedding and Rate-Limit Config Prompt
Design an emergency load-shedding or rate-limit change during an overload incident that protects the core service by dropping the least-valuable traffic first — with a clear rollback.
- Claude
- ChatGPT
Open prompt
- Intermediate
Expired TLS Certificate Incident Triage Prompt
Triage a live outage caused by an expired or mis-issued TLS certificate — identify every affected endpoint, the renewal path, and a safe emergency reissue plan without breaking pinning or chains.
- Claude
- ChatGPT
Open prompt
- Beginner
Is-This-Real Page Triage Prompt
Help a freshly paged on-call engineer decide in the first two minutes whether an alert is a real incident worth waking people for, a transient blip, or pure noise — before they over- or under-react.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Alert-to-Owning-Team Router Prompt
Take a freshly fired alert and route it to the team that actually owns the failing component, so the right responder is paged first instead of bouncing through three on-call rotations.
- Claude
- ChatGPT
Open prompt
- Beginner
Internal Tooling Outage Employee Comms Prompt
Draft clear, calm communications for an incident that only affects internal staff — CI/CD, VPN, SSO, deploy pipelines, internal dashboards — where the audience is coworkers, not customers.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Conference Bridge Noise Control Prompt
Restore signal on a chaotic major-incident voice or video bridge where too many people are talking over each other and decisions are being lost.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Data Integrity Verification After Recovery Prompt
When an outage may have corrupted or skipped writes, verify that data is correct and consistent after the service is restored and before declaring the incident resolved.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Degraded-Mode Customer Tradeoff Prompt
Decide which features to intentionally shed to keep core service alive during an incident, and frame that tradeoff for customers and the business.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Deputy Commander Load-Sharing Prompt
Split incident command duties across a deputy, scribe, and comms lead when a single commander is overloaded on a large or fast-moving incident.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Go/No-Go Mitigation Decision Prompt
Run a fast, structured go/no-go check before executing a risky mitigation during a live incident, when the fix itself could make things worse.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Mid-Incident Scope Creep Control Prompt
Stop an active incident from sprawling into parallel investigations and opportunistic fixes that dilute the team and extend the outage.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident On-Call Fatigue Handoff During Prolonged Incidents Prompt
Manage responder fatigue and rotate the team safely during a multi-hour or multi-day incident so exhausted people do not make catastrophic mistakes.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Stand-Down and All-Clear Criteria Prompt
Decide whether an incident is resolved enough to declare all-clear and stand down responders, rather than prematurely closing it while the system is still fragile.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Third-Party Status Triage Prompt
Determine during an active incident whether a third-party or SaaS provider degradation is actually your root cause or a red herring distracting the team.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Comms Approval and Sign-Off Workflow Prompt
Design an approval workflow for incident communications that prevents unvetted external messaging without slowing the response to a crawl.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident First-Responder Quickstart Card Prompt
Generate a one-page quickstart card that walks a freshly paged first responder through the first ten minutes of an incident.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Glossary and Terminology Standardization Prompt
Build a shared incident-response glossary so severity labels, roles, and status terms mean the same thing across every team.
- Claude
- ChatGPT
Open prompt
- Advanced
Follow-the-Sun On-Call Overlap Coverage Design Prompt
Design follow-the-sun on-call coverage with deliberate overlap windows so incidents never fall into a handoff gap across time zones.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Severity Misclassification Audit Prompt
Audit closed incidents to find where severity was over- or under-assigned and tighten the classification process.
- Claude
- ChatGPT
Open prompt
- Advanced
Vendor SLA Accountability Review Prompt
Review a vendor-caused or vendor-involved incident to determine whether the SLA was breached, what remedies are owed, and which corrective commitments to demand.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident War-Room Situation Board Design Prompt
Design a single-screen situation board that gives a war room shared awareness of an active incident at a glance.
- Claude
- ChatGPT
Open prompt
- Advanced
On-Call Compensation and Pay Policy Review Prompt
Review an on-call compensation policy for fairness, legal exposure, and alignment with actual paging load before rolling it out.
- Claude
- ChatGPT
Open prompt
- Beginner
Customer Incident Comms Tone and Empathy Review Prompt
Review a customer-facing incident update for tone, empathy, accuracy, and over-promising before it is published.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Decision Log Rationale Capture Prompt
Turn a noisy incident channel into a structured decision log that records what was decided, by whom, and why.
- Claude
- ChatGPT
Open prompt
- Intermediate
Live Incident Evidence Preservation Checklist Prompt
Generate a checklist for capturing volatile diagnostic evidence during a live incident before it is lost to restarts or log rotation.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Open-Loops and Follow-Up Tracker Prompt
Track unresolved questions, deferred tasks, and loose ends during a long-running incident so nothing is dropped at handoff or closure.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Merge and Deduplication Triage Prompt
Decide whether several open incidents or alerts share a root cause and should be merged into one major incident.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Pre-Mortem Failure Mode Brainstorm Prompt
Run a structured pre-mortem before a risky launch or migration to surface failure modes and pre-stage mitigations.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident War-Game Injects and Curveball Designer Prompt
Design mid-exercise injects and curveballs that test how a team adapts when an incident scenario evolves under pressure.
- Claude
- ChatGPT
Open prompt
- Advanced
Runbook Prerequisite and Access Audit Prompt
Audit a runbook for hidden prerequisites, missing permissions, and access dependencies that would block a responder mid-incident.
- Claude
- ChatGPT
Open prompt
- Beginner
Alert-Storm Correlation and Triage Prompt
Cut through a flood of simultaneous alerts during an incident to find the originating signal, separate symptoms from causes, and tell on-call which single alert actually matters.
- Claude
- ChatGPT
Open prompt
- Advanced
Error-Budget Policy Enforcement Review Prompt
Design and pressure-test an error-budget policy that actually changes behavior by defining what happens when the budget is exhausted, who decides, and how feature work yields to reliability work.
- Claude
- ChatGPT
Open prompt
- Intermediate
Escalation Policy Gap and Single-Point-of-Failure Analysis Prompt
Audit your existing escalation policies and on-call schedules to find coverage gaps, dead-ends, and single points of failure where a page could go unanswered during a real incident.
- Claude
- ChatGPT
Open prompt
- Advanced
Game-Day Hypothesis and Abort-Criteria Design Prompt
Structure a chaos game-day around a falsifiable steady-state hypothesis with explicit blast-radius limits and abort conditions, so you learn from controlled failure without causing a real outage.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Commander Training Simulator Prompt
Run a branching, text-based incident simulation that puts a trainee incident commander through a realistic SEV with injects, forcing real decisions on delegation, comms, and escalation while you grade them.
- Claude
- ChatGPT
Open prompt
- Intermediate
On-Call Runbook Authoring Standard Prompt
Define a house style and quality bar for writing operational runbooks so every page links to a clear, copy-pasteable, low-ambiguity procedure an exhausted on-call can follow at 3 a.m.
- Claude
- ChatGPT
Open prompt
- Beginner
SEV Downgrade and Incident Closure Criteria Prompt
Build objective, signal-based criteria for when an active incident can be downgraded in severity and formally closed, so incidents end on evidence rather than optimism or fatigue.
- Claude
- ChatGPT
Open prompt
- Intermediate
War-Room Scribe and Live Timeline Capture Prompt
Act as a dedicated incident scribe that turns the chaotic war-room chat into a clean, timestamped decision-and-action log in real time, freeing the IC to command instead of taking notes.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Chat-Log Auto-Summarizer Prompt
Turn a raw, messy incident Slack/Teams channel transcript into a structured, timestamped summary — decisions, owners, mitigations, and open questions — ready to paste into a postmortem draft.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Drill Scoring Rubric Prompt
Build an objective scoring rubric to evaluate how a team performs during an incident drill or fire drill — detection, coordination, communication, and recovery — so you can track readiness improvement over time instead of relying on gut feel.
- Claude
- ChatGPT
Open prompt
- Advanced
Live Incident Hypothesis Tracker Prompt
Keep a live incident's debugging organized — track every hypothesis, the evidence for and against it, what's been ruled out, and the next highest-value experiment — so the team converges on the cause instead of chasing in circles.
- Claude
- ChatGPT
Open prompt
- Beginner
On-Call Shadow and Mentorship Program Prompt
Design a structured shadow-and-reverse-shadow program that ramps new engineers onto the on-call rotation safely — with milestones, sign-off criteria, and mentor responsibilities — so nobody carries the pager unprepared.
- Claude
- ChatGPT
Open prompt
- Intermediate
PagerDuty Escalation Policy Config Generator Prompt
Translate your team's on-call intent into concrete PagerDuty (or Opsgenie) configuration — services, escalation policies, schedules, urgency rules, and event-rule routing — as ready-to-apply config with the rationale spelled out.
- Claude
- ChatGPT
Open prompt
- Advanced
Recovery Smoke-Test Suite Generator Prompt
Generate a fast, scriptable smoke-test suite that proves a service is genuinely healthy after a mitigation or restart — covering critical user journeys, data integrity, and downstream dependencies — before you declare an incident resolved.
- Claude
- ChatGPT
Open prompt
- Intermediate
Runbook Dry-Run Validation Prompt
Stress-test a runbook before you trust it in a real incident — walk each step for ambiguity, missing preconditions, dangerous commands, and dead ends — so it actually works at 3am under pressure.
- Claude
- ChatGPT
Open prompt
- Intermediate
SLA Breach and Service-Credit Calculator Prompt
Compute customer-facing SLA impact after an incident — downtime windows, affected tenants, breached commitments, and owed service credits — and draft a defensible, contract-aligned breakdown.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Acknowledgment SLA Compliance Audit Prompt
Audit how reliably your on-call program meets page-acknowledgment and first-response SLAs, find where the clock is slipping, and design enforceable targets per severity.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Detection Source Effectiveness Review Prompt
Analyze where your incidents were first detected — alert, dashboard, synthetic, or angry customer — to measure how proactive your detection really is and shift more incidents to catch-it-first signals.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Recovery Verification Checklist Prompt
Build a rigorous all-clear checklist so an incident is declared resolved only after recovery is verified end-to-end — not just when the obvious symptom disappears.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Stakeholder Communication Map Prompt
Build a stakeholder map for major incidents so the right people are informed at the right depth, with clear owners, channels, and triggers — before the incident, not during the scramble.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Tooling Consolidation Audit Prompt
Audit the sprawl of paging, chat, ticketing, status-page, and runbook tools used during incidents, then design a consolidated, integrated toolchain that removes friction and context-switching.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Update Cadence Planner Prompt
Design a severity-driven update cadence for active incidents so stakeholders get predictable, right-sized updates without the incident commander improvising timing under pressure.
- Claude
- ChatGPT
Open prompt
- Advanced
Observability Gap Analysis From Incidents Prompt
Mine recent incidents to find where missing logs, metrics, or traces slowed detection and diagnosis, then prioritize the observability investments that would have shortened them most.
- Claude
- ChatGPT
Open prompt
- Intermediate
Runbook Freshness and Decay Audit Prompt
Audit your runbook library for stale, broken, and untrusted procedures, then design a freshness program so on-call engineers can rely on runbooks instead of working around them.
- Claude
- ChatGPT
Open prompt
- Advanced
Disaster Recovery Gameday and RTO Validation Prompt
Design a disaster-recovery gameday that actually validates your RTO/RPO by restoring from backups and failing over for real — instead of the tabletop fiction that backups 'probably' work.
- Claude
- ChatGPT
Open prompt
- Intermediate
Feature-Flag Kill-Switch and Fast-Mitigation Design Prompt
Design the feature flags and kill switches that let you mitigate an incident in seconds without a deploy — and audit your existing flags for the ones that will fail you the moment you need them.
- Claude
- ChatGPT
Open prompt
- Intermediate
In-Incident Severity Re-Evaluation Prompt
Mid-incident, decide whether to upgrade or downgrade the severity as new facts arrive — so you neither under-respond to a quietly growing outage nor keep executives paged on a resolved blip.
- Claude
- ChatGPT
Open prompt
- Intermediate
Live Incident Log and Telemetry Correlation Assistant Prompt
Pull a coherent narrative out of scattered logs, metrics, traces, and deploy events during an active incident — surface the likely trigger and the smallest set of signals worth chasing first.
- Claude
- ChatGPT
Open prompt
- Advanced
Multi-Region Failover Decision Playbook Prompt
Build a pre-decided playbook for whether and when to fail traffic to another region during an incident — including the cutover steps, the data-consistency traps, and the criteria for failing back.
- Claude
- ChatGPT
Open prompt
- Intermediate
Near-Miss and Close-Call Capture Program Prompt
Design a lightweight program to capture the incidents that almost happened — the silent saves, the caught-in-staging bugs, the lucky timing — and turn them into reliability signal before they become outages.
- Claude
- ChatGPT
Open prompt
- Intermediate
On-Call Schedule Fairness and Coverage Optimizer Prompt
Audit and redesign an on-call rotation so coverage is reliable and the load is distributed fairly — accounting for time zones, page volume, seniority, and the people quietly carrying more than their share.
- Claude
- ChatGPT
Open prompt
- Advanced
Post-Incident SLO and Error-Budget Recalibration Prompt
After a major incident, decide whether your SLO targets, error-budget windows, and burn-rate alerts still reflect reality — or whether the incident exposed targets that are wrong, dishonest, or unmeasurable.
- Claude
- ChatGPT
Open prompt
- Advanced
Regulatory and Contractual Breach Notification Drafting Prompt
During or after an incident with data-exposure or availability implications, draft the time-bound notifications you owe to regulators and contractual customers — accurately, defensibly, and without over-committing.
- Claude
- ChatGPT
Open prompt
- Intermediate
ChatOps Incident Automation Bot Workflow Prompt
Design an incident-management ChatOps bot that spins up the channel, pages the right people, tracks state, posts the timeline, and drives the incident lifecycle from declare to resolve — so responders coordinate in chat instead of fighting tooling.
- Claude
- ChatGPT
Open prompt
- Advanced
Error Budget Burn-Rate Alert Design Prompt
Design multi-window, multi-burn-rate SLO alerts that page only when the error budget is actually in danger — fast pages for catastrophic burn, tickets for slow leaks — eliminating both flapping and silent budget exhaustion.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Financial Impact Quantification Prompt
Turn an incident's duration and blast radius into a defensible dollar figure — lost revenue, SLA credits, engineering time, and reputational drag — so leadership can prioritize reliability investment.
- Claude
- ChatGPT
Open prompt
- Advanced
Noisy-Neighbor and Resource Contention Diagnosis Prompt
Diagnose incidents where a service degrades not from its own bug but from resource contention — a noisy neighbor, CPU/IO/connection-pool exhaustion, or a shared-tenancy hotspot starving everyone else on the node or cluster.
- Claude
- ChatGPT
Open prompt
- Intermediate
Third-Party and Vendor Coordination During a Major Incident Prompt
Run the playbook for incidents caused by or dependent on an external vendor (cloud provider, CDN, payment processor, SaaS dependency) — escalation, status correlation, customer comms, and parallel mitigation while you wait on someone else's fix.
- Claude
- ChatGPT
Open prompt
- Intermediate
Runbook-to-Automation Toil Reduction Prompt
Turn a manual on-call runbook into safe, progressively-automated remediation — identifying which steps to auto-run, which to keep human-gated, and how to ship self-healing without building a system that confidently breaks production.
- Claude
- ChatGPT
Open prompt
- Intermediate
Post-Incident Customer Trust Recovery Plan Prompt
Plan the days-after-the-incident customer recovery: the public post-incident report, proactive outreach to the most-affected accounts, credit decisions, and the credibility-rebuilding follow-through that turns a bad outage into retained trust.
- Claude
- ChatGPT
Open prompt
- Intermediate
Synthetic Monitoring for Faster Incident Detection Prompt
Design synthetic checks and journey probes that catch incidents before customers report them — closing the gap between failure and detection (the 'time-to-detect' phase of MTTR).
- Claude
- ChatGPT
Open prompt
- Intermediate
Alert Fatigue and Pager Noise Reduction Audit Prompt
Audit your firing alerts to find the noisy, non-actionable, and duplicate pages that erode on-call trust — then cut, tune, or route them so every page that survives demands human action.
- Claude
- ChatGPT
Open prompt
- Advanced
Graceful Degradation and Degraded-Mode Playbook Prompt
Design degraded-mode playbooks that keep core functionality alive when a dependency fails — feature flags to shed, fallbacks to serve, and explicit triggers for entering and exiting reduced service.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Commander Handoff for Long-Running Incidents Prompt
Build a clean IC-to-IC handoff for multi-hour or overnight incidents so context, decisions, and open threads transfer without dropping the ball or re-litigating settled calls.
- Claude
- ChatGPT
Open prompt
- Advanced
Incident Response Maturity Readiness Audit Prompt
Assess your incident-response program against a maturity model across detection, response, comms, learning, and tooling — then get a prioritized 90-day improvement roadmap.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Response On-Call Onboarding Curriculum Prompt
Design a structured, week-by-week onboarding curriculum that takes a new engineer from zero to confident shadow-to-primary on-call, with shadowing milestones, reading lists, and a sign-off checklist.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Tabletop Exercise Script Builder Prompt
Generate a discussion-based tabletop exercise — injects, facilitator script, decision points, and a hotwash debrief — to stress-test your incident process without touching production.
- Claude
- ChatGPT
Open prompt
- Intermediate
Live Incident Executive Briefing Generator Prompt
Turn the noisy incident channel into a crisp, recurring executive briefing — current impact, what we know, what we're doing, and the next update time — without leaking raw engineering chatter to leadership.
- Claude
- ChatGPT
Open prompt
- Intermediate
Similar Past Incidents Finder Prompt
During or after an incident, mine your postmortem archive for prior incidents with the same fingerprint — symptoms, service, root cause family — so you reuse known mitigations instead of rediscovering them.
- Claude
- ChatGPT
Open prompt
- Intermediate
Alert Triage Decision-Tree Builder Prompt
Turn a noisy alert stream into a deterministic, branching triage decision tree that any on-call engineer can follow to classify, route, and act on alerts in under a minute.
- Claude
- ChatGPT
Open prompt
- Advanced
Corrective Action Remediation Prioritization Prompt
Turn a messy list of post-incident action items into a prioritized, sequenced remediation plan that balances risk reduction against engineering cost and prevents the same failure from recurring.
- Claude
- ChatGPT
Open prompt
- Intermediate
Customer Impact Assessment During an Active Incident Prompt
Rapidly quantify who and what is affected during a live incident — segments, transactions, revenue, and SLA exposure — so severity, comms, and prioritization are grounded in real blast radius rather than gut feel.
- Claude
- ChatGPT
Open prompt
- Advanced
Data-Loss and Data-Corruption Incident Runbook Prompt
Produce a careful, step-by-step runbook for handling a live data-loss or data-corruption incident — stopping the bleeding, preserving evidence, validating backups, and recovering without amplifying the damage.
- Claude
- ChatGPT
Open prompt
- Advanced
Error Budget Policy and SLO Response Prompt
Design an error-budget policy and a tiered SLO-breach response after a service suffers repeated incidents — define burn-rate triggers, freeze rules, and the escalation path that converts budget burn into action.
- Claude
- ChatGPT
Open prompt
- Advanced
Escalation Matrix and On-Call Policy Builder Prompt
Design an escalation matrix and on-call escalation policy that routes incidents to the right responder at the right time, with sane timeouts, fallbacks, and severity-based skip-levels so nothing dies unacknowledged at 3am.
- Claude
- ChatGPT
Open prompt
- Advanced
Five Whys and Causal Tree Analysis Prompt
Drive a disciplined contributing-factors analysis using 5 Whys and causal trees that resists single-root-cause oversimplification and exposes the multiple interacting factors behind a failure.
- Claude
- ChatGPT
Open prompt
- Advanced
GameDay Chaos Scenario Design Prompt
Design a safe, hypothesis-driven GameDay or chaos-engineering exercise grounded in your real incident history — with steady-state metrics, fault injections, blast-radius limits, abort criteria, and learning goals.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Action Item Tracking Prompt
Turn postmortem findings into tracked, accountable action items that actually get done — with clear owners, acceptance criteria, prioritization, and a cadence that closes the loop instead of letting them rot.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Bridge Facilitation Script Prompt
Generate a facilitation script that keeps a live incident bridge (conference call) focused, time-boxed, and productive — controlling cross-talk, driving updates, and capturing decisions in real time.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Metrics Trend Analysis Prompt
Analyze a portfolio of past incidents to surface MTTR, MTTD, and frequency trends, segment by service and cause, and recommend the highest-leverage interventions to bend the curves.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Severity Classification Rubric Prompt
Design a clear, defensible SEV classification rubric that on-call engineers can apply in seconds under pressure — with crisp boundaries, escalation triggers, and downgrade rules.
- Claude
- ChatGPT
Open prompt
- Beginner
Incident Status Page Communications Prompt
Draft clear, honest, and consistent status-page updates and customer comms across the lifecycle of an incident — from investigating to resolved — without over-promising or leaking internals.
- Claude
- ChatGPT
Open prompt
- Intermediate
Incident Timeline Reconstruction Prompt
Reconstruct an accurate, evidence-backed incident timeline from scattered logs, deploys, pages, and chat — disambiguating timezones and correlating cause with effect for the postmortem.
- Claude
- ChatGPT
Open prompt
- Intermediate
Multi-Audience Incident Comms Templates Prompt
Produce a coordinated set of incident communication templates tuned for three distinct audiences — internal responders, executives, and customers — so one source of truth fans out without contradicting itself.
- Claude
- ChatGPT
Open prompt
- Beginner
On-Call Handoff Workflow Design Prompt
Design a crisp on-call shift handoff that transfers context, open incidents, and watch-items without dropping the ball — plus a sustainable rotation that fights burnout.
- Claude
- ChatGPT
Open prompt
- Intermediate
On-Call Health and Burnout Review Prompt
Assess on-call load and burnout risk from incident and paging data, identify the noisiest sources and most-burdened engineers, and recommend concrete changes to make the rotation humane and sustainable.
- Claude
- ChatGPT
Open prompt
- Intermediate
Operational Runbook Generator Prompt
Turn tribal knowledge into a battle-tested operational runbook that a first-time responder can execute safely at 3am — with verification steps, rollback paths, and escalation off-ramps.
- Claude
- ChatGPT
Open prompt
- Advanced
Paging Policy and Escalation Tuning Prompt
Audit and redesign PagerDuty/Opsgenie escalation policies to cut needless 3am pages while guaranteeing real incidents always reach a human fast — balancing reliability against on-call health.
- Claude
- ChatGPT
Open prompt
- Intermediate
Rollback vs Fix-Forward Decision Framework Prompt
Build a fast, defensible decision framework for the highest-pressure call in an incident — roll back or fix forward — weighing reversibility, blast radius, data implications, and confidence so the IC decides in minutes, not by debate.
- Claude
- ChatGPT
Open prompt
- Intermediate
Runbook Gap Analysis From Incidents Prompt
Mine past incidents to find where responders lacked a runbook, where existing runbooks failed, and produce a prioritized list of runbooks to write or fix — with the specific steps each one needs.
- Claude
- ChatGPT
Open prompt
- Advanced
Security Breach Incident-Response Runbook Prompt
Generate a security-breach response runbook structured around containment, eradication, and recovery — with evidence preservation, scoped isolation, and legal/notification gates so a breach is handled without destroying forensics or tipping off the attacker.
- Claude
- ChatGPT
Open prompt - Advanced
Service Dependency and Blast Radius Mapping Prompt
Map a service's upstream and downstream dependencies, identify single points of failure and shared-fate risks, and estimate the blast radius of each failure so the team can prioritize resilience work.
- Claude
- ChatGPT
Open prompt - Advanced
SEV1 Incident-Commander Live Playbook Prompt
Generate a minute-by-minute live playbook the incident commander runs during an active SEV1 — from declaration through stabilization — keeping coordination, comms, and decisions on rails under pressure.
- Claude
- ChatGPT
Open prompt - Intermediate
War-Room Roles and Responsibilities (ICS) Prompt
Define a clear ICS-style role assignment for an incident war room — incident commander, ops lead, comms, scribe, liaison — with explicit responsibilities, handoffs, and span-of-control so nobody freelances during a major incident.
- Claude
- ChatGPT
Open prompt - Intermediate
Kubernetes Pod Crash Diagnosis Prompt
Diagnose CrashLoopBackOff, OOMKilled, ImagePullBackOff, and stuck pods from kubectl output.
- Claude
- ChatGPT
- Cursor
Open prompt
Guides
- · 10 min read
AI Alert Enrichment at Page Time: Context Before You Even Open the Laptop
Use AI to enrich an alert the moment it fires — recent deploys, related signals, owning team, and likely cause — so on-call starts triage with context instead of a cold page.
Read guide - · 11 min read
Catching the Silent Degradation Your Monitoring Misses
The worst incidents are the ones nothing pages on. How to detect slow, quiet degradation — partial failures, data quality drift, and creeping latency — before customers find it first.
Read guide - · 10 min read
When the Cloud Throttles You: Diagnosing Quota and Rate-Limit Incidents
Triage live cloud-provider throttling incidents — tell rate limits from hard quotas, stop the retries that deepen them, and recover without staking everything on a support ticket.
Read guide - · 11 min read
Connection Pool Exhaustion: The Incident That Looks Like Everything Else
Diagnose and mitigate live connection-pool exhaustion incidents — the misleading symptoms, the real causes, and the fastest safe fixes that don't just move the bottleneck.
Read guide - · 10 min read
Coordinating an Incident Across Vendor Support Tickets Without Losing the Thread
When your outage depends on a vendor's fix, the support ticket becomes part of your incident. How to drive vendor escalation, track the dependency, and keep the bridge honest.
Read guide - · 11 min read
Diagnosing DNS Incidents: When It Really Is Always DNS
A layered field guide to diagnosing live DNS outages — resolver, authoritative, caching, and propagation — so you find where name resolution breaks before you touch a record.
Read guide - · 11 min read
Emergency Load-Shedding Playbooks: Dropping Traffic to Stay Alive
When scaling can't outrun an overload, deliberate load-shedding keeps the core service alive. How to rank traffic, design the shed, and recover without re-overloading.
Read guide - · 10 min read
Surviving TLS Certificate Expiry Outages Without Making Them Worse
How to triage and fix a live TLS certificate expiry outage — classify the failure, map the blast radius including mTLS and pinning, and reissue safely with a verified chain.
Read guide - · 11 min read
Taming Retry Storms: When Your Own Clients Attack the Backend
How retry storms and thundering herds turn a small failure into a major outage, how to spot them live, and the mitigations that calm the herd instead of feeding it.
Read guide - · 9 min read
AI-Assisted On-Call Shift Handoff Summaries That Lose Nothing
The worst incidents are the ones that fall through the cracks between shifts. Here's how to use AI to draft on-call handoff summaries so nothing gets dropped.
Read guide - · 15 min read
AI SRE Agents Compared (2026): Bits AI, PagerDuty & More
An honest comparison of AI SRE agents — Datadog Bits AI, PagerDuty SRE Agent, Amazon Q, Copilot for Azure, K8sGPT — by autonomy, grounding, remediation safety, and cost.
Read guide - · 10 min read
Building Rollback Decision Criteria With AI Before the Page
Deciding whether to roll back mid-incident is high stakes and high stress. Here's how to use AI to draft clear rollback criteria ahead of time so the call is faster.
Read guide - · 9 min read
Deduplicating Alert Storms With AI: Find the One Real Cause
When 200 alerts fire in two minutes, the signal drowns. Here's how to use AI to collapse an alert storm into a handful of likely root causes without losing the real one.
Read guide - · 10 min read
Estimating Incident Cost and Financial Impact With AI
Leadership always asks what an outage cost. Here's how to use AI to draft a defensible financial impact estimate fast, without inventing numbers you can't back up.
Read guide - · 9 min read
Generating Game-Day Chaos Scenarios With AI Your Team Hasn't Seen
Game days only build skill if the scenarios are realistic and varied. Here's how to use AI to generate chaos scenarios that stretch your team without trusting it to inject faults.
Read guide - · 9 min read
Monitoring Vendor Status Pages During Incidents With AI
When your incident is actually a vendor's outage, finding out fast saves an hour. Here's how to use AI to triage third-party status pages without trusting it to act.
Read guide - · 9 min read
Reducing Alert Fatigue With AI: Cut Pager Noise, Keep the Signal
Alert fatigue burns out your best responders and hides real incidents. Here's how to use AI to analyze noisy alerts and propose tuning without trusting it to silence anything.
Read guide - · 10 min read
Tracking SLO Breaches and Error Budgets During Incidents With AI
Mid-incident, nobody can do error-budget math in their head. Here's how to use AI to track SLO burn and budget impact in real time so decisions stay grounded in data.
Read guide - · 9 min read
Translating Cryptic Error Logs Into Plain English With AI
A wall of stack traces at 3am helps nobody think clearly. Here's how to use AI to translate cryptic logs into plain-language explanations without trusting it blindly.
Read guide - · 9 min read
Building a Stakeholder Notification Matrix for Incidents
Stop guessing who to notify during an outage. Build a stakeholder notification matrix and use AI to draft the right message for each audience in seconds.
Read guide - · 10 min read
Facilitating the Major Incident Bridge Call Without Chaos
How to run a major incident bridge call that stays focused, with AI handling notes and side-channel synthesis so the facilitator can keep humans coordinated.
Read guide - · 10 min read
Incident Command Handoff During Long-Running Outages
How to transfer incident command cleanly during multi-hour outages, using AI to brief the incoming commander without losing context or stalling the response.
Read guide - · 9 min read
Keeping an Incident Decision Log With AI Support
The decisions made during an incident matter as much as the timeline. Learn to keep a live decision log, with AI capturing the record while humans own the calls.
Read guide - · 9 min read
Protecting Responder Wellbeing After a Major Incident
The incident ends but the toll on responders doesn't. How to protect on-call mental health after major incidents, with AI handling busywork so humans get rest.
Read guide - · 11 min read
Running a Monthly SEV Review Board That Catches Systemic Risk
How to run a recurring SEV review board that spots cross-incident patterns, with AI synthesizing themes across postmortems while humans own the decisions.
Read guide - · 10 min read
Running Incident Tabletop Exercises That Build Real Skill
Tabletop exercises build incident response muscle without touching production. Here's how to run them well and use AI to generate realistic injects and scenarios.
Read guide - · 9 min read
AI-Assisted On-Call Handoffs That Don't Drop Context
Most on-call handoffs lose half the context the moment the shift changes. Here's how to use AI to write a brief the next person can actually act on.
Read guide - · 9 min read
Drafting Customer Incident Updates With AI: Honest and Fast
Customers forgive outages but not silence. Here's how to use AI to draft clear, honest status updates fast, without letting a model overpromise or leak details.
Read guide - · 9 min read
Drafting Runbooks From Resolved Incidents With AI
The best time to write a runbook is right after you've fixed the thing. Here's how to use AI to turn a fresh resolution into a runbook on-call can trust.
Read guide - · 9 min read
Finding Similar Past Incidents With AI: Stop Rediscovering the Fix
Half the incidents you fight at 3am, someone already solved last quarter. Here's how to use AI to surface similar past incidents and stop re-debugging them.
Read guide - · 12 min read
Humanizing Artificial Intelligence in Log Analysis: Turning Raw Server Logs Into Clear DevOps Answers
How AI turns raw Linux, Kubernetes, OpenStack, and application logs into clear, plain-English DevOps troubleshooting steps — with a human still in control.
Read guide - · 9 min read
The AI Incident Scribe: Real-Time Notes Without Pulling a Responder
Every incident needs a scribe, but assigning one means losing a responder. Here's how AI can keep a live incident record while your people stay on the fix.
Read guide - · 10 min read
Using AI to Generate Incident Hypotheses Without Anchoring the Team
A murky incident is where teams tunnel on the wrong cause. Here's how to use AI to broaden your hypothesis list without letting its first guess anchor everyone.
Read guide - · 12 min read
Best AI Tools for Incident Response in 2026 (DevOps & SRE)
A practical, vendor-honest roundup of the best AI tools for incident response in 2026 — triage, log analysis, RCA, postmortems, runbooks, and ChatOps with a human always in the loop.
Read guide - · 8 min read
Configuring PagerDuty and Opsgenie for Incident Response
Most paging tools are configured once and never touched again. Here's how to set up services, escalation policies, and routing that actually hold up under load.
Read guide - · 8 min read
Dependency Mapping: A Service Catalog for Incident Response
When a service goes down at 3am, the first question is 'what else does this take with it?' A dependency map answers it before you have to guess.
Read guide - · 8 min read
Designing an Incident Severity Matrix: Impact vs Urgency
A flat SEV1-SEV4 list breaks down the moment two incidents disagree on severity. Build a two-axis impact-versus-urgency matrix instead.
Read guide - · 9 min read
DevOps On-Call Runbook Types: A 2026 Field Guide
A field guide to DevOps on-call runbook types — diagnostic, remediation, deployment, maintenance — plus automation formats, escalation logic, and runbook vs. playbook vs. SOP.
Read guide - · 8 min read
Follow-the-Sun On-Call: Coverage Across Time Zones
Nobody should be paged at 3am if a teammate across the world is mid-afternoon. Here's how to build follow-the-sun on-call that actually hands off cleanly.
Read guide - · 10 min read
Humanizing Artificial Intelligence in Incident Response: Why DevOps Teams Need AI That Explains, Not Just Automates
Explainable AI in incident response beats black-box automation. Why DevOps teams need AI that shows its reasoning, generates step-by-step remediation, and keeps a human in the approval loop — not a bot that acts on its own.
Read guide - · 8 min read
Incident Metrics That Matter: MTTA, MTTR, and MTBF
A wall of incident KPIs that nobody acts on is just decoration. Here's which metrics actually drive reliability improvements and how to measure them honestly.
Read guide - · 8 min read
Learning From Near-Misses Before They Become Outages
The disk that almost filled. The deploy you caught in staging. Near-misses are free lessons most teams throw away — here's how to harvest them.
Read guide - · 8 min read
The Communications Lead Role in Incident Response
The incident commander runs the fix. The comms lead runs the narrative. On a real SEV1, you need both — here's what the comms lead actually does.
Read guide - · 8 min read
Writing Executive Incident Updates Leadership Will Read
Executives don't want your stack trace. They want impact, confidence, and the next decision point. Here's how to brief leadership during a live incident.
Read guide - · 8 min read
Blast-Radius Mapping: Knowing What Breaks Before It Does
During an outage the killer question is 'what else does this take down?' Here's how to map dependencies and blast radius so you answer it in seconds, not hours.
Read guide - · 8 min read
Building an Incident War Room That Works: Tooling and Roles
A chaotic incident channel makes outages longer. Here's how to set up a war room — the tooling, the roles, the channel discipline — that actually speeds recovery.
Read guide - · 8 min read
Closing the Loop: Making Incident Action Items Actually Get Done
Most postmortem action items die in a backlog and the same incident happens again. Here's how to track follow-through so your learnings actually stick.
Read guide - · 8 min read
Customer Communication During Outages: What to Say and When
How you talk to customers during an outage shapes whether they trust you after. Here's a practical framework for honest, well-timed outage communication.
Read guide - · 8 min read
Cutting Alert Noise: Designing Alerts Engineers Actually Trust
Most on-call pain isn't real incidents — it's noisy alerts that page at 3am for nothing. Here's how to design alerts on symptoms, not causes, and earn back trust.
Read guide - · 9 min read
Handling SLO and SLA Breaches: From Error Budgets to Customer Credits
An SLO breach is an engineering signal; an SLA breach is a contractual one. Here's how to handle both without panic, and how AI helps assess and communicate them.
Read guide - · 9 min read
Observability for Incidents: The Signals You Need Before 3am
Dashboards built for demos are useless during an outage. Here's how to instrument for the questions you'll actually ask at 3am, not the ones that look good.
Read guide - · 8 min read
Onboarding New Engineers to On-Call Without Throwing Them to the Wolves
Putting a new engineer on the pager cold is how you create panic and turnover. Here's a structured on-call onboarding path that builds real confidence.
Read guide - · 8 min read
Building Incident Runbooks Engineers Actually Trust at 3 AM
Most runbooks rot or get ignored mid-incident. Here's how to write runbooks that hold up under pressure, keep them current, and use AI to draft and audit them.
Read guide - · 9 min read
Designing a Healthy On-Call Rotation That Doesn't Burn People Out
On-call burnout is a design problem, not a willpower problem. A veteran SRE's guide to rotation structure, fair load, health metrics, and using AI to reduce noise.
Read guide - · 8 min read
Designing Incident Escalation Policies That Actually Reach Someone
An escalation policy fails the moment a page goes unanswered. A veteran SRE's guide to tiers, timeouts, fallbacks, and using AI to route the right severity faster.
Read guide - · 8 min read
Incident Severity Classification: A Practical SEV1-to-SEV4 Guide
Severity levels decide who wakes up and how fast you move. Here's a clear, real-world rubric for SEV1-SEV4, common mistakes, and how AI helps classify under pressure.
Read guide - · 9 min read
Running Gamedays and Chaos Experiments Without Breaking Production
Gamedays and chaos engineering find weaknesses before customers do. A veteran SRE's guide to safe experiments, blast-radius control, and AI-assisted planning.
Read guide - · 8 min read
Status-Page Communication During Incidents: Templates and Cadence
Good incident comms build trust; bad ones erode it faster than the outage. A veteran SRE's templates, cadence rules, and AI prompts for status-page updates.
Read guide - · 8 min read
The Incident Commander Role Explained for Engineering Teams
The incident commander coordinates, doesn't fix. A veteran SRE breaks down the role, the first five minutes, common mistakes, and where AI lightens the load.
Read guide - · 9 min read
How DevOps Engineers Can Use AI to Triage Production Incidents Faster
The slowest part of most incidents isn't the fix — it's the first 15 minutes of figuring out what's actually broken. Here's how to use AI to compress triage without letting it touch production.
Read guide - · 7 min read
AI-Assisted Incident Response: What Actually Helps at 3 AM
When you're paged at 3 AM, generic LLM advice wastes time. Here's what AI is genuinely good at during incidents — and where it makes things worse.
Read guide
Recommended tools
-
Claude
by Anthropic
4.8
The most cautious and context-aware AI assistant for infrastructure work.
- Best for
- Production troubleshooting, postmortems, IaC review
- Pricing
- Free tier; Pro $20/mo; Team & Enterprise tiers
Read review -
Gemma
by Google DeepMind
4.4
Open-weights LLM family that runs locally — for air-gapped ops, on-prem inference, and privacy-sensitive infrastructure work.
- Best for
- Air-gapped incident response, on-prem log analysis, cost-controlled bulk processing
- Pricing
- Free — open weights under Gemma terms of use; commercial use permitted
Read review -
Datadog Bits AI
by Datadog
4.2
An AI SRE inside Datadog — auto-investigates alerts, queries your telemetry in plain English, and accelerates incident triage.
- Best for
- Investigating alerts and incidents inside Datadog, natural-language queries across metrics/logs/traces
- Pricing
- Bundled with Datadog; AI features vary by plan. Datadog is billed per host and by usage (often expensive at scale)
Read review -
PagerDuty SRE Agent
by PagerDuty
4.0
An agentic AI that triages incidents like an SRE — gathers context, runs diagnostics, drafts comms, and cuts on-call toil.
- Best for
- Automated incident triage, on-call toil reduction, and stakeholder-update drafting
- Pricing
- Part of PagerDuty's AI / Advance add-ons; enterprise pricing (contact sales)
Read review