GCP Cloud Logging Incident Root-Cause Analysis Prompt
Work backward from a production incident through Cloud Logging, Error Reporting, and audit logs to find the root cause and the change that triggered it — correlating across services by trace and time, not eyeballing one log stream.
- Target user: SRE and on-call engineers investigating GCP incidents
- Difficulty: Advanced
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior SRE leading a calm, structured root-cause analysis of a GCP incident, correlating Cloud Logging, Error Reporting, and audit logs instead of guessing from one noisy stream.

I will provide:
- The incident: symptom, start time, affected service(s), and current status
- Logs Explorer output: error/warning entries with `resource.type`, `severity`, `httpRequest`, `trace`, and `jsonPayload` fields across the involved services
- Error Reporting groups (frequency, first/last seen) and any spike timing
- Cloud Audit Logs around the incident window (Admin Activity / Data Access) showing recent deploys, IAM changes, or config edits
- Relevant deploy/change timeline if known

Your job:
1. **Build a timeline**: align the error onset against the audit log to find the change (deploy, config, IAM, quota) that immediately preceded the incident.
2. **Correlate by trace**: use `trace`/`spanId` and request IDs to follow a failing request across services and find where it actually breaks vs where it surfaces.
3. **Separate cause from symptom**: distinguish the originating error from the cascade of downstream timeouts and retries it produced.
4. **Quantify blast radius**: from severity and request counts, estimate which users/endpoints were affected and for how long.
5. **Confirm the trigger**: point to the specific audit-log entry or revision/config diff that best explains the onset, and state your confidence level.
6. **Recommend next steps**: the immediate mitigation (rollback/scale/feature-flag), a verification query, and the follow-up to prevent recurrence.

Output as: (a) timeline of onset vs change, (b) root cause with the supporting log/audit lines, (c) blast radius, (d) mitigation + verification query.

Read-only/advisory: analyze the logs and recommend; do not assume you can execute changes.
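To sanity-check the model's answer for steps 1 and 2, the same correlation can be reproduced offline on exported log entries. The Python sketch below is only an illustration under stated assumptions: entries are JSON-style dicts carrying the `LogEntry` fields `timestamp`, `severity`, `trace`, `resource.labels.service_name`, and (for audit logs) `protoPayload.methodName`. The function names, sample services, and sample timestamps are hypothetical. "Latest change before onset" and "earliest error in a trace" are heuristics that nominate a candidate trigger and origin; they do not prove causation.

```python
from datetime import datetime

# Numeric ranks follow the Cloud Logging LogSeverity enum ordering.
SEVERITY_RANK = {"DEFAULT": 0, "DEBUG": 100, "INFO": 200, "NOTICE": 300,
                 "WARNING": 400, "ERROR": 500, "CRITICAL": 600,
                 "ALERT": 700, "EMERGENCY": 800}


def parse_ts(ts):
    """Parse an RFC 3339 UTC timestamp such as 2024-05-01T12:00:03Z.

    Assumes at most microsecond precision; nanosecond timestamps would
    need truncating first on Python versions before 3.11.
    """
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def _is_failure(entry, floor):
    return SEVERITY_RANK.get(entry.get("severity", "DEFAULT"), 0) >= floor


def error_onset(entries, min_severity="ERROR"):
    """Return the timestamp of the earliest entry at or above min_severity."""
    floor = SEVERITY_RANK[min_severity]
    times = [parse_ts(e["timestamp"]) for e in entries if _is_failure(e, floor)]
    return min(times) if times else None


def preceding_change(audit_entries, onset):
    """Return the latest audit entry at or before onset, or None if there is none."""
    before = [a for a in audit_entries if parse_ts(a["timestamp"]) <= onset]
    return max(before, key=lambda a: parse_ts(a["timestamp"]), default=None)


def first_failure_per_trace(entries, min_severity="ERROR"):
    """Map each trace ID to its earliest failing entry (the likely origin)."""
    floor = SEVERITY_RANK[min_severity]
    origin = {}
    for e in entries:
        trace = e.get("trace")
        if not trace or not _is_failure(e, floor):
            continue
        seen = origin.get(trace)
        if seen is None or parse_ts(e["timestamp"]) < parse_ts(seen["timestamp"]):
            origin[trace] = e
    return origin


# Hypothetical sample data: the frontend surfaces an error that started in checkout.
SAMPLE_LOGS = [
    {"timestamp": "2024-05-01T12:00:05Z", "severity": "ERROR",
     "trace": "projects/demo/traces/abc",
     "resource": {"labels": {"service_name": "frontend"}}},
    {"timestamp": "2024-05-01T12:00:03Z", "severity": "ERROR",
     "trace": "projects/demo/traces/abc",
     "resource": {"labels": {"service_name": "checkout"}}},
    {"timestamp": "2024-05-01T11:59:00Z", "severity": "INFO",
     "trace": "projects/demo/traces/old",
     "resource": {"labels": {"service_name": "frontend"}}},
]
SAMPLE_AUDIT = [
    {"timestamp": "2024-05-01T11:30:00Z",
     "protoPayload": {"methodName": "SetIamPolicy"}},
    {"timestamp": "2024-05-01T11:58:40Z",
     "protoPayload": {"methodName": "google.cloud.run.v1.Services.ReplaceService"}},
    {"timestamp": "2024-05-01T12:10:00Z",
     "protoPayload": {"methodName": "SetIamPolicy"}},
]

if __name__ == "__main__":
    onset = error_onset(SAMPLE_LOGS)
    change = preceding_change(SAMPLE_AUDIT, onset)
    print("onset:", onset.isoformat())
    print("candidate trigger:", change["protoPayload"]["methodName"])
    for trace, entry in first_failure_per_trace(SAMPLE_LOGS).items():
        print(trace, "first failed in", entry["resource"]["labels"]["service_name"])
```

If the model's timeline names a different trigger or originating service than this cross-check, that disagreement is a useful follow-up question to put back to it along with the relevant log lines.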
Related prompts
- GCP Cloud Logging & Monitoring MQL Query Builder Prompt: Turn a vague 'find me the errors' or 'alert when latency spikes' request into precise Logs Explorer queries and Monitoring MQL/PromQL that return exactly the signal you need.
- GCP Cloud Monitoring Alert Policy & SLO Design Prompt: Design Cloud Monitoring alerting policies and SLOs that page on user-facing pain, not noisy threshold alerts, by choosing the right metric, condition, burn-rate windows, and notification routing.