
GCP Cloud Logging Incident Root-Cause Analysis Prompt

Work backward from a production incident through Cloud Logging, Error Reporting, and audit logs to find the root cause and the change that triggered it — correlating across services by trace and time, not eyeballing one log stream.

Target user: SRE and on-call engineers investigating GCP incidents
Difficulty: Advanced
Tools: Claude, ChatGPT, Cursor

The prompt

You are a senior SRE leading a calm, structured root-cause analysis of a GCP incident, correlating Cloud Logging, Error Reporting, and audit logs instead of guessing from one noisy stream.

I will provide:
- The incident: symptom, start time, affected service(s), and current status
- Logs Explorer output: error/warning entries with `resource.type`, `severity`, `httpRequest`, `trace`, `spanId`, and `jsonPayload` fields across the involved services
- Error Reporting groups (frequency, first/last seen) and any spike timing
- Cloud Audit Logs around the incident window (Admin Activity / Data Access) showing recent deploys, IAM changes, or config edits
- Relevant deploy/change timeline if known

Your job:

1. **Build a timeline** — align the error onset against the audit log to find the change (deploy, config, IAM, quota) that immediately preceded the incident.
2. **Correlate by trace** — use `trace`/`spanId` and request IDs to follow a failing request across services and find where it actually breaks vs where it surfaces.
3. **Separate cause from symptom** — distinguish the originating error from the cascade of downstream timeouts and retries it produced.
4. **Quantify blast radius** — from severity and request counts, estimate which users/endpoints were affected and for how long.
5. **Confirm the trigger** — point to the specific audit-log entry or revision/config diff that best explains the onset, with confidence level.
6. **Recommend next steps** — the immediate mitigation (rollback/scale/feature-flag), a verification query, and the follow-up to prevent recurrence.

Output as: (a) timeline of onset vs change, (b) root cause with the supporting log/audit lines, (c) blast radius, (d) mitigation + verification query. Read-only/advisory — analyze the logs and recommend; do not assume you can execute changes.
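
As a rough illustration of what steps 1 to 3 ask for, here is a minimal Python sketch that works on log entries exported as dictionaries. It assumes entries carry the standard `timestamp`, `severity`, `trace`, and `resource.labels` fields and that audit entries expose `protoPayload.methodName`. The sample data, project name, service names, and helper functions are all hypothetical, and treating the earliest error in a trace as the origin is only a heuristic to confirm against the actual spans.

```python
from collections import defaultdict
from datetime import datetime

# Cloud Logging severities at ERROR or above.
ERROR_LEVELS = {"ERROR", "CRITICAL", "ALERT", "EMERGENCY"}


def parse_ts(ts):
    # Assumes RFC 3339 UTC timestamps with at most microsecond precision.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def error_onset(entries):
    """Return the earliest timestamp at severity ERROR or above, or None."""
    times = [parse_ts(e["timestamp"]) for e in entries
             if e.get("severity") in ERROR_LEVELS]
    return min(times) if times else None


def preceding_change(audit_entries, onset):
    """Return the latest audit entry at or before the onset, or None."""
    earlier = [a for a in audit_entries if parse_ts(a["timestamp"]) <= onset]
    return max(earlier, key=lambda a: parse_ts(a["timestamp"]), default=None)


def first_error_per_trace(entries):
    """Map each trace to its earliest error entry (a candidate origin)."""
    by_trace = defaultdict(list)
    for e in entries:
        if e.get("trace") and e.get("severity") in ERROR_LEVELS:
            by_trace[e["trace"]].append(e)
    return {t: min(es, key=lambda e: parse_ts(e["timestamp"]))
            for t, es in by_trace.items()}


def run_entry(ts, severity, service, trace="projects/demo/traces/abc"):
    # Hypothetical helper that builds a Cloud Run style log entry.
    return {"timestamp": ts, "severity": severity, "trace": trace,
            "resource": {"type": "cloud_run_revision",
                         "labels": {"service_name": service}}}


# Made-up sample data: "checkout" surfaces the failure, "payments" fails first.
LOGS = [
    run_entry("2024-05-01T12:03:00Z", "INFO", "checkout"),
    run_entry("2024-05-01T12:04:10Z", "ERROR", "checkout"),
    run_entry("2024-05-01T12:04:09Z", "ERROR", "payments"),
]
# Method names are illustrative placeholders, not copied from real audit logs.
AUDIT = [
    {"timestamp": "2024-05-01T11:30:00Z", "protoPayload": {"methodName": "SetIamPolicy"}},
    {"timestamp": "2024-05-01T12:02:30Z", "protoPayload": {"methodName": "ReplaceService"}},
    {"timestamp": "2024-05-01T12:10:00Z", "protoPayload": {"methodName": "UpdateConfig"}},
]

if __name__ == "__main__":
    onset = error_onset(LOGS)
    change = preceding_change(AUDIT, onset)
    print("onset:", onset.isoformat())
    print("preceding change:", change["protoPayload"]["methodName"])
    for trace, entry in first_error_per_trace(LOGS).items():
        print(trace, "first fails in", entry["resource"]["labels"]["service_name"])
```

The change closest before the onset is only a candidate trigger. The prompt still asks the model to justify it with a confidence level, since an unrelated edit can land in the same window.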


Related prompts

GCP Cloud Logging & Monitoring MQL Query Builder Prompt

GCP Cloud Monitoring Alert Policy & SLO Design Prompt
