GCP Cloud Monitoring Alert Policy & SLO Design Prompt
Design Cloud Monitoring alerting policies and SLOs that page on user-facing pain, not on noisy static thresholds, by choosing the right metric, condition, burn-rate windows, and notification routing.
- Target user: SRE and ops engineers building alerting on GCP
- Difficulty: Advanced
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior SRE who designs Cloud Monitoring alert policies and SLOs around user impact, so on-call gets paged for real problems and ignores noise.

I will provide:
- The service and its user-facing SLIs (availability, latency, error rate) and any current SLO targets
- Available metrics: Cloud Monitoring metric types, MQL/PromQL fragments, or a description of what's instrumented
- Existing alert policies that are too noisy or too quiet, plus the notification channels and on-call expectations
- Traffic shape (steady vs spiky) and acceptable error budget / paging frequency

Your job:
1. **Pick the right SLI**: choose request-based or latency-based indicators that track what users feel, and reject vanity metrics (raw CPU) as paging signals.
2. **Set the SLO and budget**: recommend a defensible target and the resulting error budget, and explain the trade-off of tightening it.
3. **Design burn-rate alerts**: define multi-window, multi-burn-rate conditions (fast-burn page, slow-burn ticket) instead of a single static threshold, with the MQL/PromQL for each.
4. **Tune conditions**: set duration, aligner/reducer, and group-by so transient blips don't page and real degradations do; add absence/heartbeat alerts where needed.
5. **Route correctly**: map fast-burn to paging and slow-burn to ticket channels, and recommend severity labels.
6. **Cut noise**: flag the existing policies to delete or merge and why.

Output as: (a) SLI/SLO definition, (b) burn-rate alert conditions with queries, (c) notification routing, (d) policies to retire.

Advisory only: produce the config and queries, but do not assume you can deploy them.
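Steps 2 and 3 of the prompt rest on a small amount of arithmetic that is worth checking by hand before trusting any generated alert condition. The Python sketch below is illustrative only and is not part of the prompt. It assumes a 30-day SLO period, and the window lengths and budget fractions (2% in 1 hour, 5% in 6 hours, 10% in 3 days) are assumed starting values in the style of the Google SRE Workbook, not values taken from this page. The names `BurnRateRule`, `RULES`, and `fires` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One multi-window burn-rate condition. All values are illustrative."""
    name: str
    long_window_hours: float   # window that proves the burn is sustained
    short_window_hours: float  # window that proves the burn is still happening
    budget_fraction: float     # share of the period's budget spent in the long window
    action: str                # "page" or "ticket"


# Assumed starting points, to be tuned per service and traffic shape.
RULES = [
    BurnRateRule("fast-burn", 1, 5 / 60, 0.02, "page"),
    BurnRateRule("medium-burn", 6, 0.5, 0.05, "page"),
    BurnRateRule("slow-burn", 72, 6, 0.10, "ticket"),
]


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.999 gives roughly 0.001."""
    return 1.0 - slo_target


def burn_rate_threshold(rule: BurnRateRule, period_days: int = 30) -> float:
    """Burn rate at which `budget_fraction` of the budget is spent in the long window."""
    return rule.budget_fraction * period_days * 24 / rule.long_window_hours


def fires(rule: BurnRateRule, slo_target: float,
          long_error_ratio: float, short_error_ratio: float,
          period_days: int = 30) -> bool:
    """True only when BOTH windows burn budget at or above the rule's threshold."""
    budget = error_budget(slo_target)
    threshold = burn_rate_threshold(rule, period_days)
    long_burn = long_error_ratio / budget
    short_burn = short_error_ratio / budget
    return long_burn >= threshold and short_burn >= threshold


if __name__ == "__main__":
    for rule in RULES:
        print(rule.name, rule.action, round(burn_rate_threshold(rule), 1))
```

Requiring both windows to exceed the threshold is what keeps a short spike from paging: the long window shows the burn is significant, and the short window shows it has not already stopped. The error ratios passed to `fires` would come from whatever request-based SLI the service exposes over each window.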