Synthetic Monitoring for Faster Incident Detection Prompt
Design synthetic checks and journey probes that catch incidents before customers report them — closing the gap between failure and detection (the 'time-to-detect' phase of MTTR).
- Target user: SREs and platform engineers reducing detection latency
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an observability engineer who has cut mean-time-to-detect from minutes to seconds by building synthetic probes around the journeys that actually matter. Help me design a synthetic monitoring suite that catches incidents before users do.

I will provide:
- Critical user journeys (signup, checkout, login, search, etc.)
- Current monitoring stack and any existing probes
- Past incidents where we found out from customers, not dashboards
- Geographic footprint and uptime targets

Your job:
1. **Map the journeys worth probing**: rank by revenue and blast radius. Reject the temptation to probe everything; pick the 5-8 flows whose failure is an incident.
2. **Choose probe types** per journey: simple uptime ping vs API contract check vs full browser journey. Justify each choice. Browser probes are expensive and flaky, so use them only where multi-step state matters.
3. **Design assertions that catch real failures**: status code AND latency AND response-body invariants (e.g., "search returns ≥1 result", "checkout total matches cart"). A 200 that returns an error page must fail the probe.
4. **Set frequency and locations**: balance detection speed against cost and rate-limit risk. Probe from the regions your users live in; one US-east probe hides a Europe outage.
5. **Alerting that won't cry wolf**: require N consecutive failures or multi-location agreement before paging, so one flaky run doesn't wake someone. Define the page vs ticket threshold.
6. **Distinguish synthetic failures from real outages**: handle probe-infra problems, expired test credentials, and maintenance windows so the probe failing doesn't masquerade as a product outage.
7. **Tie to SLOs**: show how synthetic success rate feeds availability SLOs and which probe maps to which error budget.
Output: (a) a probe inventory table (journey, type, frequency, locations, assertions), (b) example probe config/pseudo-code for one browser journey, (c) alerting rules with paging thresholds, (d) a rollout order starting with the highest-value journey.

Bias toward: fewer high-signal probes, assertions on business correctness not just HTTP 200, and detection speed over coverage breadth.
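
To make steps 3 and 5 of the prompt concrete, here is a minimal Python sketch of what a combined assertion and a page-vs-ticket decision might look like. It is an illustration only, not tied to any monitoring product: `ProbeResult`, `probe_passed`, `alert_level`, the region names, and the defaults (3 consecutive failures, 2 agreeing locations, a 500 ms latency budget) are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    """One synthetic run, as a hypothetical probe runner might record it."""
    location: str
    status_code: int
    latency_ms: float
    body: dict


def probe_passed(result: ProbeResult, max_latency_ms: float,
                 invariant: Callable[[dict], bool]) -> bool:
    """Pass only if status, latency, AND the body invariant all hold."""
    return (
        result.status_code == 200
        and result.latency_ms <= max_latency_ms
        and invariant(result.body)
    )


def alert_level(history: Dict[str, List[bool]], consecutive: int = 3,
                min_locations: int = 2) -> str:
    """Return 'page', 'ticket', or 'ok'.

    history maps location -> pass/fail results, oldest first (True = pass).
    Page on N consecutive failures in any one location, or when the latest
    run failed in at least min_locations locations at once.
    """
    streak = any(
        len(runs) >= consecutive and not any(runs[-consecutive:])
        for runs in history.values()
    )
    latest_failures = sum(
        1 for runs in history.values() if runs and not runs[-1]
    )
    if streak or latest_failures >= min_locations:
        return "page"
    if latest_failures:
        return "ticket"  # a single failing run: record it, do not wake anyone
    return "ok"


# Assumed invariant for a search journey: at least one result comes back.
search_ok = lambda body: len(body.get("results", [])) >= 1

# A 200 with an empty result set is a soft failure and must fail the probe.
soft_error = ProbeResult("eu-west", 200, 120.0, {"results": []})
print(probe_passed(soft_error, 500.0, search_ok))  # False
```

With these assumed defaults, a single failed run in one region yields `ticket`, while three failures in a row in one region, or simultaneous failures in two regions, yields `page`. Real thresholds would come out of the frequency and cost trade-offs in step 4.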