AI for Incident Response

Synthetic Monitoring for Faster Incident Detection Prompt

Design synthetic checks and journey probes that catch incidents before customers report them — closing the gap between failure and detection (the 'time-to-detect' phase of MTTR).

Target user: SREs and platform engineers reducing detection latency
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an observability engineer who has cut mean-time-to-detect from minutes to seconds by building synthetic probes around the journeys that actually matter. Help me design a synthetic monitoring suite that catches incidents before users do.

I will provide:
- Critical user journeys (signup, checkout, login, search, etc.)
- Current monitoring stack and any existing probes
- Past incidents where we found out from customers, not dashboards
- Geographic footprint and uptime targets

Your job:

1. **Map the journeys worth probing** — rank by revenue and blast radius. Reject the temptation to probe everything; pick the 5-8 flows whose failure is an incident.

2. **Choose probe types** per journey: simple uptime ping vs API contract check vs full browser journey. Justify each — browser probes are expensive and flaky, so use them only where multi-step state matters.

3. **Design assertions that catch real failures** — status code AND latency AND response-body invariants (e.g., "search returns ≥1 result", "checkout total matches cart"). A 200 that returns an error page must fail the probe.

4. **Set frequency and locations** — balance detection speed against cost and rate-limit risk. Probe from the regions your users live in; one US-east probe hides a Europe outage.

5. **Alerting that won't cry wolf** — require N consecutive failures or multi-location agreement before paging, so one flaky run doesn't wake someone. Define the page vs ticket threshold.

6. **Distinguish synthetic failures from real outages** — handle probe-infra problems, expired test credentials, and maintenance windows so the probe failing doesn't masquerade as a product outage.

7. **Tie to SLOs** — show how synthetic success rate feeds availability SLOs and which probe maps to which error budget.

Output: (a) a probe inventory table (journey, type, frequency, locations, assertions), (b) example probe config/pseudo-code for one browser journey, (c) alerting rules with paging thresholds, (d) a rollout order starting with the highest-value journey.

Bias toward: fewer high-signal probes, assertions on business correctness not just HTTP 200, and detection speed over coverage breadth.
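
Example (not part of the prompt): to make steps 3 and 5 concrete, here is a minimal Python sketch of the logic the prompt asks for. It is an illustration under stated assumptions, not any monitoring tool's API. `ProbeResult`, `evaluate_probe`, `should_page`, the 800 ms latency budget, and the thresholds of 3 consecutive failures and 2 locations are all hypothetical. The prompt allows paging on N consecutive failures or on multi-location agreement; this sketch takes the stricter reading and requires both.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    """One synthetic run from one location (hypothetical shape)."""
    location: str
    status_code: int
    latency_ms: float
    body: dict


def evaluate_probe(result: ProbeResult,
                   max_latency_ms: float,
                   invariant: Callable[[dict], bool]) -> bool:
    """Pass only if status, latency, and the body invariant all hold."""
    return (
        result.status_code == 200
        and result.latency_ms <= max_latency_ms
        and invariant(result.body)
    )


def should_page(history: Dict[str, List[bool]],
                consecutive: int = 3,
                min_locations: int = 2) -> bool:
    """Page when at least `min_locations` locations each end their
    history with `consecutive` failed runs in a row."""
    failing_locations = 0
    for runs in history.values():
        recent = runs[-consecutive:]
        # A location counts only with a full window of failures.
        if len(recent) == consecutive and not any(recent):
            failing_locations += 1
    return failing_locations >= min_locations


# Business invariant from step 3: "search returns >= 1 result".
def search_has_results(body: dict) -> bool:
    return len(body.get("results", [])) >= 1


# A fast HTTP 200 with an empty result set must still fail the probe.
empty_200 = ProbeResult("us-east", 200, 120.0, {"results": []})
print(evaluate_probe(empty_200, 800.0, search_has_results))  # False

# Pass/fail history per location, oldest run first.
history = {
    "us-east": [True, False, False, False],
    "eu-west": [False, False, False],
    "ap-south": [True, True, False],
}
print(should_page(history))  # True: two locations have 3 straight failures
```

A single flaky run (as in `ap-south` above) never pages on its own under these rules. Lower-severity signals, such as one location failing persistently, could be routed to a ticket instead.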
