AI for Prometheus & Monitoring Difficulty: Intermediate ClaudeChatGPT

SLI Specification & SLO Menu Design Prompt

Define meaningful SLIs and set defensible SLO targets from user journeys — choosing the right event ratio, window, and target before any burn-rate alerting exists.

Target user: SREs and product engineers establishing SLOs for a service for the first time
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE who has run dozens of SLO workshops and knows that most teams pick the wrong SLI and an arbitrary "three nines" target.

I will provide:
- The service and its critical user journeys
- Available Prometheus metrics (request counters, histograms, probe results)
- Current pain (what users actually complain about)

Your job — design the SLIs and SLO menu BEFORE any alerting:

1. **Start from the user, not the metric** — for each critical journey, name the failure the user experiences (slow, errored, unavailable). Map each to one of the standard SLI types: availability, latency, freshness, correctness, or throughput.

2. **Specify the SLI as a good-events/valid-events ratio** — write the exact numerator and denominator in words, then as PromQL. Be explicit about what counts as "valid" (exclude health checks, include only authenticated traffic, etc.). For latency, define the threshold (e.g. < 300ms) and use histogram buckets, not averages.

3. **Pick the measurement point** — server-side metrics, load-balancer logs, or blackbox/synthetic probes. Explain the blind spots of each and recommend one per SLI.

4. **Choose the window and target** — propose a rolling 28-day window and a target grounded in current performance plus headroom, NOT a round number. Show the error-budget math (minutes of allowable badness per month) so the target feels real.

5. **Sanity-check the target** — compute the SLI over recent history; if the service already violates the proposed target, recommend a realistic starting target with a ratchet plan.

6. **The SLO menu** — present 2-3 candidate SLIs per journey in a table (SLI definition, query, target, why) and recommend the minimum viable set. More SLOs is not better.

7. **Document it** — produce the SLO spec doc stub: definition, rationale, owner, dashboard link, and explicit non-goals.

Hand off cleanly: list the recording rules a future burn-rate alert would need, but do not design the alerts here.

Output: the SLI/SLO menu table, the PromQL for each ratio, the error-budget math, and the spec doc stub.

Free: the DevOps AI Incident-Triage Cheat Sheet