
Grafana SLO Burn-Rate Dashboard Design Prompt

Design a Grafana SLO dashboard that visualizes error-budget remaining, multi-window burn rate, and time-to-exhaustion so stakeholders see reliability health at a glance.

Target user: SREs and reliability leads building executive and on-call SLO views
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a reliability engineer who has built SLO dashboards used in weekly ops reviews and on-call triage, and you know the difference between a dashboard that informs decisions and one that just looks busy.

I will provide:
- My SLI definition (e.g. availability = good / total requests) and the PromQL
- The SLO target (e.g. 99.9%) and rolling window (e.g. 28d)
- Whether recording rules already exist for the SLI ratio
- The audience (executives, on-call, both)

Your job:

1. **Lay out the panels top-to-bottom by audience** — headline stat (budget remaining %), then trend, then burn-rate, then drill-down. Justify the order from a triage perspective.

2. **Error budget remaining** — give the PromQL for the SLI over the window (`1 - (errors / total)` over 28d), convert it to budget remaining as `1 - (error ratio / (1 - SLO target))`, and show it as a percentage in a stat panel with thresholds (green/amber/red). Show how to render "X minutes of downtime left."

3. **Burn-rate panel** — plot fast (1h) and slow (6h) burn rates on one time series with threshold lines at the alerting multiples (e.g. 14.4× and 6×, which assume a 30d window; rescale them for my window). Explain what crossing each line means.

4. **Time-to-exhaustion** — project when the budget runs out from the budget remaining and the current burn rate; render as a single stat ("budget gone in ~3h at current rate").

5. **Recording rules** — define `:sli_ratio_rate5m`, `:error_budget_remaining`, and burn-rate recording rules so the dashboard is cheap to render; give the rule group YAML.

6. **Template variables** — `service`, `slo`, and `window` as Grafana variables; show the `label_values()` queries and how panels parameterize off them.

7. **Multi-SLO overview row** — a table panel ranking services by budget consumed, with conditional color, so leadership sees the worst offenders first.

8. **Annotations** — overlay deploys and incidents so budget burn correlates with change events.

9. **Validation** — describe what the dashboard should show during (a) steady state, (b) a brief outage that recovers, (c) sustained degradation, and confirm the burn-rate lines react correctly.

Output as: (a) recording-rule YAML, (b) per-panel PromQL, (c) Grafana variable definitions, (d) panel/threshold settings, (e) a JSON skeleton or panel list, (f) the multi-SLO table query.

Bias toward decisions over decoration; every panel must answer "should I act, and how urgently?"
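
Sanity-checking the output

The arithmetic behind items 2 to 4 of the prompt is simple enough to check by hand, which helps when reviewing the PromQL the model returns. The Python below is a minimal sketch of that arithmetic, not part of the prompt. The function names are hypothetical, and the budget fractions in the usage example (2% of budget in 1h, 5% in 6h) are the widely used multi-window alerting defaults, assumed here because the prompt only gives the resulting multiples.

```python
import math


def error_budget_remaining(error_ratio: float, slo_target: float) -> float:
    """Fraction of the error budget left, given the error ratio over the full SLO window."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for a 99.9% target
    return 1.0 - error_ratio / budget


def downtime_minutes_left(remaining: float, slo_target: float, window_days: float) -> float:
    """Translate the remaining budget fraction into minutes of full downtime."""
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    return remaining * budget_minutes


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Burn rate over a short window: 1.0 spends exactly the whole budget in one SLO window."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(remaining: float, current_burn_rate: float, window_days: float) -> float:
    """Hours until the budget reaches zero if the current burn rate holds."""
    if current_burn_rate <= 0:
        return math.inf  # nothing is burning, so the budget never runs out
    return remaining * window_days * 24 / current_burn_rate


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float, window_days: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `alert_window_hours`."""
    return budget_fraction * window_days * 24 / alert_window_hours


if __name__ == "__main__":
    slo, window = 0.999, 28  # the example target and window from the prompt
    remaining = error_budget_remaining(error_ratio=0.0005, slo_target=slo)
    print(f"budget remaining: {remaining:.1%}")
    print(f"downtime left: {downtime_minutes_left(remaining, slo, window):.1f} min")
    print(f"1h burn rate: {burn_rate(0.0144, slo):.1f}x")
    print(f"hours to exhaustion: {hours_to_exhaustion(remaining, 14.4, window):.1f}")
    # Assumed budget fractions: 2% in 1h (fast) and 5% in 6h (slow).
    print(f"fast threshold: {burn_rate_threshold(0.02, 1, window):.2f}x")
    print(f"slow threshold: {burn_rate_threshold(0.05, 6, window):.2f}x")
```

With a 30d window the two thresholds come out at 14.4× and 6×, matching the example multiples in the prompt; with a 28d window they are slightly lower, so the threshold lines on the burn-rate panel should follow whatever window you pass in.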
