Grafana SLO Burn-Rate Dashboard Design Prompt
Design a Grafana SLO dashboard that visualizes error-budget remaining, multi-window burn rate, and time-to-exhaustion so stakeholders see reliability health at a glance.
- Target user: SREs and reliability leads building executive and on-call SLO views
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a reliability engineer who has built SLO dashboards used in weekly ops reviews and on-call triage, and you know the difference between a dashboard that informs decisions and one that just looks busy.
I will provide:
- My SLI definition (e.g. availability = good / total requests) and the PromQL
- The SLO target (e.g. 99.9%) and rolling window (e.g. 28d)
- Whether recording rules already exist for the SLI ratio
- The audience (executives, on-call, both)
Your job:
1. **Lay out the panels top-to-bottom by audience** — headline stat (budget remaining %), then trend, then burn-rate, then drill-down. Justify the order from a triage perspective.
2. **Error budget remaining** — give the PromQL for the SLI `1 - (errors / total)` over the 28d window, convert it to a budget-remaining percentage with `1 - (error ratio / (1 - target))`, and show it in a stat panel with thresholds (green/amber/red). Show how to render "X minutes of downtime left."
3. **Burn-rate panel** — plot fast (1h) and slow (6h) burn rates on one time-series panel with threshold lines at the alerting multiples (e.g. 14.4× and 6×, which assume a 30d window; rescale them for other windows). Explain what crossing each line means.
4. **Time-to-exhaustion** — compute projected budget-exhaustion time using current burn rate; render as a single stat ("budget gone in ~3h at current rate").
5. **Recording rules** — define `:sli_ratio_rate5m`, `:error_budget_remaining`, and burn-rate recording rules so the dashboard is cheap to render; give the rule group YAML.
6. **Template variables** — `service`, `slo`, and `window` as Grafana variables; show the `label_values` queries and how panels parameterize off them.
7. **Multi-SLO overview row** — a table panel ranking services by budget consumed, with conditional color, so leadership sees the worst offenders first.
8. **Annotations** — overlay deploys and incidents so budget burn correlates with change events.
9. **Validation** — describe what the dashboard should show during (a) steady state, (b) a brief outage that recovers, (c) sustained degradation, and confirm the burn-rate lines react correctly.
Output as: (a) recording-rule YAML, (b) per-panel PromQL, (c) Grafana variable definitions, (d) panel/threshold settings, (e) a JSON skeleton or panel list, (f) the multi-SLO table query.
Bias toward decisions over decoration; every panel must answer "should I act, and how urgently?"
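
For reference, the arithmetic behind items 2 to 4 can be sketched as follows. This is a minimal Python sketch, not Grafana or Prometheus API: every function name is hypothetical, and it assumes the SLI is a good/total ratio, that burn rate means the observed error ratio divided by the budget `1 - target`, and that the current burn rate holds constant when projecting exhaustion.

```python
import math

# All names below are illustrative (hypothetical), not a Grafana or Prometheus API.


def error_budget(slo_target: float) -> float:
    """Allowed error ratio, e.g. 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def budget_remaining(window_error_ratio: float, slo_target: float) -> float:
    """Fraction of the budget left over the SLO window (negative if overspent)."""
    return 1.0 - window_error_ratio / error_budget(slo_target)


def downtime_minutes_left(remaining: float, slo_target: float, window_days: float) -> float:
    """Remaining budget expressed as minutes of full outage."""
    return remaining * error_budget(slo_target) * window_days * 24 * 60


def hours_to_exhaustion(remaining: float, current_burn_rate: float, window_days: float) -> float:
    """Projected hours until the budget is gone, assuming the burn rate stays constant."""
    if current_burn_rate <= 0:
        return math.inf  # nothing is being consumed right now
    return remaining * window_days * 24 / current_burn_rate


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float, window_days: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `alert_window_hours`."""
    return budget_fraction * window_days * 24 / alert_window_hours


if __name__ == "__main__":
    target, days = 0.999, 28
    remaining = budget_remaining(0.0004, target)
    print("budget remaining:", round(remaining, 3))
    print("downtime minutes left:", round(downtime_minutes_left(remaining, target, days), 1))
    print("hours to exhaustion at 14.4x:", round(hours_to_exhaustion(remaining, 14.4, days), 1))
    # The same alert policy (2% of budget in 1h) gives a different multiple per window.
    print("fast threshold, 30d:", round(burn_rate_threshold(0.02, 1, 30), 2))
    print("fast threshold, 28d:", round(burn_rate_threshold(0.02, 1, days), 2))
```

The last two lines show why the threshold lines should be derived from the SLO window rather than hard-coded: a policy of "2% of budget in 1h" yields 14.4× for a 30d window but a lower multiple for 28d.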