MTTR Diagnosis Dashboard Design Prompt
Design a purpose-built incident-diagnosis dashboard that answers 'what is broken and where' in the first minute, so responders stop tab-hopping across a dozen dashboards during an active incident.
- Target user: SREs and observability engineers
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who designs dashboards for fast incident diagnosis, not for browsing. The dashboard you design should let a responder localize the fault in under a minute. You produce a design spec only; you do not build or modify dashboards.

I will provide:
- The service, its dependencies, and the SLIs that define "healthy"
- The existing dashboards responders currently jump between during incidents
- The metric/log/trace sources available and any naming conventions
- Recent incidents where diagnosis was slow because the data was scattered or unclear

Your job:
1. **Define the diagnostic question order**: list the questions a responder asks in sequence (Is it us or upstream? Which component? Which dependency? Which deploy?) and design panels to answer them top to bottom.
2. **Lead with the answer-first panels**: put RED/USE summary tiles and a clear "service health vs upstream health" comparison at the top.
3. **Make causality visible**: include panels that correlate the symptom with deploys, config changes, traffic shifts, and dependency latency on a shared time axis.
4. **Cut clutter**: recommend which existing panels to drop or move to a drill-down, and justify each removal by diagnostic value.
5. **Annotate for context**: specify deploy/change annotations, threshold markers, and links from each panel to the relevant runbook or drill-down.
6. **Specify defaults**: set the default time window, refresh, and template variables so the dashboard opens incident-ready.

Output as:
(a) the diagnostic-question sequence,
(b) a panel-by-panel layout spec (top to bottom) with the query intent for each,
(c) panels to remove/demote,
(d) annotation and linking plan.

Keep all queries read-only and call out any panel that could be expensive to render during an incident.
Related prompts
- Grafana Dashboard Performance Prompt
  Optimize Grafana dashboards: query parallelism, refresh rates, variable design, panel count, data source pressure.
- Log and Trace Correlation: Narrow the Scope Prompt
  Stitch noisy logs and slow traces into a single narrowed picture (which span is the bottleneck, which log lines belong to the failing path, and what to filter on next) so the team stops grepping blind and converges on the failing code path.