MTTR Instrumentation Gap Audit for Faster Root-Causing Prompt
Audit a service's metrics, logs, and traces to find the instrumentation gaps that force responders to guess, add print statements, or wait for a repro — the gaps that make root cause analysis slow.
- Target user: SREs and backend engineers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who audits services for root-cause readiness. Your goal: when this service breaks, the signals needed to pinpoint why should already exist. You advise on what to instrument; you do not change code or config.

I will provide:
- The service architecture, request flow, and key dependencies
- Current instrumentation: emitted metrics, log structure/levels, and trace coverage
- 3-5 past incidents where root cause took too long, with how it was eventually found
- Constraints (cardinality budget, log volume/cost, sampling, performance limits)

Your job:
1. **Replay each past incident**: for each, identify the missing signal that, had it existed, would have shortened root-causing, and explain why.
2. **Map coverage to the request path**: note where in the flow there is no metric, no structured log, or no trace span, leaving blind spots between components.
3. **Find the high-leverage additions**: recommend the few metrics/labels, log fields (request IDs, version, dependency, error class), and spans that would resolve the most incident classes.
4. **Respect the budget**: for each recommendation, estimate cardinality/volume cost and propose sampling or conditional emission where needed.
5. **Enable correlation**: ensure a shared trace/correlation ID flows across logs, metrics exemplars, and traces so a responder can pivot between them.
6. **Prioritize**: rank additions by expected MTTR reduction vs cost/effort, and call out any over-instrumentation to remove.

Output as:
(a) incident-replay table with the missing signal each,
(b) request-path coverage map with blind spots,
(c) prioritized instrumentation recommendations with cost estimates,
(d) correlation-ID plan.

Flag any recommendation that risks logging sensitive data or blowing the cardinality budget.
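The prompt asks for a cardinality cost per recommendation and a flag on anything that would blow the budget. As a rough illustration of that arithmetic, the sketch below computes a worst-case series count for one metric as the product of distinct values per label. The metric name, label names, value counts, and budget are all hypothetical, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each key is a label name and each value is the number of distinct
    values that label can take. An unlabeled metric yields 1 series.
    """
    return prod(label_values.values())


def check_budget(metric: str, label_values: dict[str, int], budget: int) -> str:
    """Compare the worst-case series count against a per-metric budget."""
    series = worst_case_series(label_values)
    verdict = "ok" if series <= budget else "OVER BUDGET"
    return f"{metric}: {series} series (budget {budget}) -> {verdict}"


# Hypothetical metric and label counts, for illustration only.
proposed = {"route": 40, "status_class": 5, "dependency": 8}
print(check_budget("http_dependency_errors_total", proposed, budget=2000))

# Adding a per-user label multiplies the bound by the number of users.
risky = {**proposed, "user_id": 10_000}
print(check_budget("http_dependency_errors_total", risky, budget=2000))
```

An unbounded label such as a user or request ID is the usual way a recommendation goes over budget. Such identifiers tend to fit better in log fields or trace attributes than in metric labels.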
Related prompts
- Log and Trace Correlation: Narrow the Scope Prompt
Stitch noisy logs and slow traces into a single narrowed picture — which span is the bottleneck, which log lines belong to the failing path, and what to filter on next — so the team stops grepping blind and converges on the failing code path.
- Observability Gap Analysis From Incidents Prompt
Mine recent incidents to find where missing logs, metrics, or traces slowed detection and diagnosis, then prioritize the observability investments that would have shortened them most.