Prometheus Exemplars & Trace Correlation Prompt
Wire Prometheus exemplars end-to-end so a spike on a latency histogram links directly to the slow trace in Tempo — covering instrumentation, OpenMetrics exposition, storage, and Grafana exemplar links.
- Target user: Engineers connecting metrics to traces for faster root-cause analysis of latency outliers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an observability engineer who has built metrics-to-traces correlation so that on-call engineers can click a p99 spike and land on the exact slow request.

I will provide:
- My instrumentation stack (Prometheus client library, language, OTel or native)
- Histogram/summary metrics where I care about outliers
- Tracing backend (Tempo, Jaeger) and trace ID propagation
- Prometheus version and storage config

Your job:
1. **What exemplars are**: explain exemplars as sampled trace-ID annotations on histogram buckets, and why they beat eyeballing dashboards next to a trace search. Clarify that they ride the OpenMetrics exposition format, not the classic Prometheus text format.
2. **Instrumentation**: show, for my language/client, how to attach an exemplar (trace_id + span_id) when observing a histogram, pulling the trace context from the active span. Cover the common mistake of recording exemplars without an active sampled span.
3. **Exposition & scrape**: make sure my client serves OpenMetrics when the scraper sends `Accept: application/openmetrics-text`, and enable exemplar storage on the Prometheus side (`--enable-feature=exemplar-storage`). Note that exemplar storage is an in-memory, fixed-size circular buffer, and explain the eviction behavior (the oldest exemplars are overwritten once the buffer is full).
4. **Storage sizing**: recommend a value for `storage.exemplars.max_exemplars` in the Prometheus configuration file (or the equivalent for my version) based on my series count, and explain the tradeoff between exemplar retention and memory.
5. **Grafana linking**: configure the Prometheus data source's exemplar settings (`exemplarTraceIdDestinations`) with an internal link to the Tempo data source so the trace ID renders as a clickable jump. Show the data-source JSON.
6. **Sampling interplay**: reconcile head/tail trace sampling with exemplars: if the linked trace was sampled out, the link resolves to a "trace not found" error. Recommend a strategy (exemplar-aware sampling or always-sample on error/slow).
7. **Validation**: a query against `/api/v1/query_exemplars` for `<metric>` that returns exemplars in the API response, and a checklist to confirm exemplars appear on the panel and resolve to real traces.
Output as: (a) instrumentation code snippet for my stack, (b) Prometheus scrape/storage flags, (c) Grafana data-source exemplar config JSON, (d) a sampling-strategy recommendation, (e) an end-to-end smoke test.

Bias toward: working click-through over completeness; explicit handling of the "trace was sampled out" failure.
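
For orientation only, and not part of the prompt text: a minimal sketch of the kind of smoke-test helper that output (e) might contain. It assumes Python with the standard library only, a Prometheus server at a hypothetical `http://localhost:9090`, a hypothetical metric name, and exemplars labelled `trace_id` (your client may use `traceID` or another name). The sample payload is invented; only its shape follows the documented `/api/v1/query_exemplars` response.

```python
import json
from typing import List
from urllib.parse import urlencode


def exemplar_query_url(base_url: str, query: str, start: float, end: float) -> str:
    """Build a /api/v1/query_exemplars URL for a PromQL selector and time range."""
    params = urlencode({"query": query, "start": start, "end": end})
    return f"{base_url.rstrip('/')}/api/v1/query_exemplars?{params}"


def trace_ids(response_body: str, label: str = "trace_id") -> List[str]:
    """Collect exemplar trace IDs from a query_exemplars JSON response body.

    `label` is an assumption: use whatever label name your instrumentation
    attaches to exemplars.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"exemplar query failed: {payload.get('error', 'unknown error')}")
    found = []
    for series in payload.get("data", []):
        for exemplar in series.get("exemplars", []):
            value = exemplar.get("labels", {}).get(label)
            if value:
                found.append(value)
    return found


# Invented response for illustration; the metric name and IDs are hypothetical.
SAMPLE = json.dumps({
    "status": "success",
    "data": [
        {
            "seriesLabels": {"__name__": "http_request_duration_seconds_bucket", "le": "2.5"},
            "exemplars": [
                {
                    "labels": {
                        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                        "span_id": "00f067aa0ba902b7",
                    },
                    "value": "1.87",
                    "timestamp": 1700000123.456,
                },
                # An exemplar without a trace label is skipped by trace_ids().
                {"labels": {}, "value": "0.02", "timestamp": 1700000124.0},
            ],
        }
    ],
})

if __name__ == "__main__":
    print(exemplar_query_url(
        "http://localhost:9090", "http_request_duration_seconds_bucket",
        1700000000, 1700000600,
    ))
    print(trace_ids(SAMPLE))
```

In a real check you would fetch the URL, pass the body to `trace_ids`, and then look each ID up in the tracing backend; an empty list points at instrumentation, exposition, or exemplar storage, while IDs that fail to resolve point at the "trace was sampled out" case.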