Prometheus Exemplars and Trace Links: Metrics to Traces

The most frustrating moment in observability is staring at a p99 latency spike on a histogram and having no way to ask “which request was that?” Metrics aggregate away the individual events by design — that’s what makes them cheap. Exemplars are the escape hatch: they attach a sample trace ID to a metric bucket so you can click from the aggregate straight to a concrete slow request. Once I wired exemplars up, my mean-time-to-cause dropped noticeably, because I stopped guessing.

What an exemplar actually is

An exemplar is a single example observation attached to a histogram bucket, carrying labels — most usefully a trace_id. When your instrumentation records a request that took 1.8 seconds, it bumps the appropriate histogram bucket and records “by the way, here’s a trace ID for one such request.” Prometheus stores these alongside the metric, and Grafana renders them as little diamonds on your graph that link out to the trace.

Crucially, exemplars are sampled, not exhaustive. You don’t get a trace ID for every request — you get representative ones. That keeps the cost bounded while still giving you a jump-off point for any visible spike.
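To make the "sampled, not exhaustive" point concrete, here is a minimal sketch of a histogram that counts every observation but keeps at most one exemplar per bucket. It is a toy model for illustration, not the Prometheus client library's implementation; the Histogram and Exemplar types and the bucket bounds are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// Exemplar is one sampled observation and the trace it came from.
type Exemplar struct {
	TraceID string
	Value   float64
}

// Histogram is a toy model: counts[i] is the number of observations that
// landed in bucket i, and exemplars[i] holds at most one sample for it.
type Histogram struct {
	bounds    []float64 // ascending upper bounds; an implicit +Inf bucket follows
	counts    []uint64
	exemplars []*Exemplar
}

func NewHistogram(bounds []float64) *Histogram {
	n := len(bounds) + 1 // one extra slot for the +Inf bucket
	return &Histogram{
		bounds:    bounds,
		counts:    make([]uint64, n),
		exemplars: make([]*Exemplar, n),
	}
}

// Observe counts every value but keeps only the most recent exemplar per
// bucket, which is why exemplars are samples rather than a full record.
func (h *Histogram) Observe(v float64, traceID string) {
	i := sort.SearchFloat64s(h.bounds, v) // first bucket whose bound is >= v
	h.counts[i]++
	if traceID != "" {
		h.exemplars[i] = &Exemplar{TraceID: traceID, Value: v}
	}
}

func main() {
	h := NewHistogram([]float64{0.5, 1, 2, 5})
	h.Observe(1.2, "trace-a")
	h.Observe(1.8, "trace-b") // same bucket (le=2): replaces trace-a
	h.Observe(0.1, "")        // counted, but no trace to attach
	for i, ex := range h.exemplars {
		if ex != nil {
			fmt.Printf("bucket %d: count=%d exemplar=%s (%.1fs)\n", i, h.counts[i], ex.TraceID, ex.Value)
		}
	}
}
```

The cost stays bounded because the exemplar slots scale with the number of buckets, not with the number of requests.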

Enabling exemplar storage in Prometheus

Exemplar storage is off by default. Start Prometheus with --enable-feature=exemplar-storage, then size the store in the config:

# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000

The store is a fixed-size circular buffer shared across all series: max_exemplars caps memory, and the oldest exemplars are overwritten as new ones arrive. You’re not keeping them forever; you’re keeping enough recent ones to investigate a live spike.
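The circular-buffer behavior is easy to picture with a small sketch. This is not Prometheus’s storage code, only an illustration of how a fixed-capacity ring overwrites its oldest entries; the Ring type and its methods are hypothetical names.

```go
package main

import "fmt"

// Exemplar is one stored sample.
type Exemplar struct {
	TraceID string
	Value   float64
}

// Ring is a fixed-capacity circular buffer. Once full, each Add overwrites
// the oldest entry, so memory never grows past the configured capacity.
type Ring struct {
	buf  []Exemplar
	next int // index the next Add writes to
	size int // number of valid entries, capped at len(buf)
}

// NewRing allocates the buffer up front; capacity must be greater than zero.
func NewRing(capacity int) *Ring {
	return &Ring{buf: make([]Exemplar, capacity)}
}

func (r *Ring) Add(e Exemplar) {
	r.buf[r.next] = e
	r.next = (r.next + 1) % len(r.buf)
	if r.size < len(r.buf) {
		r.size++
	}
}

// Snapshot returns the retained exemplars, oldest first.
func (r *Ring) Snapshot() []Exemplar {
	out := make([]Exemplar, 0, r.size)
	start := (r.next - r.size + len(r.buf)) % len(r.buf)
	for i := 0; i < r.size; i++ {
		out = append(out, r.buf[(start+i)%len(r.buf)])
	}
	return out
}

func main() {
	r := NewRing(3)
	for _, id := range []string{"a", "b", "c", "d"} {
		r.Add(Exemplar{TraceID: id, Value: 1.0})
	}
	// "a" was the oldest entry and has been overwritten by "d".
	for _, e := range r.Snapshot() {
		fmt.Println(e.TraceID)
	}
}
```

The practical consequence: how far back you can click depends on how fast exemplars arrive, not on a time-based retention setting.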

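Exemplars travel only in the OpenMetrics exposition format, so the target has to serve it. Prometheus requests OpenMetrics in the Accept header of each scrape, but the target must honor that request. With the Go client, that means serving /metrics through promhttp.HandlerFor with HandlerOpts.EnableOpenMetrics set to true; otherwise the handler falls back to the classic text format and the exemplars are silently dropped.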
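One way to confirm that exemplars are actually reaching storage is Prometheus’s /api/v1/query_exemplars HTTP endpoint. The sketch below assumes a server at http://localhost:9090 and the metric name used in this article; ParseTraceIDs and FetchTraceIDs are hypothetical helper names, and the demo in main parses a canned payload so it runs without a server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

// exemplarResponse mirrors the JSON shape returned by /api/v1/query_exemplars.
type exemplarResponse struct {
	Status string `json:"status"`
	Data   []struct {
		SeriesLabels map[string]string `json:"seriesLabels"`
		Exemplars    []struct {
			Labels    map[string]string `json:"labels"`
			Value     string            `json:"value"`
			Timestamp float64           `json:"timestamp"`
		} `json:"exemplars"`
	} `json:"data"`
}

// ParseTraceIDs extracts every trace_id exemplar label from a response body.
func ParseTraceIDs(body []byte) ([]string, error) {
	var resp exemplarResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("prometheus returned status %q", resp.Status)
	}
	var ids []string
	for _, series := range resp.Data {
		for _, ex := range series.Exemplars {
			if id, ok := ex.Labels["trace_id"]; ok {
				ids = append(ids, id)
			}
		}
	}
	return ids, nil
}

// FetchTraceIDs queries a Prometheus server for exemplars in [start, end].
func FetchTraceIDs(baseURL, query string, start, end time.Time) ([]string, error) {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", strconv.FormatInt(start.Unix(), 10))
	params.Set("end", strconv.FormatInt(end.Unix(), 10))
	resp, err := http.Get(baseURL + "/api/v1/query_exemplars?" + params.Encode())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return ParseTraceIDs(body)
}

func main() {
	// Offline demo with a canned payload. Against a live server, call e.g.
	// FetchTraceIDs("http://localhost:9090", "http_request_duration_seconds_bucket", start, end).
	sample := []byte(`{"status":"success","data":[{"seriesLabels":{"le":"2.0"},
		"exemplars":[{"labels":{"trace_id":"a1b2c3d4"},"value":"1.83","timestamp":1718370000}]}]}`)
	ids, err := ParseTraceIDs(sample)
	fmt.Println(ids, err)
}
```

An empty result for a metric you know is busy usually points at one of the earlier steps: storage not enabled, or the target not serving OpenMetrics.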

Emitting exemplars from your app

The application has to attach the trace ID when it observes the histogram. With the Go client and OpenTelemetry trace context:

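// traceID comes from the active span, e.g.
//   sc := trace.SpanContextFromContext(ctx) // go.opentelemetry.io/otel/trace
//   traceID := sc.TraceID().String()
// hist is a prometheus.Histogram (or an Observer from a HistogramVec).
hist.(prometheus.ExemplarObserver).ObserveWithExemplar(
    elapsed.Seconds(),
    prometheus.Labels{"trace_id": traceID},
)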

In the OpenMetrics text exposition this surfaces as the # exemplar suffix on a bucket line:

http_request_duration_seconds_bucket{le="2.0"} 1027 # {trace_id="a1b2c3d4"} 1.83 1718370000

That trailing fragment — {trace_id=...} <value> <timestamp> — is the exemplar. The value is the actual observation, and the trace_id is the link target.
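To make the anatomy of that suffix concrete, here is a small parser sketch for a single exposition line. It assumes simple label values (no commas, braces, or " # " inside quoted strings), so treat it as an illustration rather than a spec-complete OpenMetrics parser; ParseExemplar is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Exemplar holds the three parts of the suffix: labels, value, timestamp.
type Exemplar struct {
	Labels    map[string]string
	Value     float64
	Timestamp float64 // seconds since the epoch; 0 when the line omits it
}

// ParseExemplar returns the exemplar on a sample line, or ok=false when the
// line carries none.
func ParseExemplar(line string) (Exemplar, bool, error) {
	_, suffix, found := strings.Cut(line, " # ")
	if !found {
		return Exemplar{}, false, nil
	}
	end := strings.Index(suffix, "}")
	if !strings.HasPrefix(suffix, "{") || end < 0 {
		return Exemplar{}, false, fmt.Errorf("malformed exemplar labels in %q", suffix)
	}
	ex := Exemplar{Labels: map[string]string{}}
	for _, pair := range strings.Split(suffix[1:end], ",") {
		if pair == "" {
			continue // empty label set or trailing comma
		}
		name, quoted, ok := strings.Cut(pair, "=")
		if !ok {
			return Exemplar{}, false, fmt.Errorf("malformed label %q", pair)
		}
		value, err := strconv.Unquote(quoted)
		if err != nil {
			return Exemplar{}, false, fmt.Errorf("label %q: %w", name, err)
		}
		ex.Labels[name] = value
	}
	fields := strings.Fields(suffix[end+1:])
	if len(fields) == 0 || len(fields) > 2 {
		return Exemplar{}, false, fmt.Errorf("expected value and optional timestamp, got %q", suffix[end+1:])
	}
	var err error
	if ex.Value, err = strconv.ParseFloat(fields[0], 64); err != nil {
		return Exemplar{}, false, err
	}
	if len(fields) == 2 {
		if ex.Timestamp, err = strconv.ParseFloat(fields[1], 64); err != nil {
			return Exemplar{}, false, err
		}
	}
	return ex, true, nil
}

func main() {
	line := `http_request_duration_seconds_bucket{le="2.0"} 1027 # {trace_id="a1b2c3d4"} 1.83 1718370000`
	ex, ok, err := ParseExemplar(line)
	fmt.Println(ex.Labels["trace_id"], ex.Value, ok, err)
}
```

Note that the bucket count (1027) and the exemplar value (1.83) are different things: the first is how many requests fell at or under the bound, the second is the duration of the one sampled request.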

If you instrument with the OpenTelemetry SDK and export to Prometheus, exemplar attachment from the active span context is largely automatic — the SDK pulls the current trace ID and attaches it when the sampler decides to record one.

Wiring the click-through in Grafana

This is where it pays off. Configure the Prometheus datasource to map the trace_id exemplar label to your tracing backend:

# Grafana datasource provisioning
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo      # your Tempo/Jaeger datasource UID

Now, on any histogram-based panel, enable “Exemplars” in the panel’s query options. Grafana draws diamond markers on the graph. Hover one and you see the trace_id; click it and Grafana opens that exact trace in Tempo. You’ve gone from “p99 is 1.8s” to “here is the span tree of one 1.8s request” in a single click.

The query side: histograms make this possible

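In OpenMetrics, exemplars can ride on histogram buckets and counters; gauges and summaries can’t carry them. For latency the histogram is the one that matters, because each bucket carries its own exemplar and slow requests surface in the slow buckets. The standard p99 query that the exemplars decorate: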

histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

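When this line spikes, the exemplar diamonds sitting near the top of the spike are your sampled slow requests. Click the highest one and you’re looking at the slowest request that was sampled in that window. It isn’t guaranteed to be the absolute worst case, but it is a real request from inside the spike.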

What this changes about debugging

Before exemplars, a latency investigation went: see spike, guess at a cause, add logging, redeploy, wait for it to happen again, read logs. Hours, sometimes days.

With exemplars: see spike, click diamond, read the trace, see that 1.4s of the 1.8s was spent in a downstream call to the inventory service waiting on a slow query. Minutes. The trace tells you where the time went, which the aggregate metric structurally cannot.

Practical gotchas

A few things that tripped me up:

Sampling alignment. If your trace sampler drops 99% of traces, the exemplar trace_id you click may point at a trace that wasn’t kept. Use tail-based sampling or bias the sampler to keep slow requests so your exemplars resolve to real traces.
Histogram required. Summaries (client-computed quantiles) can’t carry exemplars. If your latency metric is a summary, you can’t do this — switch to a histogram.
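Memory is bounded but real. max_exemplars sizes a ring buffer; size it for your investigation window, not infinity.
Cardinality of exemplar labels. Keep exemplar labels minimal: trace_id and maybe one more. OpenMetrics caps the combined length of exemplar label names and values at 128 characters, and the Go client enforces that limit. They’re not for slicing, just for linking.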
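The sampling-alignment gotcha comes down to one decision: always keep slow traces, and sample the rest. Here is a minimal sketch of that decision, made after the request finishes. KeepTrace and its parameters are hypothetical and are not part of the OpenTelemetry API; in practice this logic usually lives in your collector’s tail-sampling configuration rather than in application code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// KeepTrace decides whether to retain a finished trace. Anything at or above
// slowThresholdSeconds is always kept, so exemplars from slow buckets resolve
// to real traces. Faster traces are kept with probability baseRate.
func KeepTrace(traceID string, elapsedSeconds, slowThresholdSeconds, baseRate float64) bool {
	if elapsedSeconds >= slowThresholdSeconds {
		return true
	}
	// Hash the trace ID so the decision is deterministic: every service that
	// sees the same trace ID reaches the same verdict.
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return float64(h.Sum32())/float64(1<<32) < baseRate
}

func main() {
	fmt.Println(KeepTrace("a1b2c3d4", 1.83, 1.0, 0.01)) // slow: always kept
	fmt.Println(KeepTrace("a1b2c3d4", 0.02, 1.0, 0.0))  // fast, rate 0: dropped
}
```

Set the threshold at or below the lowest bucket bound you expect to investigate, so any diamond worth clicking points at a trace that was kept.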

Exemplars are one of those features that sound minor and change your whole debugging loop. The wiring is an afternoon; the payoff is every latency investigation thereafter.

Exemplar support and Grafana wiring differ across versions. Verify the storage flag and datasource fields against your installed versions.