Distributed Tracing With Grafana Tempo Alongside Prometheus

Prometheus is brilliant at telling you that checkout p99 jumped to two seconds. It is useless at telling you which downstream call ate that time. For years I closed that gap by grepping logs and guessing. Grafana Tempo, sitting next to Prometheus, closes it properly: you click a spike on a latency graph and land on the actual trace that produced it.

This is how I run Tempo as a companion to Prometheus, not a replacement for it.

Why Tempo and not “just more metrics”

You could add more histogram buckets and per-dependency timers forever, and you’d still be approximating. Traces are the ground truth of a single request’s path through your system. Tempo’s pitch is specifically that it’s cheap to operate: it stores spans in object storage (S3, GCS) and looks traces up by ID instead of maintaining a heavyweight search index, so you can keep a lot of traces for little money. That cost profile is what makes it realistic to instrument every request and decide afterwards which traces are worth keeping.

The division of labor I aim for:

Prometheus answers “is it bad, and how bad” across all requests (RED metrics).
Tempo answers “for this slow request, where did the time go.”
Exemplars are the bridge between them.

Getting spans in: the same Collector you already run

If you’ve deployed the OpenTelemetry Collector for metrics, traces are nearly free — it’s another pipeline in the same process:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

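processors:
  # Every processor named in a pipeline must also be configured here,
  # otherwise the Collector refuses to start.
  memory_limiter:
    check_interval: 1s
    limit_mib: 512      # example value; size it to the Collector's memory
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true    # plaintext; acceptable only on a trusted network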

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]

Your instrumented services emit OTLP spans, the Collector batches them, Tempo stores them. No separate agent, no second protocol to babysit.
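On the service side, pointing an OpenTelemetry SDK at the Collector is usually a matter of a few standard environment variables. A minimal sketch as a Kubernetes container fragment, assuming a service called checkout and a Collector reachable at otel-collector:4317 (both names, and the image, are placeholders rather than anything from this setup):

```yaml
# Fragment of a Pod spec; service name, image and Collector address are hypothetical.
containers:
  - name: checkout
    image: example.com/checkout:latest
    env:
      - name: OTEL_SERVICE_NAME            # shows up as the service name on spans
        value: checkout
      - name: OTEL_TRACES_EXPORTER
        value: otlp
      - name: OTEL_EXPORTER_OTLP_PROTOCOL
        value: grpc
      - name: OTEL_EXPORTER_OTLP_ENDPOINT  # the Collector's OTLP gRPC receiver
        value: http://otel-collector:4317
```

Not every SDK honors every variable in the same way, so check the one you use before relying on this.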

Exemplars: the click that saves the night

The feature that makes this combo worth it is exemplars. An exemplar is a trace ID attached to a metric sample — so a point on your latency histogram knows about an example request that produced it. In Grafana, that turns into a little diamond on the graph you can click to open the trace.

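To get them, three things have to line up: your instrumentation must attach exemplars to its histograms (many OpenTelemetry SDKs do this for measurements recorded inside a sampled span), the metrics must reach Prometheus in a format that carries exemplars (OpenMetrics on scrape, or remote write with exemplars enabled), and Prometheus must store them. Storage sits behind a feature flag: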

# prometheus startup flag
--enable-feature=exemplar-storage

# prometheus.yml
storage:
  exemplars:
    max_exemplars: 100000

Then a histogram query in Grafana, with the Exemplars toggle switched on in the query options, shows exemplar diamonds. The workflow becomes: spot the p99 spike in a Prometheus panel, click the highest exemplar diamond on it, read the trace in Tempo. Two clicks from “something is slow” to “this span took 1.8s waiting on the payments gRPC call.”
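For the diamond to open a trace, Grafana's Prometheus data source has to know which Tempo data source the trace ID belongs to. A sketch of a provisioning file for that link, assuming the exemplar label is called trace_id and that the data source names, UIDs and URLs below (all placeholders) match your environment:

```yaml
# Grafana data source provisioning; UIDs, URLs and the label name are assumptions.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label that carries the trace ID
          datasourceUid: tempo  # open matching traces in the Tempo data source
```

If your exemplars use a different label (some pipelines emit traceID), change name to match, otherwise the diamonds appear but do not link anywhere.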

Generate metrics from traces

Tempo’s metrics-generator can produce RED metrics and a service graph directly from spans, then remote-write them into Prometheus:

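# tempo.yaml
metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal   # local WAL for the generator; example path
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

# Processors are switched on per tenant through overrides, not with an
# "enabled" key. Recent Tempo releases use the form below; older ones use
# a flat metrics_generator_processors list, so check your version.
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

# Prometheus must be started with --web.enable-remote-write-receiver
# to accept writes on /api/v1/write.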

This gives you traces_spanmetrics_latency and a service-graph view of who-calls-whom, derived from real traffic rather than a hand-drawn architecture diagram that’s six months stale. Query it like any other Prometheus metric:

histogram_quantile(0.99,
  sum by (le, service) (
    rate(traces_spanmetrics_latency_bucket[5m])
  )
)

Sampling: keep what’s interesting

You almost never want to store 100% of traces at scale. Tail sampling — deciding after a trace completes — lets you keep the ones that matter: anything with an error, anything slow, and a small baseline of normal traffic for context.

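processors:
  tail_sampling:
    decision_wait: 10s   # how long to buffer a trace before deciding; example value
    policies:            # a trace is kept if any policy below matches
      - name: errors     # keep every trace that contains an error span
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow       # keep traces that took longer than 500 ms
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline   # keep 5% of everything else for context
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }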

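Tail sampling has to happen on a Collector that sees the whole trace, so it lives on the gateway tier, not the per-node agents. With more than one gateway replica, the agents also have to send every span of a given trace to the same replica, which is what the Collector’s loadbalancing exporter keyed on trace ID is for. That’s the one topology constraint people trip on.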
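When the gateway tier has several replicas, the per-node agents need to keep each trace on a single replica. A sketch of the agent-side exporter, assuming the gateways sit behind a headless DNS name (otel-gateway-headless is a placeholder) and listen for OTLP gRPC on 4317:

```yaml
# Agent-tier Collector: spread traces across gateways, keeping each trace whole.
exporters:
  loadbalancing:
    routing_key: traceID        # all spans with the same trace ID go to one backend
    protocol:
      otlp:
        tls:
          insecure: true        # plaintext; acceptable only on a trusted network
    resolver:
      dns:
        hostname: otel-gateway-headless   # hypothetical name resolving to every gateway
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The loadbalancing exporter ships in the Collector contrib distribution, so this assumes the agents run that build.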

What to actually alert on

Resist the urge to alert on traces directly. Alert on the metrics (Prometheus), and use traces to investigate the page. The healthy pattern:

Prometheus alert fires on RED metrics — error rate or p99 latency over an SLO threshold.
The alert links to a Grafana dashboard with exemplars enabled.
The on-call clicks an exemplar and reads the offending trace.
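The first step of that pattern could be written as a Prometheus alerting rule along these lines. This is a sketch rather than a recommended SLO: it assumes the span metrics from the metrics-generator are the latency source, a service labelled checkout, a two-second threshold, and a placeholder dashboard URL.

```yaml
# Prometheus rule file; service name, threshold, duration and URL are illustrative.
groups:
  - name: checkout-slo
    rules:
      - alert: CheckoutP99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(traces_spanmetrics_latency_bucket{service="checkout"}[5m])
            )
          ) > 2
        for: 10m                # must hold for 10 minutes before paging
        labels:
          severity: page
        annotations:
          summary: "checkout p99 latency above 2s for 10 minutes"
          dashboard: "https://grafana.example.com/d/checkout-latency"  # hypothetical link
```

The dashboard annotation is what carries the on-call from the page to a panel with exemplars switched on.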

That keeps your alerting deterministic and cheap while still giving you depth on demand. If you’re tuning where alerts go, our monitoring alert routing write-up pairs well with this.

Retention and cost reality

Tempo stores its blocks in object storage, so retention is a configuration value, not a disk-sizing exercise: the compactor deletes blocks older than block_retention, and a bucket lifecycle rule is at most a backstop behind it. I keep error and slow traces longer than baseline by routing them to different storage tiers where the backend supports it. Thirty days of “interesting” traces and a few days of everything else covers most debugging, and the bill stays sane.
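As a sketch of the thirty-day figure expressed in Tempo's own configuration, assuming a single tenant and that the compactor's block_retention setting is what enforces retention in your version (the value is illustrative):

```yaml
# tempo.yaml
compactor:
  compaction:
    block_retention: 720h   # 30 days; blocks older than this are deleted
```

Keeping different classes of trace for different lengths of time needs more than this one setting, for example separate tenants with their own retention overrides, and depends on how your deployment is laid out.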

The bottom line

Tempo doesn’t replace Prometheus — it answers the question Prometheus can’t. Send spans through the Collector you already run, turn on exemplars so a metric spike is one click from its trace, generate RED metrics and a service graph from real traffic, and sample on the tail so you keep what’s worth keeping. For more on the metrics half of this stack, start at the Prometheus & Monitoring category.

Tracing configurations differ across Tempo and Collector versions. Validate against your deployment and the official Grafana docs before production use.