Grafana Error Guide: 'trace not found'

Overview

You click a trace ID in Grafana Explore, or follow a trace-to-logs link, and the Tempo datasource returns nothing. Grafana renders “Trace not found” and the underlying HTTP call to Tempo’s query API returns a 404:

failed to get trace with id: 4bf92f3577b34da6a3ce929d0e0e4736 Status: 404 Not Found Body: trace not found

Queried directly against the Tempo API, the same failure looks like this:

GET /api/traces/4bf92f3577b34da6a3ce929d0e0e4736 -> 404
trace not found

The single most common cause is sampling: a head sampler in the SDK or a tail sampler in your collector decided this trace was not interesting and never exported the spans, so Tempo never ingested it. The other causes are timing (the trace is still in the ingester and not yet flushed to object storage), retention (the compactor deleted the block once it aged past block_retention), a malformed trace ID or wrong tenant, or a misconfigured storage backend. This guide separates three cases: the trace never existed, it existed but is gone, or it exists but not where you looked.

Symptoms

Grafana Explore shows “Trace not found” for a specific ID while other traces resolve fine.
Recent traces (seconds old) 404, then appear a minute or two later.
curl /api/traces/<id> returns 404 trace not found but TraceQL search still lists nearby traces.
Only a fraction of your requests produce retrievable traces (classic sampling signature).
Old traces 404 while new ones work — a retention/block_retention boundary.

Common Root Causes

1. Sampling dropped the trace (it was never ingested)

If your OpenTelemetry Collector or SDK sampled the request out, the spans were never sent to Tempo. No amount of querying will find them.

# otel-collector: tail sampling only keeps errors + slow traces
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 500 }

collector debug: traceID=4bf9... sampling_decision=NOT_SAMPLED policy=tail_sampling

A successful, fast request matches no policy and is dropped. This is the #1 cause of trace not found.
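To make the drop decision concrete, here is a minimal sketch that mirrors the two example policies above. The function name `tail_sampling_keeps` is hypothetical, and treating the latency threshold as a strict "greater than" is an assumption; the real processor evaluates whole traces after decision_wait.

```shell
# Hypothetical helper mirroring the example policies (errors + slow).
# $1 = trace status (OK or ERROR), $2 = end-to-end latency in milliseconds.
tail_sampling_keeps() {
  status=$1
  latency_ms=$2
  threshold_ms=500   # matches threshold_ms in the example config
  if [ "$status" = "ERROR" ]; then
    echo "kept: errors"
  elif [ "$latency_ms" -gt "$threshold_ms" ]; then
    # Assumption: strictly slower than the threshold counts as slow.
    echo "kept: slow"
  else
    echo "dropped"
  fi
}

tail_sampling_keeps OK 42      # fast success: matches no policy
tail_sampling_keeps ERROR 42   # error: kept regardless of latency
tail_sampling_keeps OK 900     # slow success: kept by the latency policy
```

A trace that prints "dropped" here is one that will 404 in Tempo no matter how long you wait.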

2. Ingester has not flushed to backend yet

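Direct ID lookups hit the backend blocks. A brand-new trace lives in the ingester until its block is cut (after max_block_duration at the latest) and flushed to object storage; complete_block_timeout only controls how long the ingester keeps that block after the flush.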

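ingester:
  trace_idle_period: 10s        # no new spans for this long: trace is flushed to the WAL
  max_block_duration: 5m        # cut the head block after this long
  complete_block_timeout: 15m   # keep a flushed block in the ingester this long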

level=info msg="completing block" ingester traceID=4bf9... flushed=false

Until the block is cut and flushed, a cold ID lookup can 404 even though search finds the recent trace.
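If you suspect a flush delay rather than a missing trace, a short polling loop can tell the two apart. The sketch below is illustrative only: `fetch_status` and `wait_for_trace` are hypothetical helper names, and `TEMPO_URL` (defaulting to the address used elsewhere in this guide) is an assumption about your setup.

```shell
# Return the HTTP status code of a direct trace lookup.
# TEMPO_URL is an assumption; point it at your Tempo query endpoint.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    "${TEMPO_URL:-http://tempo:3200}/api/traces/$1"
}

# Poll until the trace resolves or the attempts run out.
# $1 = trace ID, $2 = maximum attempts, $3 = seconds to sleep between attempts.
wait_for_trace() {
  attempt=1
  while [ "$attempt" -le "$2" ]; do
    if [ "$(fetch_status "$1")" = "200" ]; then
      echo "found after $attempt attempt(s)"
      return 0
    fi
    # Sleep only when another attempt is still to come.
    if [ "$attempt" -lt "$2" ]; then
      sleep "$3"
    fi
    attempt=$((attempt + 1))
  done
  echo "not found after $2 attempt(s)"
  return 1
}
```

For example, `wait_for_trace 4bf92f3577b34da6a3ce929d0e0e4736 12 10` polls for roughly two minutes. A trace that is still missing after that is more likely sampled out, expired, or queried under the wrong tenant than merely un-flushed.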

3. Trace outside block_retention (compacted away)

Tempo’s compactor deletes blocks older than block_retention (default 336h, 14 days).

compactor:
  compaction:
    block_retention: 336h   # 14 days
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com

level=info component=compactor msg="deleting block" block=abcd age=337h retention=336h
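A quick arithmetic check shows whether a trace's age puts it past the retention boundary. This is a rough sketch: `retention_check` is a hypothetical helper, and it compares the trace's own timestamp against block_retention, whereas the compactor works on whole blocks, so treat the result as approximate near the boundary.

```shell
# Hypothetical helper: compare a trace's age with block_retention.
# $1 = trace start (Unix seconds), $2 = now (Unix seconds), $3 = retention in hours.
retention_check() {
  age_h=$(( ($2 - $1) / 3600 ))   # whole hours, rounded down
  if [ "$age_h" -ge "$3" ]; then
    echo "expired: age ${age_h}h >= retention ${3}h"
  else
    echo "retained: age ${age_h}h < retention ${3}h"
  fi
}

# 1213200 seconds apart = 337 hours, just past the 336h default.
retention_check 1700000000 1701213200 336
```

In practice you would pass `$(date +%s)` as the second argument and take the first from the timestamp of the log line that carried the trace ID.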

4. Malformed trace ID or wrong tenant

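Trace IDs are hexadecimal strings, normally 32 characters (128 bits). An ID that lost characters on the way from a UI copy (for example trimmed leading zeros), or a query sent with the wrong X-Scope-OrgID tenant, yields a clean 404.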

curl -s -H 'X-Scope-OrgID: team-a' http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736 -o /dev/null -w '%{http_code}\n'

404   # correct id, wrong tenant -> not found
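Before blaming Tempo, it can help to validate and normalize the ID you are about to query. The following is a minimal sketch under stated assumptions: `normalize_trace_id` is a hypothetical helper, and left-padding a short ID with zeros to 32 characters is an assumption about how the ID was shortened, not a guaranteed recovery.

```shell
# Hypothetical helper: validate a trace ID and normalize it to
# 32 lowercase hex characters, left-padding with zeros if it is short.
normalize_trace_id() {
  id=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  case "$id" in
    ''|*[!0-9a-f]*)
      echo "invalid: not hexadecimal" >&2
      return 1
      ;;
  esac
  if [ "${#id}" -gt 32 ]; then
    echo "invalid: longer than 32 hex characters" >&2
    return 1
  fi
  # Assumption: missing characters are trimmed leading zeros.
  while [ "${#id}" -lt 32 ]; do
    id="0$id"
  done
  printf '%s\n' "$id"
}

normalize_trace_id 4BF92F3577B34DA6A3CE929D0E0E4736
```

The call above prints the lowercase form of the ID used throughout this guide. Anything with stray whitespace, a non-hex character, or more than 32 characters is rejected before it reaches the API.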

Diagnostic Workflow

Step 1 — Hit the Tempo query API directly, bypassing Grafana:

curl -s -H 'X-Scope-OrgID: <tenant>' \
  http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736 \
  -o /dev/null -w 'HTTP %{http_code}\n'

Step 2 — Search recent traces with TraceQL to see whether ingestion is happening at all:

curl -sG http://tempo:3200/api/search \
  --data-urlencode 'q={ resource.service.name = "checkout" }' \
  --data-urlencode "start=$(date -d '-15 min' +%s)" \
  --data-urlencode "end=$(date +%s)" | jq '.traces | length'

Step 3 — Check whether the collector is sampling it out:

kubectl logs deploy/otel-collector -n observability | grep -i "sampling_decision"

Step 4 — Confirm the trace is not simply un-flushed by inspecting ingester logs and metrics:

kubectl logs deploy/tempo -n tracing | grep -i "completing block"
curl -s http://tempo:3200/metrics | grep -E 'tempo_ingester_blocks_flushed_total|tempo_request_duration'

Step 5 — Validate backend storage is reachable (S3/GCS creds and bucket):

kubectl logs deploy/tempo -n tracing | grep -iE 'storage|s3|bucket|access denied'
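The first three steps can be folded into a rough decision helper. This is a sketch, not an authoritative procedure: `triage` is a hypothetical function, its inputs are values you collect by hand from the commands above, and the five-minute "very recent" cut-off is an arbitrary assumption.

```shell
# Hypothetical triage helper; inputs come from the workflow steps.
# $1 = HTTP code from the direct lookup (Step 1)
# $2 = number of traces returned by the TraceQL search (Step 2)
# $3 = collector decision for this ID: SAMPLED, NOT_SAMPLED, or UNKNOWN (Step 3)
# $4 = approximate trace age in minutes
triage() {
  if [ "$1" = "200" ]; then
    echo "trace exists in Tempo: check the Grafana datasource URL and tenant"
  elif [ "$3" = "NOT_SAMPLED" ]; then
    echo "sampled out at the collector: never ingested"
  elif [ "$2" -eq 0 ]; then
    echo "no recent traces at all: check ingestion and backend storage"
  elif [ "$4" -lt 5 ]; then
    # Assumption: under five minutes old counts as possibly un-flushed.
    echo "very recent: retry after the ingester flushes"
  else
    echo "check block_retention, the X-Scope-OrgID tenant, and the trace ID format"
  fi
}

triage 404 37 NOT_SAMPLED 2
```

The ordering puts the sampling check ahead of the timing and retention checks because sampling is the most common cause; reorder the branches if your environment behaves differently.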

Example Root Cause Analysis

A developer copies a trace ID from an application log and pastes it into Grafana Explore’s Tempo datasource. Result: “Trace not found”. They assume Tempo is broken.

Running curl http://tempo:3200/api/traces/<id> directly returns 404 trace not found too, so it is not a Grafana proxy issue. A TraceQL search for the same service does return dozens of recent traces, proving ingestion works. The trace in question was a fast 200 OK on the health endpoint.

Checking the collector logs reveals sampling_decision=NOT_SAMPLED policy=tail_sampling for that trace ID — the tail sampler keeps only errors and slow requests, and this one was neither. The trace never reached Tempo. Nothing was lost or expired; it was intentionally dropped at the collector.

The fix is not on the Tempo side at all: the team adds a low-rate probabilistic policy so a baseline of “normal” traces is always retained, and documents that sampled-out traces will legitimately 404. For traces that do exist but 404 briefly right after creation, they explain the ingester flush delay to the on-call rotation.

Prevention Best Practices

Keep a baseline probabilistic sampling policy so “successful” traces are still retrievable, not only errors/slow ones.
Set block_retention to match your compliance/debugging window and monitor compactor deletion logs.
Educate users that fresh traces need a minute to flush from ingester to backend before a cold ID lookup works.
Standardize the correct X-Scope-OrgID tenant in the Grafana datasource so links resolve to the right tenant.
Alert on 5xx responses (the status_code label on tempo_request_duration_seconds) and on storage access-denied errors to catch backend misconfiguration early.
Wire trace-to-logs and derived fields carefully — see the Grafana troubleshooting guides.

Quick Command Reference

# Direct trace lookup (bypass Grafana)
curl -s -H 'X-Scope-OrgID: <tenant>' \
  http://tempo:3200/api/traces/<traceID> -o /dev/null -w 'HTTP %{http_code}\n'

# TraceQL search to confirm ingestion
curl -sG http://tempo:3200/api/search \
  --data-urlencode 'q={ resource.service.name = "checkout" }' \
  --data-urlencode "start=$(date -d '-15 min' +%s)" \
  --data-urlencode "end=$(date +%s)" | jq '.traces | length'

# Was it sampled out?
kubectl logs deploy/otel-collector -n observability | grep -i sampling_decision

# Ingester flush + backend health
kubectl logs deploy/tempo -n tracing | grep -iE 'completing block|s3|access denied'
curl -s http://tempo:3200/metrics | grep tempo_ingester_blocks_flushed_total

Conclusion

Top root causes, in order of likelihood:

Sampling dropped the trace: head sampling in the SDK or tail sampling in the collector never exported the spans, so it was never ingested.
Ingester not yet flushed: a fresh trace lives in the ingester until its block is cut and flushed to the backend; cold ID lookups 404 briefly.
Trace expired past block_retention (default 336h/14d): the compactor deleted the block.
Wrong tenant or malformed trace ID: a mismatched X-Scope-OrgID or trimmed leading hex zeros return a clean 404.
Backend storage misconfigured: the S3/GCS bucket or credentials are wrong, so no blocks are readable.
