Grafana Error Guide: 'trace not found' — Tempo datasource 404 and sampling
Fix the Tempo 'trace not found' error in Grafana: check sampling drops, ingester flush lag, block_retention expiry, backend storage, and trace ID format.
Overview
You click a trace ID in Grafana Explore, or follow a trace-to-logs link, and the Tempo datasource returns nothing. Grafana renders “Trace not found” and the underlying HTTP call to Tempo’s query API returns a 404:
failed to get trace with id: 4bf92f3577b34da6a3ce929d0e0e4736 Status: 404 Not Found Body: trace not found
Direct against the API it looks like this:
GET /api/traces/4bf92f3577b34da6a3ce929d0e0e4736 -> 404
trace not found
The single most common cause is sampling: a head or tail sampler in your collector decided this trace was not interesting and never exported the spans, so Tempo never ingested it. The other causes are timing (the trace is still in the ingester and not yet flushed to object storage), retention (the block was deleted after block_retention), a malformed trace ID, or a misconfigured storage backend. This guide separates “never existed” from “existed but gone” from “exists but not where you looked”.
Symptoms
- Grafana Explore shows “Trace not found” for a specific ID while other traces resolve fine.
- Recent traces (seconds old) 404, then appear a minute or two later.
- curl /api/traces/<id> returns 404 trace not found, but TraceQL search still lists nearby traces.
- Only a fraction of your requests produce retrievable traces (classic sampling signature).
- Old traces 404 while new ones work: a retention/block_retention boundary.
Common Root Causes
1. Sampling dropped the trace (it was never ingested)
If your OpenTelemetry Collector or SDK sampled the request out, the spans were never sent to Tempo. No amount of querying will find them.
# otel-collector: tail sampling only keeps errors + slow traces
processors:
tail_sampling:
decision_wait: 10s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow
type: latency
latency: { threshold_ms: 500 }
collector debug: traceID=4bf9... sampling_decision=NOT_SAMPLED policy=tail_sampling
A successful, fast request matches no policy and is dropped. This is the #1 cause of trace not found.
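To see why an ordinary request disappears, the two policies above can be modeled in a few lines of shell. This is a rough sketch, not the collector's actual evaluation logic: would_keep_trace is a hypothetical helper, it reduces a trace to one status and one duration, and keeping a trace that sits exactly on the threshold is an assumption.

```shell
# Simplified model of the two policies above (status_code ERROR, latency 500 ms).
# Hypothetical helper: returns 0 (keep) if either policy matches, 1 (drop) otherwise.
would_keep_trace() {
  status=$1               # e.g. OK, ERROR, UNSET
  duration_ms=$2          # end-to-end trace duration in milliseconds
  threshold_ms=${3:-500}  # mirrors threshold_ms in the config above

  if [ "$status" = "ERROR" ]; then
    return 0              # "errors" policy matches
  fi
  if [ "$duration_ms" -ge "$threshold_ms" ]; then
    return 0              # "slow" policy matches (boundary handling assumed)
  fi
  return 1                # no policy matched: the trace is dropped
}
```

For example, `would_keep_trace OK 120 || echo "dropped"` prints "dropped": a fast 200 on a health endpoint matches neither policy, so its trace ID will 404 in Tempo.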
2. Ingester has not flushed to backend yet
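A brand-new trace lives in the ingester first: it is moved to the WAL after trace_idle_period, its block is cut after max_block_duration, and only then is the block flushed to object storage. complete_block_timeout is how long the ingester keeps a block after it has been flushed, not a flush delay. Queriers normally consult the ingesters as well as the backend blocks for ID lookups, so a 404 on a fresh trace suggests the lookup only reached the backend.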
ingester:
trace_idle_period: 10s
max_block_duration: 5m
complete_block_timeout: 15m
level=info msg="completing block" ingester traceID=4bf9... flushed=false
Until the block is cut and flushed, a cold ID lookup can 404 even though search finds the recent trace.
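If flush lag is the suspect, retrying the lookup for a minute or two separates "not there yet" from "never arrived". A minimal sketch, assuming the Tempo address used elsewhere in this guide; wait_for_trace and lookup_status are hypothetical helper names, and the lookup command is passed in as an argument so a tenant header or a different URL can be swapped in.

```shell
# wait_for_trace ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# COMMAND must print an HTTP status code. Succeeds as soon as it prints 200.
wait_for_trace() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  code=""
  while [ "$i" -le "$attempts" ]; do
    code=$("$@") || true          # keep going even if the command itself fails
    if [ "$code" = "200" ]; then
      echo "found after $i attempt(s)"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "still not found after $attempts attempt(s), last status: $code" >&2
  return 1
}

# Hypothetical lookup helper; adjust host, port, and any tenant header to your setup.
lookup_status() {
  curl -s -o /dev/null -w '%{http_code}' "http://tempo:3200/api/traces/$1"
}
```

Running `wait_for_trace 12 10 lookup_status 4bf92f3577b34da6a3ce929d0e0e4736` polls for up to about two minutes. If the ID still returns 404 after that, flush lag is an unlikely explanation and sampling or tenancy deserve a closer look.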
3. Trace outside block_retention (compacted away)
Tempo’s compactor deletes blocks older than block_retention (default 336h, 14 days).
compactor:
compaction:
block_retention: 336h # 14 days
storage:
trace:
backend: s3
s3:
bucket: tempo-traces
endpoint: s3.amazonaws.com
level=info component=compactor msg="deleting block" block=abcd age=337h retention=336h
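A quick age check tells you whether a 404 on an old trace is plausibly a retention deletion. This is a sketch, assuming you know roughly when the request happened (for example from the application log timestamp). is_past_retention is a hypothetical helper, and the answer is only approximate near the boundary because the compactor deletes whole blocks on its own schedule.

```shell
# is_past_retention TRACE_EPOCH RETENTION_HOURS [NOW_EPOCH]
# Returns 0 if the trace is older than the retention window, 1 otherwise.
is_past_retention() {
  trace_epoch=$1                 # Unix timestamp of the request
  retention_hours=$2             # e.g. 336 for block_retention: 336h
  now_epoch=${3:-$(date +%s)}    # optional override, defaults to the current time
  age_seconds=$((now_epoch - trace_epoch))
  [ "$age_seconds" -gt $((retention_hours * 3600)) ]
}
```

For example, `is_past_retention "$(date -d '2024-01-01T12:00:00Z' +%s)" 336 && echo "likely expired"` uses GNU date, like the other commands in this guide, to turn a log timestamp into an epoch value.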
4. Malformed trace ID or wrong tenant
Trace IDs are 32-character hex strings. An ID whose leading zeros were trimmed by a UI copy, or a query sent with the wrong X-Scope-OrgID tenant, can yield a clean 404.
curl -s -H 'X-Scope-OrgID: team-a' http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736 -o /dev/null -w '%{http_code}\n'
404 # correct id, wrong tenant -> not found
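Normalizing the ID before querying rules out the formatting half of this cause. The helper below is a hypothetical sketch: it lowercases the input, rejects anything that is not hex, and left-pads to 32 characters. Whether your Tempo version pads short IDs on its own is not assumed here; doing it client-side simply removes the ambiguity.

```shell
# normalize_trace_id ID
# Prints a lowercase, zero-padded 32-character trace ID, or fails on invalid input.
normalize_trace_id() {
  id=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  case "$id" in
    '' | *[!0-9a-f]*)
      echo "not a hex trace ID: $1" >&2
      return 1
      ;;
  esac
  if [ "${#id}" -gt 32 ]; then
    echo "trace ID longer than 32 hex characters: $1" >&2
    return 1
  fi
  # Restore leading zeros that a copy/paste or log formatter may have trimmed.
  while [ "${#id}" -lt 32 ]; do
    id="0$id"
  done
  printf '%s\n' "$id"
}
```

A 16-character ID such as `A3CE929D0E0E4736` comes back as `0000000000000000a3ce929d0e0e4736`, which can then be passed to the curl lookup with the tenant header you expect the trace to live under.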
Diagnostic Workflow
Step 1 — Hit the Tempo query API directly, bypassing Grafana:
curl -s -H 'X-Scope-OrgID: <tenant>' \
http://tempo:3200/api/traces/4bf92f3577b34da6a3ce929d0e0e4736 \
-o /dev/null -w 'HTTP %{http_code}\n'
Step 2 — Search recent traces with TraceQL to see whether ingestion is happening at all:
curl -sG http://tempo:3200/api/search \
--data-urlencode 'q={ resource.service.name = "checkout" }' \
--data-urlencode "start=$(date -d '-15 min' +%s)" \
--data-urlencode "end=$(date +%s)" | jq '.traces | length'
Step 3 — Check whether the collector is sampling it out:
kubectl logs deploy/otel-collector -n observability | grep -i "sampling_decision"
Step 4 — Confirm the trace is not simply un-flushed by inspecting ingester logs and metrics:
kubectl logs deploy/tempo -n tracing | grep -i "completing block"
curl -s http://tempo:3200/metrics | grep -E 'tempo_ingester_blocks_flushed_total|tempo_request_duration'
Step 5 — Validate backend storage is reachable (S3/GCS creds and bucket):
kubectl logs deploy/tempo -n tracing | grep -iE 'storage|s3|bucket|access denied'
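The status code from Step 1 already narrows down which of the later steps matter. As a rough aid, the mapping can be written as a small function; explain_lookup_status is a hypothetical helper, and the hints only restate this guide's reasoning rather than anything Tempo itself reports.

```shell
# explain_lookup_status HTTP_CODE
# Maps the status code from a direct trace lookup to the next thing to check.
explain_lookup_status() {
  case "$1" in
    200) echo "found: Tempo has the trace, so check the Grafana datasource or proxy" ;;
    404) echo "not found: check sampling, flush lag, retention, tenant, and trace ID format" ;;
    5??) echo "server error: check Tempo logs and backend storage access" ;;
    *)   echo "unexpected status $1: check the URL, auth, and tenant header" ;;
  esac
}
```

It is meant to be combined with the curl call from Step 1, for example by passing in the value printed by `-w '%{http_code}'`.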
Example Root Cause Analysis
A developer copies a trace ID from an application log and pastes it into Grafana Explore’s Tempo datasource. Result: “Trace not found”. They assume Tempo is broken.
Running curl http://tempo:3200/api/traces/<id> directly returns 404 trace not found too, so it is not a Grafana proxy issue. A TraceQL search for the same service does return dozens of recent traces, proving ingestion works. The trace in question was a fast 200 OK on the health endpoint.
Checking the collector logs reveals sampling_decision=NOT_SAMPLED policy=tail_sampling for that trace ID — the tail sampler keeps only errors and slow requests, and this one was neither. The trace never reached Tempo. Nothing was lost or expired; it was intentionally dropped at the collector.
The fix is not on the Tempo side at all: the team adds a low-rate probabilistic policy so a baseline of “normal” traces is always retained, and documents that sampled-out traces will legitimately 404. For traces that do exist but 404 briefly right after creation, they explain the ingester flush delay to the on-call rotation.
Prevention Best Practices
- Keep a baseline probabilistic sampling policy so “successful” traces are still retrievable, not only errors/slow ones.
- Set block_retention to match your compliance/debugging window and monitor compactor deletion logs.
- Educate users that fresh traces need a minute to flush from ingester to backend before a cold ID lookup works.
- Standardize the correct X-Scope-OrgID tenant in the Grafana datasource so links resolve to the right tenant.
- Alert on tempo_request_duration 5xx and storage access denied to catch backend misconfiguration early.
- Wire trace-to-logs and derived fields carefully; see the Grafana troubleshooting guides.
Quick Command Reference
# Direct trace lookup (bypass Grafana)
curl -s -H 'X-Scope-OrgID: <tenant>' \
http://tempo:3200/api/traces/<traceID> -o /dev/null -w 'HTTP %{http_code}\n'
# TraceQL search to confirm ingestion
curl -sG http://tempo:3200/api/search \
--data-urlencode 'q={ resource.service.name = "checkout" }' \
--data-urlencode "start=$(date -d '-15 min' +%s)" \
--data-urlencode "end=$(date +%s)" | jq '.traces | length'
# Was it sampled out?
kubectl logs deploy/otel-collector -n observability | grep -i sampling_decision
# Ingester flush + backend health
kubectl logs deploy/tempo -n tracing | grep -iE 'completing block|s3|access denied'
curl -s http://tempo:3200/metrics | grep tempo_ingester_blocks_flushed_total
Conclusion
Top root causes, in order of likelihood:
- Sampling dropped the trace: head/tail sampling in the collector never exported the spans, so it was never ingested.
- Ingester not yet flushed: a fresh trace lives in the ingester until its block is cut and flushed to the backend; cold ID lookups can 404 briefly.
- Trace expired past block_retention (default 336h/14d): the compactor deleted the block.
- Wrong tenant or malformed trace ID: a mismatched X-Scope-OrgID or trimmed leading hex zeros return a clean 404.
- Backend storage misconfigured: the S3/GCS bucket or credentials are wrong, so no blocks are readable.