Grafana Tempo Distributed Tracing Prompt
Visualize traces in Grafana — Tempo data source, service graph, span metrics, trace search, OTLP integration.
- Target user: SREs operating distributed tracing
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has set up distributed tracing with Grafana Tempo: OTLP receivers, sampling, span metrics, service graph.

I will provide:
- Tracing setup (OTel collector, app instrumentation)
- Tempo deployment
- Symptom (traces missing, slow trace view, service graph empty)

Your job:

1. **Tempo architecture**:
   - **Distributor**: receives spans
   - **Ingester**: buffers, writes to S3
   - **Querier**: reads
   - **Compactor**
   - Single-binary OR microservices
2. **For ingest**:
   - OTLP gRPC/HTTP
   - Jaeger, Zipkin compatibility
   - From OTel Collector or directly
3. **For trace search**:
   - By traceID (fast)
   - By labels (slower)
   - TraceQL (newer)
4. **For service graph**:
   - Computed from spans
   - Shows service dependencies
   - Tempo metrics-generator → Prometheus
5. **For span metrics**:
   - Tempo generates Prometheus metrics from spans
   - request rate, error rate, duration
   - "RED metrics from traces"
6. **For sampling**:
   - Head sampling at app/agent
   - Tail sampling at OTel Collector
   - Trade-off: detail vs cost
7. **For retention**:
   - Per-tenant
   - S3-backed
   - Compactor manages
8. **For trace-to-logs**:
   - Trace view shows correlated logs
   - Derived from same traceID

Mark DESTRUCTIVE: removing sampling (cost explosion), retention reduction (data loss), trace data exposing PII.

---

Tracing setup: [DESCRIBE]
Tempo deployment: [DESCRIBE]
Symptom: [DESCRIBE]
Why this prompt works
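Tempo is becoming a standard tracing backend in Grafana-based stacks. This prompt walks the model through each part of a Tempo setup (architecture, ingest, search, service graph, span metrics, sampling, retention, trace-to-logs), so a reported symptom is checked against every component instead of a single guess.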
How to use it
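- Put the OTel Collector in front of Tempo as the ingest gateway.
- Decide the sampling strategy (head, tail, or both) upfront.
- Enable span metrics and the service graph so traces also feed metrics-based observability.
- Set up correlation from traces to logs and metrics.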
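To make the head versus tail sampling trade-off concrete, here is a minimal Python sketch. It is not how the OTel Collector or any SDK implements sampling; the `Span` type, the function names, and the thresholds (`slow_ms`, `baseline_ratio`) are illustrative assumptions. Head sampling decides from the trace ID alone, while tail sampling waits for the complete trace so it can always keep errors and slow requests.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical, simplified span record used only for this sketch."""
    trace_id: str          # 32 hex characters
    service: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at trace start, using only the trace ID.

    The decision is deterministic per trace ID, so every service that
    applies the same ratio keeps or drops the same traces.
    """
    if ratio >= 1.0:
        return True
    if ratio <= 0.0:
        return False
    bucket = int(trace_id[-16:], 16)       # low 64 bits of the trace ID
    return bucket < int(ratio * 2**64)


def tail_sample(spans: List[Span], slow_ms: float = 500.0,
                baseline_ratio: float = 0.05) -> bool:
    """Decide after the whole trace is buffered.

    Keeps every trace containing an error or a slow span, and a small
    baseline share of the remaining traces.
    """
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True
    return head_sample(spans[0].trace_id, baseline_ratio)
```

The cost side of the trade-off shows up in `tail_sample`: it needs all spans of a trace in one place before deciding, which is why tail sampling is usually done in a collector tier and not in the application.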
Useful commands
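# Tempo health
curl http://tempo:3200/ready
curl http://tempo:3200/metrics

# Test ingest: send a test span via OTLP/HTTP
curl -X POST http://tempo:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[...]}'

# Fetch a trace by ID
curl http://tempo:3200/api/traces/<traceID>

# TraceQL search over the last hour
# (-G with --data-urlencode stops curl from globbing the braces and
#  URL-encodes the query; date -d requires GNU date)
curl -G http://tempo:3200/api/search \
  --data-urlencode 'q={ status = error }' \
  --data-urlencode "start=$(date -d '1h ago' +%s)" \
  --data-urlencode "end=$(date +%s)"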
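The `[...]` placeholder in the ingest test has to be replaced with a real OTLP/JSON body before Tempo will accept it. The following Python sketch builds a one-span payload, assuming the standard OTLP/JSON encoding (hex-encoded IDs, nanosecond timestamps as strings). The function name, service name, span name, and scope name are hypothetical.

```python
import json
import secrets
import sys
import time


def build_otlp_test_payload(service_name="tempo-smoke-test",
                            span_name="ingest-check",
                            duration_ms=50,
                            end_ns=None):
    """Return (trace_id, payload) for a single-span OTLP/JSON export."""
    end_ns = time.time_ns() if end_ns is None else end_ns
    start_ns = end_ns - duration_ms * 1_000_000
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex chars
    payload = {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name",
                     "value": {"stringValue": service_name}},
                ],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": trace_id,
                    "spanId": span_id,
                    "name": span_name,
                    "kind": 2,                    # SPAN_KIND_SERVER
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                    "status": {"code": 1},        # STATUS_CODE_OK
                }],
            }],
        }],
    }
    return trace_id, payload


if __name__ == "__main__":
    trace_id, payload = build_otlp_test_payload()
    print(json.dumps(payload))                     # JSON body on stdout
    print(f"traceID: {trace_id}", file=sys.stderr)  # ID for the lookup call
```

One way to use it: redirect the script's stdout to a file, post that file with `curl ... -d @<file>`, then pass the printed trace ID to the `/api/traces/<traceID>` lookup to confirm the span arrived.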
Tempo config (single-binary)
target: all

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: 0.0.0.0:4317 }
        http: { endpoint: 0.0.0.0:4318 }

ingester:
  trace_idle_period: 10s
  max_block_duration: 5m

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1

compactor:
  compaction:
    block_retention: 720h # 30 days

metrics_generator:
  registry:
    external_labels:
      cluster: prod
  storage:
    path: /var/tempo/metrics
    remote_write:
      - url: http://prometheus:9090/api/v1/write

overrides:
  metrics_generator_processors: [service-graphs, span-metrics]
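The `service-graphs` processor enabled in this config derives service dependencies from spans. As a rough illustration of that idea, the Python sketch below counts an edge whenever a span's parent belongs to a different service. This is a simplification made for this example, not Tempo's actual algorithm (which, as far as I know, pairs client and server spans), and the dictionary keys `span_id`, `parent_span_id`, and `service` are assumed field names.

```python
from collections import Counter


def service_graph_edges(spans):
    """Count caller -> callee edges from parent/child span relationships.

    `spans` is a list of dicts with the assumed keys:
    span_id, parent_span_id (None for root spans), service.
    """
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_span_id"))
        # Only cross-service parent/child pairs form a graph edge.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges


if __name__ == "__main__":
    trace = [
        {"span_id": "a", "parent_span_id": None, "service": "frontend"},
        {"span_id": "b", "parent_span_id": "a", "service": "checkout"},
        {"span_id": "c", "parent_span_id": "b", "service": "payments"},
        {"span_id": "d", "parent_span_id": "b", "service": "checkout"},
    ]
    for (caller, callee), count in service_graph_edges(trace).items():
        print(f"{caller} -> {callee}: {count}")
```

The sketch also hints at why a service graph can come out empty even when traces exist: if parent spans are dropped by sampling or context propagation is broken, there is no parent to pair with and no edge is produced.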
Grafana Tempo datasource
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        tags: [cluster, namespace, pod]
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: '-2m'
        spanEndTimeShift: '2m'
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      lokiSearch:
        datasourceUid: loki
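The `tracesToLogs` settings map span tags to Loki labels and filter log lines by trace ID. The Python sketch below shows roughly what kind of LogQL query such a link amounts to. It approximates the idea and is not Grafana's actual query builder; the function name and the escaping rules are assumptions.

```python
def logql_for_trace(trace_id, span_tags, mapped_tags=("cluster", "namespace", "pod")):
    """Build a LogQL query selecting logs for one trace.

    Only tags listed in `mapped_tags` and present on the span become
    label matchers; the trace ID is applied as a line filter.
    """
    def escape(value):
        # Escape backslashes and double quotes inside a LogQL string literal.
        return str(value).replace("\\", "\\\\").replace('"', '\\"')

    matchers = [
        f'{tag}="{escape(span_tags[tag])}"'
        for tag in mapped_tags
        if tag in span_tags
    ]
    if not matchers:
        # LogQL needs at least one label matcher in the stream selector.
        raise ValueError("span has none of the mapped tags")
    selector = "{" + ", ".join(matchers) + "}"
    return f'{selector} |= "{escape(trace_id)}"'
```

For a span tagged `cluster=prod` and `pod=api-0`, this yields `{cluster="prod", pod="api-0"} |= "<traceID>"`. If the span carries none of the mapped tags, there is nothing to select on, which is a common reason a trace-to-logs link returns no logs.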
Span metrics in Prometheus
# Request rate by service
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Error ratio by service (Tempo labels failed spans STATUS_CODE_ERROR)
  sum by (service) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
/
  sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Duration p99
histogram_quantile(0.99, sum by (service, le) (rate(traces_spanmetrics_latency_bucket[5m])))
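To show what these three queries measure, the following Python sketch computes the same RED figures directly from a batch of spans. It is an illustration only: the tuple layout `(service, duration_ms, is_error)` and the function name are assumptions, and the p99 uses a nearest-rank percentile on raw durations, whereas `histogram_quantile` interpolates within histogram buckets and will generally give a slightly different number.

```python
import math
from collections import defaultdict


def red_metrics(spans, window_s):
    """Compute rate, error ratio, and p99 duration per service.

    `spans` is an iterable of (service, duration_ms, is_error) tuples
    observed during a window of `window_s` seconds.
    """
    by_service = defaultdict(list)
    for service, duration_ms, is_error in spans:
        by_service[service].append((duration_ms, is_error))

    result = {}
    for service, rows in by_service.items():
        durations = sorted(d for d, _ in rows)
        errors = sum(1 for _, failed in rows if failed)
        rank = max(1, math.ceil(0.99 * len(durations)))  # nearest-rank p99
        result[service] = {
            "rate_per_s": len(rows) / window_s,
            "error_ratio": errors / len(rows),
            "p99_ms": durations[rank - 1],
        }
    return result
```

Because the generated metrics are computed from the spans Tempo receives, any sampling applied before ingest also skews them; a 10% head-sampled stream reports roughly a tenth of the true request rate.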
Common findings this catches
- No traces → ingester unhealthy, or sampling drops every trace.
- Trace not found → trace is older than the retention period (`block_retention`).
- Service graph empty → metrics-generator not enabled.
- Trace view slow → S3 backend latency.
- Sampling drops too much → raise the sampling ratio or loosen the tail-sampling policies.
- PII in spans → review app instrumentation.
- Storage costs blowing up → introduce tail sampling.
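The storage-cost finding is easier to reason about with a back-of-the-envelope estimate. The Python sketch below multiplies ingest rate, span size, sampling ratio, and retention. Every input is an assumption you must measure for your own system, and the formula ignores index and replication overhead; `compression_ratio` is a hypothetical knob, not a Tempo setting.

```python
def estimate_storage_gb(spans_per_second, avg_span_bytes, retention_hours,
                        sample_ratio=1.0, compression_ratio=1.0):
    """Rough size of retained trace data in GB (10^9 bytes)."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    retained_spans = spans_per_second * sample_ratio * retention_hours * 3600
    return retained_spans * avg_span_bytes / compression_ratio / 1e9


if __name__ == "__main__":
    # Hypothetical workload: 1,000 spans/s, 500 bytes each, 30 days retention.
    full = estimate_storage_gb(1000, 500, 720)
    sampled = estimate_storage_gb(1000, 500, 720, sample_ratio=0.1)
    print(f"unsampled: {full:.1f} GB, 10% sampled: {sampled:.1f} GB")
```

The estimate scales linearly with both `sample_ratio` and `retention_hours`, so halving either halves storage. Reducing retention discards existing data, which is why the prompt marks that change as DESTRUCTIVE.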
When to escalate
- Sampling strategy design: coordinate with the teams that own the instrumented services.
- Trace volume scaling: escalate to engineering.
- Privacy review: escalate to security.
Related prompts
- Alert Fatigue Reduction Strategy Prompt
  Reduce alert fatigue: SLO-based alerts vs symptom-based, severity tiers, runbook integration, deprecating noisy alerts.
- Grafana Loki + Prometheus Correlation Prompt
  Correlate metrics and logs in Grafana: exemplars from Prometheus to traces, derived fields from Loki, jump from spike to log line.
- OpenTelemetry on Kubernetes Collector Design Prompt
  Design and debug the OpenTelemetry Collector on Kubernetes: agent vs gateway, receivers/processors/exporters, sidecar vs DaemonSet, traces/metrics/logs pipelines.