You are a senior SRE who has set up distributed tracing with Grafana Tempo: OTLP receivers, sampling, span metrics, service graph.

I will provide:
- Tracing setup (OTel collector, app instrumentation)
- Tempo deployment
- Symptom (traces missing, slow trace view, service graph empty)

Your job:

1. **Tempo architecture**:
   - **Distributor**: receives spans
   - **Ingester**: buffers, writes to S3
   - **Querier**: reads
   - **Compactor**
   - Single-binary OR microservices
2. **For ingest**:
   - OTLP gRPC/HTTP
   - Jaeger, Zipkin compatibility
   - From OTel Collector or directly
3. **For trace search**:
   - By traceID (fast)
   - By labels (slower)
   - TraceQL (newer)
4. **For service graph**:
   - Computed from spans
   - Shows service dependencies
   - Tempo metrics-generator → Prometheus
5. **For span metrics**:
   - Tempo generates Prometheus metrics from spans
   - request rate, error rate, duration
   - "RED metrics from traces"
6. **For sampling**:
   - Head sampling at app/agent
   - Tail sampling at OTel Collector
   - Trade-off: detail vs cost
7. **For retention**:
   - Per-tenant
   - S3-backed
   - Compactor manages
8. **For trace-to-logs**:
   - Trace view shows correlated logs
   - Derived from same traceID

Mark DESTRUCTIVE: removing sampling (cost explosion), retention reduction (data loss), trace data exposing PII.

---

Tracing setup: [DESCRIBE]
Tempo deployment: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

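Tempo is becoming a standard tracing backend in Grafana-based stacks. This prompt walks through the whole setup: ingest, trace search, service graph, span metrics, sampling, retention, and trace-to-logs correlation.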

How to use it

Put the OTel Collector in front of Tempo as the ingest gateway.
Decide the sampling strategy up front.
Enable span metrics and the service graph for observability.
Set up correlation with logs and metrics.

Useful commands

# Tempo health
curl http://tempo:3200/ready
curl http://tempo:3200/metrics

# Test ingest
# Send a test span via OTLP
curl -X POST http://tempo:4318/v1/traces \
    -H "Content-Type: application/json" \
    -d '{"resourceSpans":[...]}'

# Search trace by ID
curl http://tempo:3200/api/traces/<traceID>

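# TraceQL search (-G with --data-urlencode encodes the braces; date -d needs GNU date)
curl -G http://tempo:3200/api/search \
    --data-urlencode 'q={ status = error }' \
    --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
    --data-urlencode "end=$(date +%s)"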
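The test-ingest command above leaves the `resourceSpans` array elided. As a rough sketch, one minimal OTLP/HTTP JSON body for a single span could be assembled as follows. The function names, the `tempo-smoke-test` service name, and the span name are illustrative assumptions, not part of Tempo; the endpoint is the OTLP/HTTP receiver used above.

```shell
#!/usr/bin/env bash
# Sketch only: function names, service name, and span name are hypothetical.

# Print N random bytes as lowercase hex (trace IDs are 16 bytes, span IDs 8).
rand_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Usage: build_otlp_payload TRACE_ID SPAN_ID START_NS END_NS
# Prints one OTLP JSON document containing a single INTERNAL span (kind 1).
build_otlp_payload() {
  cat <<EOF
{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"tempo-smoke-test"}}]},"scopeSpans":[{"scope":{"name":"manual"},"spans":[{"traceId":"$1","spanId":"$2","name":"smoke-test-span","kind":1,"startTimeUnixNano":"$3","endTimeUnixNano":"$4"}]}]}]}
EOF
}

# Send a one-second span that ended "now" and print its trace ID on success.
send_test_span() {
  local trace_id span_id now
  trace_id=$(rand_hex 16)
  span_id=$(rand_hex 8)
  now=$(date +%s)
  build_otlp_payload "$trace_id" "$span_id" "$((now - 1))000000000" "${now}000000000" |
    curl -fsS -X POST "http://tempo:4318/v1/traces" \
      -H "Content-Type: application/json" --data-binary @- >/dev/null &&
    echo "$trace_id"
}
```

Running `send_test_span` prints the generated trace ID, which can then be passed to the `/api/traces/<traceID>` lookup.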

Tempo config (single-binary)

target: all

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: 0.0.0.0:4317 }
        http: { endpoint: 0.0.0.0:4318 }

ingester:
  trace_idle_period: 10s
  max_block_duration: 5m

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1

compactor:
  compaction:
    block_retention: 720h        # 30 days

metrics_generator:
  registry:
    external_labels:
      cluster: prod
  storage:
    path: /var/tempo/metrics
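    remote_write:
    # Prometheus must run with --web.enable-remote-write-receiver to accept this
    - url: http://prometheus:9090/api/v1/write

# Legacy overrides format; newer Tempo releases expect
# overrides.defaults.metrics_generator.processors instead.
overrides:
  metrics_generator_processors: [service-graphs, span-metrics]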

Grafana Tempo datasource

apiVersion: 1

datasources:
- name: Tempo
  type: tempo
  uid: tempo
  url: http://tempo:3200
  jsonData:
    tracesToLogs:
      datasourceUid: loki
      filterByTraceID: true
      tags: [cluster, namespace, pod]
    tracesToMetrics:
      datasourceUid: prometheus
      spanStartTimeShift: '-2m'
      spanEndTimeShift: '2m'
    serviceMap:
      datasourceUid: prometheus
    nodeGraph:
      enabled: true
    search:
      hide: false
    lokiSearch:
      datasourceUid: loki

Span metrics in Prometheus

# Request rate by service
sum by (service)(rate(traces_spanmetrics_calls_total[5m]))

# Error rate (Tempo emits status_code="STATUS_CODE_ERROR" for error spans)
sum by (service)(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  / sum by (service)(rate(traces_spanmetrics_calls_total[5m]))

# Duration p99
histogram_quantile(0.99, sum by (service, le)(rate(traces_spanmetrics_latency_bucket[5m])))
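To run queries like these from a terminal, a small wrapper around the Prometheus HTTP API can help. The sketch below is one possible shape: `error_ratio_query` and `prom_query` are hypothetical helper names, `PROM_URL` is an assumed environment variable defaulting to the Prometheus address used in the Tempo config above, and the `STATUS_CODE_ERROR` label value is an assumption about what the span-metrics processor emits, so check it against your own series.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and PROM_URL are assumptions for illustration.

# Print a PromQL error-ratio expression for one service.
# Usage: error_ratio_query SERVICE [WINDOW]   (WINDOW defaults to 5m)
error_ratio_query() {
  local svc=$1 window=${2:-5m}
  printf 'sum(rate(traces_spanmetrics_calls_total{service="%s",status_code="STATUS_CODE_ERROR"}[%s])) / sum(rate(traces_spanmetrics_calls_total{service="%s"}[%s]))\n' \
    "$svc" "$window" "$svc" "$window"
}

# Run an instant query against Prometheus; -G plus --data-urlencode
# takes care of URL-encoding the braces and quotes in the expression.
prom_query() {
  curl -sS -G "${PROM_URL:-http://prometheus:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, `prom_query "$(error_ratio_query checkout 1m)"` would return the JSON result for a service named `checkout` (a made-up service name).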

Common findings this catches

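No traces → distributor or ingester unhealthy, or sampling drops everything.
Trace not found → trace is older than block_retention.
Service graph empty → metrics-generator not enabled, or service-graphs missing from the processor overrides.
Trace view slow → S3 backend latency.
Sampling drops too much → raise the head-sampling ratio or loosen the tail-sampling policies.
PII in spans → review app instrumentation.
Storage costs blowing up → add tail sampling at the OTel Collector.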
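For the "no traces" case, a quick first check is whether Tempo is receiving spans at all and whether it is discarding them. The sketch below sums a counter across all label sets from Tempo's `/metrics` output. `sum_metric` and `check_ingest` are hypothetical helper names, and the two metric names are assumptions based on what Tempo commonly exposes; confirm them against your version's `/metrics` output. It also assumes samples carry no trailing timestamp.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; metric names should be verified.

# Read Prometheus text exposition on stdin and print the sum of all
# samples whose metric name (label set stripped) equals NAME.
# Usage: ... | sum_metric NAME
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comments
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the label set, if any
      if (metric == name) total += $NF # value is the last field
    }
    END { printf "%.0f\n", total + 0 }
  '
}

# Fetch Tempo metrics once and report received vs discarded spans.
check_ingest() {
  local metrics
  metrics=$(curl -sS "http://tempo:3200/metrics") || return 1
  echo "spans received:  $(printf '%s\n' "$metrics" | sum_metric tempo_distributor_spans_received_total)"
  echo "spans discarded: $(printf '%s\n' "$metrics" | sum_metric tempo_discarded_spans_total)"
}
```

A received count stuck at zero points upstream (Collector, network, sampling); a growing discarded count points at Tempo-side limits.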

When to escalate

Sampling strategy design: coordinate with the teams that own the instrumented services.
Trace volume scaling: escalate to engineering.
Privacy review: escalate to security.

