
Grafana Tempo Distributed Tracing Prompt

Visualize traces in Grafana — Tempo data source, service graph, span metrics, trace search, OTLP integration.

Target user: SREs operating distributed tracing
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who has set up distributed tracing with Grafana Tempo — OTLP receivers, sampling, span metrics, service graph.

I will provide:
- Tracing setup (OTel collector, app instrumentation)
- Tempo deployment
- Symptom (traces missing, slow trace view, service graph empty)

Your job:

1. **Tempo architecture**:
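   - **Distributor**: receives spans and forwards them to ingesters
   - **Ingester**: buffers traces, cuts blocks, and flushes them to object storage (S3 here)
   - **Query-frontend / Querier**: shard and execute trace lookups and searches
   - **Compactor**: merges blocks and enforces retention
   - **Metrics-generator** (optional): derives service graph and span metrics from ingested spans
   - Single-binary OR microservices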
2. **For ingest**:
   - OTLP gRPC/HTTP
   - Jaeger, Zipkin compatibility
   - From OTel Collector or directly
3. **For trace search**:
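   - By trace ID (fast, direct lookup)
   - By span or resource attributes (slower, has to scan blocks)
   - TraceQL (query language for filtering on attributes, duration, and status)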
4. **For service graph**:
   - Computed from spans
   - Shows service dependencies
   - Tempo metrics-generator → Prometheus
5. **For span metrics**:
   - Tempo generates Prometheus metrics from spans
   - request rate, error rate, duration
   - "RED metrics from traces"
6. **For sampling**:
   - Head sampling at app/agent
   - Tail sampling at OTel Collector
   - Trade-off: detail vs cost
7. **For retention**:
   - Per-tenant
   - S3-backed
   - Compactor manages
8. **For trace-to-logs**:
   - Trace view shows correlated logs
   - Derived from same traceID

Mark DESTRUCTIVE: removing sampling (cost explosion), retention reduction (data loss), trace data exposing PII.

---

Tracing setup: [DESCRIBE]
Tempo deployment: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

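Tempo is a widely used trace backend in Grafana-based stacks. This prompt walks the model through the whole pipeline (ingest, search, metrics generation, sampling, retention, log correlation), so that a symptom can be tied to the component that causes it.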

How to use it

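  1. Put an OTel Collector in front of Tempo as the ingest gateway.
  2. Decide the sampling strategy (head, tail, or both) before rollout.
  3. Enable span metrics and the service graph in the metrics-generator.
  4. Wire up correlation with logs and metrics in the Grafana data source.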
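
To make the sampling decision in step 2 more concrete, here is a minimal Python sketch of the kind of policy a tail sampler applies once a whole trace has been collected: keep every trace that contains an error, keep slow traces, and keep a fixed share of the rest. The `Span` class, the `keep_trace` function, the 500 ms threshold, and the 10% baseline are all illustrative assumptions, not Tempo or OTel Collector APIs; a real deployment would express this as collector configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical, simplified span record (not an OTel SDK type)."""
    trace_id: str       # 32 hex characters
    duration_ms: float
    is_error: bool


def keep_trace(spans: List[Span],
               latency_threshold_ms: float = 500.0,
               baseline_ratio: float = 0.10) -> bool:
    """Decide whether to keep a complete trace (tail-sampling style)."""
    if not spans:
        return False
    # Policy 1: always keep traces that contain an error.
    if any(s.is_error for s in spans):
        return True
    # Policy 2: always keep traces with at least one slow span.
    if any(s.duration_ms >= latency_threshold_ms for s in spans):
        return True
    # Policy 3: keep a deterministic share of everything else, derived
    # from the trace ID so every collector replica agrees on the decision.
    bucket = int(spans[0].trace_id[-8:], 16) / 2**32   # value in [0, 1)
    return bucket < baseline_ratio
```

Head sampling has to make a decision like policy 3 at the start of a trace, before errors or latency are known. That is why it is cheaper to run but can drop exactly the traces you most want to keep.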

Useful commands

# Tempo health
curl http://tempo:3200/ready
curl http://tempo:3200/metrics

# Test ingest
# Send a test span via OTLP
curl -X POST http://tempo:4318/v1/traces \
    -H "Content-Type: application/json" \
    -d '{"resourceSpans":[...]}'

# Search trace by ID
curl http://tempo:3200/api/traces/<traceID>

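# TraceQL search
# -G with --data-urlencode encodes the query and avoids curl's {} globbing.
# date -d is GNU date; on macOS/BSD use: date -v-1H +%s
curl -G http://tempo:3200/api/search \
    --data-urlencode 'q={ status = error }' \
    --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
    --data-urlencode "end=$(date +%s)"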
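
The `[...]` in the test-ingest command above stands for a full OTLP/JSON export body. As a sketch of what one can look like, the Python below builds a single-span payload following the OTLP JSON encoding (hex-encoded IDs, nanosecond timestamps as strings). The function name `build_test_payload`, the service name `smoke-test`, the span name, and the 50 ms duration are made-up values for illustration.

```python
import json
import secrets
import sys
import time


def build_test_payload(service_name="smoke-test",
                       span_name="tempo-ingest-check"):
    """Return (trace_id, payload) for a one-span OTLP/JSON request body."""
    trace_id = secrets.token_hex(16)    # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)      # 8 random bytes  -> 16 hex chars
    end_ns = time.time_ns()
    start_ns = end_ns - 50_000_000      # pretend the span took 50 ms
    payload = {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name",
                 "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    "traceId": trace_id,
                    "spanId": span_id,
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }
    return trace_id, payload


if __name__ == "__main__":
    trace_id, payload = build_test_payload()
    print(trace_id, file=sys.stderr)    # ID to look up afterwards
    print(json.dumps(payload))          # JSON body on stdout
```

Redirecting stdout to a file (for example `python make_span.py > span.json`, both file names hypothetical) lets you pass the body to the curl command with `-d @span.json`, then fetch the trace by the ID printed on stderr.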

Tempo config (single-binary)

target: all

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc: { endpoint: 0.0.0.0:4317 }
        http: { endpoint: 0.0.0.0:4318 }

ingester:
  trace_idle_period: 10s
  max_block_duration: 5m

storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces
      endpoint: s3.amazonaws.com
      region: us-east-1

compactor:
  compaction:
    block_retention: 720h        # 30 days

metrics_generator:
  registry:
    external_labels:
      cluster: prod
  storage:
    path: /var/tempo/metrics
    remote_write:
    # Prometheus must run with --web.enable-remote-write-receiver to accept this
    - url: http://prometheus:9090/api/v1/write

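overrides:
  # Legacy flat key; newer Tempo releases expect
  # overrides.defaults.metrics_generator.processors instead
  metrics_generator_processors: [service-graphs, span-metrics]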

Grafana Tempo datasource

apiVersion: 1

datasources:
- name: Tempo
  type: tempo
  uid: tempo
  url: http://tempo:3200
  jsonData:
    tracesToLogs:
      datasourceUid: loki
      filterByTraceID: true
      tags: [cluster, namespace, pod]
    tracesToMetrics:
      datasourceUid: prometheus
      spanStartTimeShift: '-2m'
      spanEndTimeShift: '2m'
    serviceMap:
      datasourceUid: prometheus
    nodeGraph:
      enabled: true
    search:
      hide: false
    lokiSearch:
      datasourceUid: loki

Span metrics in Prometheus

# Request rate by service
sum by (service)(rate(traces_spanmetrics_calls_total[5m]))

# Error rate (the metrics-generator labels failed spans STATUS_CODE_ERROR)
sum by (service)(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  / sum by (service)(rate(traces_spanmetrics_calls_total[5m]))

# Duration p99
histogram_quantile(0.99, sum by (service, le)(rate(traces_spanmetrics_latency_bucket[5m])))
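
The p99 query returns an estimate, not an exact latency: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The Python below is a simplified model of that interpolation for classic histograms, which helps show how bucket boundaries limit precision. The function name `estimate_quantile` and the sample bucket counts are illustrative, and the sketch assumes cumulative buckets sorted by upper bound with `q` strictly between 0 and 1.

```python
import math


def estimate_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with the last bound being math.inf (the +Inf bucket).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0   # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Illustrative bucket counts (seconds), not real data.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(estimate_quantile(0.95, sample))
```

Any latency that lands between the 0.5 s and 1.0 s bounds is reported as a point on a straight line between them, so a coarse bucket layout yields a coarse p99.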

Common findings this catches

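  • No traces → distributor not receiving spans, ingester unhealthy, or sampling drops everything.
  • Trace not found → trace is older than the retention window, or it was sampled out.
  • Service graph empty → metrics-generator not enabled, or its processors are not set in overrides.
  • Trace view slow → object storage (S3) latency.
  • Sampling drops too much → raise the sampling ratio, or add tail-sampling policies that always keep errors and slow traces.
  • PII in spans → review app instrumentation and scrub sensitive attributes before export.
  • Storage costs blowing up → introduce or tighten tail sampling.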
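
For the "No traces" case, a quick first check is whether the distributor is receiving spans at all and whether Tempo is discarding them. The Python sketch below sums selected counters from the text that the `/metrics` endpoint returns. The metric names `tempo_distributor_spans_received_total` and `tempo_discarded_spans_total` are assumed here; confirm them against your own `/metrics` output. The sample text is fabricated for illustration.

```python
import re
from collections import defaultdict

# Metric name, optional {labels}, then the sample value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(\S+)')


def sum_metrics(text, names):
    """Sum all series of the given metric names in Prometheus text format."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                    # skip blanks, HELP and TYPE lines
        match = SAMPLE_RE.match(line)
        if match and match.group(1) in names:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


# Fabricated sample; real input would be the body of GET /metrics.
SAMPLE = """\
# TYPE tempo_distributor_spans_received_total counter
tempo_distributor_spans_received_total{tenant="single-tenant"} 1200
tempo_discarded_spans_total{reason="rate_limited",tenant="single-tenant"} 45
go_goroutines 88
"""

if __name__ == "__main__":
    wanted = {"tempo_distributor_spans_received_total",
              "tempo_discarded_spans_total"}
    print(sum_metrics(SAMPLE, wanted))
```

A received count that stays at zero points upstream (collector, exporter endpoint, network), while a growing discarded count points at limits or ingester health inside Tempo.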

When to escalate

  • Sampling strategy design: coordinate with the owners of the affected services.
  • Trace volume scaling: hand off to engineering.
  • Privacy review (PII in spans): involve security.
