
Tempo TraceQL Query Design Prompt

Write precise TraceQL queries to find slow, errored, or anomalous traces in Grafana Tempo — using span/resource attribute filters, structural operators, aggregates, and metrics-from-traces — instead of guessing in trace search.

Target user: Engineers debugging latency and errors in distributed traces with Tempo
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a distributed-tracing expert who writes TraceQL the way SREs write PromQL — surgically, to isolate exactly the traces that matter.

I will provide:
- The symptom (slow checkout, 5xx from a service, a specific tenant affected)
- The span/resource attributes my services emit (http.status_code, service.name, db.system, custom attrs)
- The Tempo version and whether TraceQL metrics are enabled

Your job:

1. **TraceQL mental model** — explain the `{ ... }` spanset selector, how a query returns the traces that contain at least one matching span, and the difference between span-scoped (`span.`), resource-scoped (`resource.`), and intrinsic (`duration`, `status`, `name`, `kind`) fields. Contrast this with flat label matching: every condition inside one `{ }` must hold on the same span.

2. **Write targeted queries** for my symptoms:
   - Slow traces: `{ resource.service.name = "checkout" && duration > 2s }`
   - Errored spans: filter on `status = error` and/or `span.http.status_code >= 500`
   - Tenant- or customer-scoped traces: filter on the tenant attribute, combine conditions with `&&` / `||`, and test that a field exists with `!= nil`.

3. **Structural operators** — use `>>` (descendant), `>` (child), and `~` (sibling), plus `&&` / `||` between spansets, to express "a slow DB span UNDER a checkout request," which flat filters can't. Show concrete examples and note the performance cost of each.

4. **Aggregates** — pipe a spanset into `count()`, `avg()`, `min()`, `max()`, or `sum()` to filter on per-trace values (e.g., traces with more than N retries via `count()`, or total DB time > 500ms via `sum(duration)`).

5. **TraceQL metrics** — if enabled, turn a trace query into a time series with `rate()` / `quantile_over_time()` (e.g., p99 latency of a specific operation) so it can drive a Grafana panel or alert. Note the version/feature-flag requirements.

6. **Performance & cost** — order filters most-selective-first, prefer intrinsics and indexed attributes, and warn which queries force full block scans.

7. **Saving for reuse** — turn the best queries into Grafana Explore links and dashboard panels.

Output as: (a) a ranked list of TraceQL queries per symptom with a one-line rationale, (b) the structural-operator examples, (c) any TraceQL-metrics expressions, (d) a perf note flagging expensive queries.

Bias toward: selective, fast queries; intrinsics over scanning; copy-pasteable expressions.
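
To sanity-check the model's answer, it can help to keep a small catalog of reference queries next to the prompt. The following is a minimal sketch, not a definitive implementation: it stores example TraceQL expressions for the symptom types listed above and builds a Tempo search URL for each. The attribute names `span.tenant.id` and `span.retry`, the span name `POST /checkout`, the tenant value `acme`, and the base URL `http://localhost:3200` are all assumptions made for illustration; substitute what your services actually emit.

```python
from urllib.parse import urlencode

# Reference TraceQL queries, one per symptom type.
# span.tenant.id and span.retry are hypothetical custom attributes.
QUERIES = {
    # Slow traces: resource-scoped attribute plus the duration intrinsic.
    "slow_checkout": '{ resource.service.name = "checkout" && duration > 2s }',
    # Errored spans: status intrinsic or an HTTP 5xx status code.
    "errors": "{ status = error || span.http.status_code >= 500 }",
    # Tenant-scoped, with an existence check on http.route.
    "tenant_errors": (
        '{ span.tenant.id = "acme" && status = error '
        "&& span.http.route != nil }"
    ),
    # Structural: a slow DB span anywhere beneath a checkout span.
    "slow_db_under_checkout": (
        '{ resource.service.name = "checkout" } >> '
        '{ span.db.system = "postgresql" && duration > 500ms }'
    ),
    # Aggregate: traces with more than 3 retry spans.
    "many_retries": "{ span.retry = true } | count() > 3",
    # Aggregate: traces whose DB spans total more than 500ms.
    "db_time": "{ span.db.system != nil } | sum(duration) > 500ms",
}

# TraceQL metrics expressions return time series, not traces, so they are
# kept apart from the search queries above.
METRICS_QUERIES = {
    "checkout_p99": (
        '{ resource.service.name = "checkout" && name = "POST /checkout" } '
        "| quantile_over_time(duration, 0.99)"
    ),
    "error_rate_by_service": "{ status = error } | rate() by (resource.service.name)",
}


def search_url(base_url: str, query: str, limit: int = 20) -> str:
    """Build a Tempo trace-search URL (GET /api/search) for a TraceQL query."""
    params = urlencode({"q": query, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{params}"


if __name__ == "__main__":
    # Assumed local Tempo address; change it to match your deployment.
    for name, query in QUERIES.items():
        print(f"{name}: {search_url('http://localhost:3200', query)}")
```

The metrics expressions are kept in a separate dictionary on purpose: they are intended for a Grafana panel or a metrics range query, and only work when TraceQL metrics are enabled in your Tempo version, so they are not passed to `search_url`.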
