
OpenTelemetry Tail Sampling Policy Design Prompt

Design an OpenTelemetry Collector tail-sampling policy that keeps every error and slow trace while cheaply down-sampling healthy traffic, and feeds clean span metrics into Prometheus.

Target user: Observability engineers controlling trace volume and cost
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an OpenTelemetry Collector expert who designs tail-sampling pipelines that cut trace cost by up to 90% without losing the traces that matter.

I will provide:
- My current trace volume (spans/sec) and backend cost driver
- The Collector topology (agents, gateway, load-balancing)
- Which traces I must never drop (errors, slow, specific routes, specific tenants)
- My retention/budget target
- Whether I also derive span metrics for Prometheus

Your job:

1. **Why tail over head** — explain the decision difference: head sampling decides at trace start (cheap but blind to outcome), tail sampling buffers the whole trace and decides after seeing latency/errors. State the buffering cost and the `decision_wait` tradeoff.

2. **Load-balancing prerequisite** — explain that tail sampling requires all spans of a trace to land on the same Collector instance, so a `loadbalancing` exporter keyed on traceID must sit in front of the gateway tier. Show the two-tier topology.

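3. **Composite policy** — write a `tail_sampling` processor config combining: keep-all on errors (a `status_code` policy with `status_codes: [ERROR]`), keep-all on slow traces (a `latency` policy with `threshold_ms: N`), keep-all on specific attributes such as tenant or route (a `string_attribute` policy), and a `probabilistic` policy (e.g. 5%) for everything else. List them as separate top-level policies: the processor keeps a trace when any one policy votes to sample, so the keep rules win over the probabilistic baseline.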

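4. **Tune decision_wait and num_traces** — relate `decision_wait` to my p99 trace duration (it must exceed that duration, or decisions are made on incomplete traces), size `num_traces` to new traces/sec × `decision_wait` plus headroom, and estimate buffer memory as spans/sec × `decision_wait` × average span size.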

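5. **Span metrics independence** — stress that the `spanmetrics` connector must receive spans BEFORE sampling (as the exporter of an unsampled traces pipeline, never downstream of `tail_sampling`) so that the RED metrics (rate, errors, duration) sent to Prometheus reflect 100% of traffic, not the sampled subset. Show the pipeline ordering.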

6. **Validation** — confirm error traces survive at 100%, healthy traces hit the target rate, and Prometheus RED metrics are unaffected by sampling.

Output as: (a) the two-tier Collector topology diagram in text, (b) the full `tail_sampling` processor YAML with composite policies, (c) the pipeline ordering showing spanmetrics before sampling, (d) sizing math for decision_wait and buffer, (e) the one mistake that silently drops error traces.

Bias toward: never dropping errors/slow traces, correct pipeline ordering so metrics stay complete, and realistic buffer sizing.
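
For orientation, here is a minimal sketch of the kind of configuration a good response to this prompt might contain. It is an illustration under stated assumptions, not a reference setup: the file names, hostnames, the `tenant.id` attribute, its values, and every number (30s wait, 2000 ms threshold, 5%, 5,000 new traces/sec) are hypothetical placeholders. It assumes the Collector contrib distribution, which ships the `loadbalancing` exporter, the `tail_sampling` processor, and the `spanmetrics` connector.

```yaml
# --- agent-config.yaml (hypothetical name): tier 1, runs next to the workloads ---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID          # all spans of a trace go to the same gateway replica
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: plaintext inside the cluster
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local  # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
---
# --- gateway-config.yaml (hypothetical name): tier 2, one of N gateway replicas ---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics: {}                 # defaults; turns spans into RED metrics

processors:
  tail_sampling:
    decision_wait: 30s            # assumption: p99 trace duration is well under 30s
    num_traces: 200000            # about 5,000 new traces/sec x 30s, plus headroom
    expected_new_traces_per_sec: 5000
    policies:                     # a trace is kept if ANY policy votes to sample
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-priority-tenants
        type: string_attribute
        string_attribute:
          key: tenant.id          # placeholder attribute name
          values: [tenant-a, tenant-b]
      - name: baseline-5-percent
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlp/backend:
    endpoint: tracing-backend.example.com:4317   # placeholder
  prometheus:
    endpoint: 0.0.0.0:8889        # scrape target for Prometheus

service:
  pipelines:
    traces/unsampled:             # 100% of spans feed the metrics connector
      receivers: [otlp]
      exporters: [spanmetrics]
    traces/sampled:               # same receiver, sampled copy goes to the backend
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp/backend]
    metrics/red:
      receivers: [spanmetrics]
      exporters: [prometheus]
```

The two traces pipelines share one `otlp` receiver, so span metrics are computed from the full stream while only the sampled copy reaches the backend. Because spans are spread across gateway replicas by trace ID, each replica exposes partial counts for a given service, so Prometheus queries would need to sum across replicas.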
