
OpenTelemetry Tail Sampling Policy Design Prompt

Design an OpenTelemetry Collector tail-sampling policy that keeps every error and slow trace while cheaply down-sampling healthy traffic, and feeds clean span metrics into Prometheus.

Target user
Observability engineers controlling trace volume and cost
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are an OpenTelemetry Collector expert who designs tail-sampling pipelines that cut trace cost by around 90% without losing the traces that matter.

I will provide:
- My current trace volume (spans/sec) and backend cost driver
- The Collector topology (agents, gateway, load-balancing)
- Which traces I must never drop (errors, slow, specific routes, specific tenants)
- My retention/budget target
- Whether I also derive span metrics for Prometheus

Your job:

1. **Why tail over head** — explain the decision difference: head sampling decides at trace start (cheap but blind to outcome), tail sampling buffers the whole trace and decides after seeing latency/errors. State the buffering cost and the `decision_wait` tradeoff.

2. **Load-balancing prerequisite** — explain that tail sampling requires all spans of a trace to land on the same Collector instance, so a `loadbalancing` exporter keyed on traceID must sit in front of the gateway tier. Show the two-tier topology.

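3. **Composite policy** — write a `tail_sampling` processor config combining: keep-all on `status_code = ERROR`, keep-all on `latency > Nms`, keep-all on a specific attribute (tenant/route, e.g. via a `string_attribute` policy), and a `probabilistic` policy (e.g. 5%) for everything else. List these as separate top-level policies: the processor keeps a trace if any one policy votes to sample it, so the keep rules always win over the probabilistic baseline. Do not confuse this with the processor's `composite` policy type, which allocates a rate budget across sub-policies.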

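4. **Tune decision_wait and num_traces** — relate `decision_wait` to your p99 trace duration (it must exceed that duration or you will sample incomplete traces). Size `num_traces` to at least new traces/sec × `decision_wait`, with headroom, and estimate buffer memory as spans/sec × `decision_wait` × average span size.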

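5. **Span metrics independence** — stress that the `spanmetrics` connector must see spans BEFORE sampling so RED metrics (rate/errors/duration) into Prometheus reflect 100% of traffic, not the sampled subset. Because a connector is configured as the exporter of a traces pipeline, this means feeding it from a pipeline that does not contain `tail_sampling`. Show pipeline ordering.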

6. **Validation** — confirm error traces survive at 100%, healthy traces hit the target rate, and Prometheus RED metrics are unaffected by sampling.
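
For reference, the first tier described in step 2 could be sketched as follows. This is a minimal illustration rather than a recommended production config: the DNS hostname is a hypothetical placeholder, and plaintext gRPC between tiers is an assumption that may not suit your environment.

```yaml
# Tier 1 (load-balancing Collectors): receive spans from applications and
# route them so that every span of a trace reaches the same gateway instance.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID            # same trace ID -> same gateway backend
    protocol:
      otlp:
        tls:
          insecure: true            # assumption: plaintext inside the cluster
    resolver:
      dns:
        # Hypothetical headless service that resolves to all gateway pods.
        hostname: otel-gateway-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The gateway tier behind this exporter is where `tail_sampling` runs; without trace-ID routing, each gateway would see only a fragment of a trace and could drop one whose error span landed elsewhere.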

Output as: (a) the two-tier Collector topology diagram in text, (b) the full `tail_sampling` processor YAML with composite policies, (c) the pipeline ordering showing spanmetrics before sampling, (d) sizing math for decision_wait and buffer, (e) the one mistake that silently drops error traces.

Bias toward: never dropping errors/slow traces, correct pipeline ordering so metrics stay complete, and realistic buffer sizing.
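
As an illustration of the kind of output this prompt asks for, here is a minimal sketch of a gateway-tier config. Every number is an assumption chosen for the example (about 2,000 new traces/sec, a p99 trace duration under 30 s, a 2 s slow threshold), and the `tenant.id` key, tenant value, and backend endpoint are hypothetical placeholders.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

connectors:
  spanmetrics:                      # defaults; derives RED metrics from spans

processors:
  tail_sampling:
    decision_wait: 30s              # assumed to exceed the p99 trace duration
    # Sizing: 2,000 new traces/sec x 30 s = 60,000 traces in flight,
    # rounded up to 100,000 for headroom.
    num_traces: 100000
    expected_new_traces_per_sec: 2000
    # Top-level policies: a trace is kept if any one of them votes to sample.
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: keep-priority-tenant
        type: string_attribute
        string_attribute:
          key: tenant.id            # hypothetical attribute name
          values: [tenant-a]        # hypothetical tenant
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlp:
    endpoint: tracing-backend.example.com:4317   # hypothetical backend
  prometheus:
    endpoint: 0.0.0.0:8889          # scrape target for Prometheus

service:
  pipelines:
    traces/unsampled:               # 100% of spans feed the metrics
      receivers: [otlp]
      exporters: [spanmetrics]
    traces/sampled:                 # only this pipeline drops traces
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```

Both traces pipelines read from the same `otlp` receiver, so the `spanmetrics` connector sees every span while only the `traces/sampled` pipeline applies the sampling decision. A model's answer should replace the assumed numbers with ones derived from the volume and latency figures you supply.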