OpenTelemetry on Kubernetes Collector Design Prompt
Design and debug the OpenTelemetry Collector on Kubernetes — agent vs gateway, receivers/processors/exporters, sidecar vs DaemonSet, traces/metrics/logs pipelines.
- Target user: Platform engineers running observability infrastructure
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has deployed OpenTelemetry Collector in production: agent + gateway pattern, sampling, exporters to multiple backends, debugging silent data drops.

I will provide:
- The deployment pattern (DaemonSet agent, Deployment gateway, sidecar)
- Collector config (receivers/processors/exporters/pipelines)
- The symptom (no data at backend, partial data, high CPU on collector, OOM)

Your job:

1. **Architecture choices**:
   - **DaemonSet agent**: one per node; receives from local pods; forwards to gateway
   - **Gateway**: central Deployment; receives from agents; processes; exports
   - **Sidecar**: per-pod collector; rare; for specific apps
   - Hybrid: agent + gateway is most common
2. **Collector components**:
   - **Receivers**: accept data (OTLP, Jaeger, Prometheus scrape, hostmetrics)
   - **Processors**: transform (batch, attributes, filter, tail sampling)
   - **Exporters**: send to backend (Jaeger, Tempo, Loki, OTLP gateway)
   - **Pipelines**: connect receivers → processors → exporters per signal (traces/metrics/logs)
3. **For "no data at backend"**:
   - Verify collector running + healthy
   - Check exporter logs for errors
   - Check receiver: is data arriving?
   - `pprof` and `zpages` for collector introspection
4. **For sampling**:
   - **head sampling** (probabilistic at client / agent)
   - **tail sampling** (gateway makes decision after full trace)
   - Tail sampling preserves interesting traces (errors, slow)
5. **For high cardinality**:
   - Metric attribute explosion = backend OOM
   - Use processor to drop / transform
   - Aggregate at collector
6. **For OTLP transport**:
   - HTTP or gRPC
   - TLS support
   - Compression (gzip)
7. **For multi-backend export**:
   - Multiple exporters in pipeline
   - Or multiple pipelines per signal
   - Queue / retry between collector and backend
8. **For Prometheus integration**:
   - prometheus receiver scrapes endpoints
   - prometheus exporter exposes metrics for Prometheus to scrape
   - prometheusremotewrite exporter pushes to Prometheus

Mark DESTRUCTIVE: dropping all data via wrong filter, sampling 0% (no data), collector OOM crash storm.

---

Pattern: [agent / gateway / sidecar / hybrid]

Collector config:
```yaml
[PASTE]
```

Symptom: [DESCRIBE]
Why this prompt works
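OpenTelemetry has become a standard observability layer, but the Collector has many moving parts: the deployment pattern, receivers, processors, exporters, and one pipeline per signal. This prompt walks through each of them in order, so a symptom can be traced to the component that causes it.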
How to use it
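- Decide on the architecture first: agent + gateway is the typical choice.
- For “no data”, follow the data path from receiver to processors to exporter and find the first stage where it disappears.
- For sampling, define the strategy (head, tail, or both) before tuning anything else.
- For scale, tune batching, exporter queues, and the memory budget (memory_limiter) together.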
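The last point can be sketched as a config fragment showing how batching, the exporter queue, and the memory budget relate to each other. The numbers are illustrative assumptions, not recommendations, and `otlp/backend` with its endpoint is a hypothetical exporter used only for this example.

```yaml
processors:
  memory_limiter:                # should be the first processor in every pipeline
    check_interval: 1s
    limit_percentage: 80         # assumes the container has a memory limit set
    spike_limit_percentage: 20
  batch:
    send_batch_size: 8192
    send_batch_max_size: 16384   # hard cap so a single batch cannot grow unbounded
    timeout: 5s

exporters:
  otlp/backend:                      # hypothetical exporter name
    endpoint: backend.example:4317   # placeholder endpoint
    sending_queue:
      enabled: true
      num_consumers: 10          # parallel senders draining the queue
      queue_size: 1000           # queued requests; every entry holds memory
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s     # data is dropped once retries exceed this
```

As a rough budget check, `queue_size` multiplied by the typical batch size should fit comfortably under the memory_limiter limit. Otherwise a slow backend fills the queue and the limiter starts refusing incoming data.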
Useful commands
# Collector pods
kubectl get pods -n observability -l app.kubernetes.io/name=opentelemetry-collector
# Logs
kubectl logs -n observability deploy/otel-gateway -f
kubectl logs -n observability ds/otel-agent -f
# Health
kubectl port-forward -n observability deploy/otel-gateway 13133:13133
curl http://localhost:13133/
# zpages
kubectl port-forward -n observability deploy/otel-gateway 55679:55679
# Visit localhost:55679/debug/tracez
# pprof
kubectl port-forward -n observability deploy/otel-gateway 1777:1777
go tool pprof http://localhost:1777/debug/pprof/heap
# Validate config (parses and checks the file without starting the collector)
otelcol-contrib validate --config=/etc/otelcol/config.yaml
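When logs and zpages are inconclusive, one way to check whether data reaches the end of a pipeline is to add the `debug` exporter next to the real one. This is a fragment to merge into an existing config, not a complete file. It assumes a recent collector release (older releases shipped this exporter as `logging`) and reuses the `otlp/tempo` exporter name from the gateway pattern.

```yaml
exporters:
  debug:
    verbosity: detailed        # basic | normal | detailed
    sampling_initial: 5        # log the first 5 messages each second
    sampling_thereafter: 200   # then only every 200th, to limit log volume

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [debug, otlp/tempo]   # debug prints to the collector's own log
```

If spans appear in the collector log but not in the backend, the problem is likely on the exporter or backend side. If nothing is printed, look at the receiver and at the OTLP endpoint the application is configured with. Remove the `debug` exporter once the check is done, since detailed verbosity is expensive.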
Patterns
Gateway collector (centralized)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  # Runs first in every pipeline so it can refuse data before memory runs out
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  attributes:
    actions:
      - key: env
        value: production
        action: insert
  # Tail sampling (high-CPU; tune)
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-10%
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls: { insecure: false }
  prometheusremotewrite:
    endpoint: https://prometheus.observability/api/v1/write
  # Check your contrib version: the loki exporter has been deprecated in favor
  # of sending OTLP to Loki's native OTLP endpoint
  loki:
    endpoint: https://loki.observability/loki/api/v1/push
# Extensions must be defined here before service.extensions can reference them
extensions:
  health_check:
    endpoint: 0.0.0.0:13133    # reachable by kubelet probes
  pprof:
    endpoint: localhost:1777   # reachable through kubectl port-forward
  zpages:
    endpoint: localhost:55679  # reachable through kubectl port-forward
service:
  pipelines:
    traces:
      receivers: [otlp]
      # batch goes last so it only batches spans that survive sampling
      processors: [memory_limiter, attributes, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
  extensions: [health_check, pprof, zpages]
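Tail sampling only gives correct decisions if every span of a trace reaches the same collector instance. With more than one gateway replica, a common approach is to have the agents route by trace ID using the `loadbalancing` exporter from the contrib distribution. The fragment below is a sketch for the agent side. It assumes a headless Service named `otel-gateway-headless` (a hypothetical name) so that DNS returns the individual gateway pod IPs.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID       # all spans of one trace go to the same backend
    protocol:
      otlp:
        tls:
          insecure: true       # assumption: plaintext inside the cluster
    resolver:
      dns:
        # hypothetical headless Service in front of the gateway pods
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [loadbalancing]
```

A regular ClusterIP Service would hide the pod IPs behind one virtual IP, which is why the sketch relies on a headless Service. With a single gateway replica this extra hop is unnecessary.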
Agent (DaemonSet)
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
  # k8s_cluster collects cluster-wide metrics, so it belongs in a single-replica
  # Deployment rather than on every DaemonSet pod (each agent would report duplicates).
  # k8s_cluster:
  #   auth_type: serviceAccount
processors:
  batch: {}
  k8sattributes:
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name]
exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [batch]
      exporters: [otlp/gateway]
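The agent above sends to the gateway in plaintext (`insecure: true`). Where in-cluster traffic has to be encrypted, the same hop could be configured with TLS and explicit gzip compression, roughly as follows. The certificate paths are placeholders, and mounting the certificates into the agent and gateway pods is assumed to be handled elsewhere.

```yaml
# Agent side: verify the gateway certificate against a CA and compress payloads
exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability:4317
    compression: gzip
    tls:
      insecure: false
      ca_file: /etc/otel/certs/ca.crt       # placeholder path

# Gateway side: the OTLP receiver presents a certificate signed by that CA
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/certs/tls.crt   # placeholder path
          key_file: /etc/otel/certs/tls.key    # placeholder path
```

The two top-level keys belong to different collectors: `exporters` goes into the agent config and `receivers` into the gateway config.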
Common findings this catches
- No traces at backend → exporter error in collector logs.
- Partial traces → sampling drop or queue overflow.
- Collector OOM → batch + memory_limiter; reduce queue size.
- High cardinality at backend → drop / aggregate at collector.
- Spans missing parent → context propagation broken at app.
- Agent can’t reach gateway → check NetworkPolicy rules and the gateway Service (name, port, endpoints).
- Slow exports back-pressuring → tune the exporter sending queue and raise the number of queue consumers.
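For the high-cardinality finding, "drop / aggregate at collector" could look like the fragment below. It is a sketch under assumptions: `user.id`, `requests_total`, and the label names are hypothetical, and the `metricstransform` processor is only available in the contrib distribution.

```yaml
processors:
  # Drop: remove a per-request attribute from spans before export
  attributes/drop-ids:
    actions:
      - key: user.id               # hypothetical high-cardinality attribute
        action: delete

  # Aggregate: keep only a small, fixed label set and sum everything else away
  metricstransform/reduce:
    transforms:
      - include: requests_total    # hypothetical metric name
        match_type: strict
        action: update
        operations:
          - action: aggregate_labels
            label_set: [http.method, http.status_code]
            aggregation_type: sum

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes/drop-ids, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, metricstransform/reduce, batch]
      exporters: [prometheusremotewrite]
```

Aggregating is usually safer than deleting an attribute from metric data points, because a plain delete can leave several series with identical label sets that the backend then has to reconcile.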
When to escalate
- Backend (Tempo / Jaeger / Prometheus) capacity issues → backend team.
- Sampling design decisions → observability team.
- Cross-cluster federation → coordinate across the owning teams.
Related prompts
- Kubernetes Events Analysis Prompt: Filter, aggregate, and decode Kubernetes events (FailedScheduling, BackOff, ProvisioningFailed) to diagnose cluster-wide issues from noisy event streams.
- Kubernetes Native Sidecar Containers Prompt: Migrate to native sidecar containers (1.28+), covering `initContainers` with `restartPolicy: Always`, ordering, graceful shutdown, and common patterns (service mesh, log shipper).
- Prometheus ServiceMonitor & PodMonitor Configuration Prompt: Configure Prometheus Operator scraping, including ServiceMonitor, PodMonitor, target discovery, label rewriting, and debugging missing metrics.