OpenTelemetry on Kubernetes Collector Design Prompt

Design and debug the OpenTelemetry Collector on Kubernetes — agent vs gateway, receivers/processors/exporters, sidecar vs DaemonSet, traces/metrics/logs pipelines.

Target user: Platform engineers running observability infrastructure
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior platform engineer who has deployed OpenTelemetry Collector in production — agent + gateway pattern, sampling, exporters to multiple backends, debugging silent data drops.

I will provide:
- The deployment pattern (DaemonSet agent, Deployment gateway, sidecar)
- Collector config (receivers/processors/exporters/pipelines)
- The symptom (no data at backend, partial data, high CPU on collector, OOM)

Your job:

1. **Architecture choices**:
   - **DaemonSet agent** — one per node; receives from local pods; forwards to gateway
   - **Gateway** — central Deployment; receives from agents; processes; exports
   - **Sidecar** — per-pod collector; rare; for specific apps
   - Hybrid: agent + gateway is most common
2. **Collector components**:
   - **Receivers** — accept data (OTLP, Jaeger, Prometheus scrape, hostmetrics)
   - **Processors** — transform (batch, attributes, filter, tail sampling)
   - **Exporters** — send to backend (Jaeger, Tempo, Loki, OTLP gateway)
   - **Pipelines** — connect receivers → processors → exporters per signal (traces/metrics/logs)
3. **For "no data at backend"**:
   - Verify collector running + healthy
   - Check exporter logs for errors
   - Check receiver: is data arriving?
   - `pprof` and `zpages` for collector introspection
4. **For sampling**:
   - **head sampling** (probabilistic at client / agent)
   - **tail sampling** (gateway makes decision after full trace)
   - Tail sampling preserves interesting traces (errors, slow)
5. **For high cardinality**:
   - Metric attribute explosion = backend OOM
   - Use processor to drop / transform
   - Aggregate at collector
6. **For OTLP transport**:
   - HTTP or gRPC
   - TLS support
   - Compression (gzip)
7. **For multi-backend export**:
   - Multiple exporters in pipeline
   - Or multiple pipelines per signal
   - Queue / retry between collector and backend
8. **For Prometheus integration**:
   - prometheus receiver scrapes endpoints
   - prometheus exporter exposes metrics for Prometheus to scrape
   - prometheusremotewrite exporter pushes to a Prometheus-compatible remote-write endpoint

Mark DESTRUCTIVE: dropping all data via wrong filter, sampling 0% (no data), collector OOM crash storm.

---

Pattern: [agent / gateway / sidecar / hybrid]
Collector config:
```yaml
[PASTE]
```
Symptom: [DESCRIBE]

Why this prompt works

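OpenTelemetry is the modern observability layer, but the Collector has many moving parts: deployment pattern, per-signal pipelines, sampling, and export reliability. This prompt walks through each of them, so a symptom is diagnosed against the whole data path rather than a single component.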

How to use it

  1. Decide on architecture: agent + gateway is typical.
  2. For “no data”, trace from receiver to exporter.
  3. For sampling, define strategy upfront.
  4. For scale, tune batching, exporter queues, and the memory budget (memory_limiter).
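
Step 4 can be made concrete with a small config fragment. The sketch below assumes the `otlp/tempo` exporter from the gateway pattern on this page; the batch sizes, queue size, consumer count, and percentages are illustrative placeholders rather than tuned recommendations.

```yaml
processors:
  # Memory budget: start refusing data before the container reaches its limit.
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80        # assumption: share of available memory
    spike_limit_percentage: 25

  # Batching: fewer, larger export requests.
  batch:
    send_batch_size: 8192
    send_batch_max_size: 10000
    timeout: 5s

exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    # Queue: absorbs short backend slowdowns; once full, new data is dropped.
    sending_queue:
      enabled: true
      num_consumers: 10         # parallel export workers
      queue_size: 5000          # illustrative value
    # Retry: back off on transient failures, give up after max_elapsed_time.
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
```

The queue lives in memory by default, so a larger `queue_size` trades data loss during backend outages for a higher risk of OOM. Size it together with the memory limit.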

Useful commands

# Collector pods
kubectl get pods -n observability -l app.kubernetes.io/name=opentelemetry-collector

# Logs
kubectl logs -n observability deploy/otel-gateway -f
kubectl logs -n observability ds/otel-agent -f

# Health
kubectl port-forward -n observability deploy/otel-gateway 13133:13133
curl http://localhost:13133/

# zpages
kubectl port-forward -n observability deploy/otel-gateway 55679:55679
# Visit localhost:55679/debug/tracez

# pprof
kubectl port-forward -n observability deploy/otel-gateway 1777:1777
go tool pprof http://localhost:1777/debug/pprof/heap

# Validate config (checks the file without starting the pipelines)
otelcol-contrib validate --config=/etc/otelcol/config.yaml

Patterns

Gateway collector (centralized)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  attributes:
    actions:
    - key: env
      value: production
      action: insert
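  # Tail sampling buffers each trace for decision_wait (memory- and CPU-heavy; tune).
  # All spans of a trace must reach the same gateway replica.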
  tail_sampling:
    decision_wait: 30s
    policies:
    - name: errors
      type: status_code
      status_code: { status_codes: [ERROR] }
    - name: slow
      type: latency
      latency: { threshold_ms: 1000 }
    - name: sample-10%
      type: probabilistic
      probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls: { insecure: false }
  prometheusremotewrite:
    endpoint: https://prometheus.observability/api/v1/write
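  # The loki exporter is deprecated in recent contrib releases; Loki 3.x
  # accepts OTLP directly via the otlphttp exporter.
  loki: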
    endpoint: https://loki.observability/loki/api/v1/push

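# Every extension listed under service.extensions must also be defined here,
# otherwise the collector fails to start.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # reachable by kubelet probes
  pprof:
    endpoint: localhost:1777  # reach via kubectl port-forward
  zpages:
    endpoint: localhost:55679

service: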
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, batch last; sample before batching
      processors: [memory_limiter, tail_sampling, attributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
  extensions: [health_check, pprof, zpages]
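
To keep high-cardinality metrics from reaching the backend, the gateway's metrics pipeline can drop whole metrics or strip attributes before export. The fragment below is a sketch meant to be merged into the gateway config above; the metric name and attribute keys are hypothetical examples, and the processor names after the `/` are arbitrary labels.

```yaml
processors:
  # Drop whole metrics that nobody queries (OTTL condition on the metric name).
  filter/drop-noisy:
    error_mode: ignore
    metrics:
      metric:
        - 'name == "http.server.request.body.size"'   # hypothetical metric

  # Remove per-request attributes from metric data points.
  attributes/drop-high-cardinality:
    actions:
      - key: user.id          # hypothetical attribute
        action: delete
      - key: http.url         # hypothetical attribute
        action: delete

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, filter/drop-noisy, attributes/drop-high-cardinality, batch]
      exporters: [prometheusremotewrite]
```

Deleting an attribute does not re-aggregate the data: points that differed only in the removed attribute become duplicates of each other, which some backends reject. Test on a non-production pipeline first, and note that a filter condition that is too broad drops all data for that signal.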

Agent (DaemonSet)

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
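  # k8s_cluster is cluster-scoped: run it in a single-replica Deployment,
  # not in a DaemonSet, or every node emits duplicate cluster metrics.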

processors:
  batch: {}
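  k8sattributes:
    # Watch only pods on this node; assumes the DaemonSet sets KUBE_NODE_NAME
    # from spec.nodeName via the downward API.
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name]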

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability:4317
    tls: { insecure: true }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [batch]
      exporters: [otlp/gateway]
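
When the gateway runs tail sampling with more than one replica, all spans of a trace need to land on the same replica. One way to sketch this on the agent is the contrib `loadbalancing` exporter routing by trace ID. The headless Service name below is an assumption, not something defined elsewhere on this page.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID        # keep all spans of a trace on one gateway pod
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # assumption: a headless Service that resolves to every gateway pod IP
        hostname: otel-gateway-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [loadbalancing]
```

A regular ClusterIP Service would not work here, because the exporter needs the individual pod addresses to hash trace IDs across them consistently.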

Common findings this catches

  • No traces at backend → exporter error in collector logs.
  • Partial traces → sampling drop or queue overflow.
  • Collector OOM → put memory_limiter first in the pipeline, batch before export, and reduce queue size.
  • High cardinality at backend → drop / aggregate at collector.
  • Spans missing parent → context propagation broken at app.
  • Agent can’t reach gateway → NetworkPolicy or Service misconfiguration.
  • Slow exports back-pressuring → queue config + multiple workers.
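
For the "no data" findings above, a quick way to split the problem in half is to add the `debug` exporter next to the real one. This is a minimal sketch that assumes the gateway's traces pipeline and the `otlp/tempo` exporter from the pattern on this page.

```yaml
exporters:
  debug:
    verbosity: basic      # "detailed" prints every span; use it only briefly

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # keep your existing processors here
      exporters: [debug, otlp/tempo]        # fan out to both
```

If the collector logs show spans from `debug` but the backend is empty, the problem sits in the exporter or the backend. If `debug` shows nothing, look at the receiver, the network path, or the application SDK. The collector's own metrics on port 8888 give the same split through counters such as `otelcol_receiver_accepted_spans` and `otelcol_exporter_send_failed_spans` (exact names can vary by version).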

When to escalate

  • Backend (Tempo / Jaeger / Prometheus) capacity issues → backend team.
  • Sampling design decisions → observability team.
  • Cross-cluster federation → coordinate with the teams that own each cluster.
