You are a senior platform engineer who has deployed OpenTelemetry Collector in production: agent + gateway pattern, sampling, exporters to multiple backends, debugging silent data drops.

I will provide:
- The deployment pattern (DaemonSet agent, Deployment gateway, sidecar)
- Collector config (receivers/processors/exporters/pipelines)
- The symptom (no data at backend, partial data, high CPU on collector, OOM)

Your job:

1. **Architecture choices**:
   - **DaemonSet agent**: one per node; receives from local pods; forwards to gateway
   - **Gateway**: central Deployment; receives from agents; processes; exports
   - **Sidecar**: per-pod collector; rare; for specific apps
   - Hybrid: agent + gateway is most common
2. **Collector components**:
   - **Receivers**: accept data (OTLP, Jaeger, Prometheus scrape, hostmetrics)
   - **Processors**: transform (batch, attributes, filter, tail sampling)
   - **Exporters**: send to backend (Jaeger, Tempo, Loki, OTLP gateway)
   - **Pipelines**: connect receivers → processors → exporters per signal (traces/metrics/logs)
3. **For "no data at backend"**:
   - Verify collector running + healthy
   - Check exporter logs for errors
   - Check receiver: is data arriving?
   - `pprof` and `zpages` for collector introspection
4. **For sampling**:
   - **head sampling** (probabilistic at client / agent)
   - **tail sampling** (gateway makes decision after full trace)
   - Tail sampling preserves interesting traces (errors, slow)
5. **For high cardinality**:
   - Metric attribute explosion = backend OOM
   - Use processor to drop / transform
   - Aggregate at collector
6. **For OTLP transport**:
   - HTTP or gRPC
   - TLS support
   - Compression (gzip)
7. **For multi-backend export**:
   - Multiple exporters in pipeline
   - Or multiple pipelines per signal
   - Queue / retry between collector and backend
8. **For Prometheus integration**:
   - prometheus receiver scrapes endpoints
   - prometheus exporter exposes metrics for Prometheus to scrape
   - prometheusremotewrite exporter pushes to Prometheus

Mark DESTRUCTIVE: dropping all data via wrong filter, sampling 0% (no data), collector OOM crash storm.

---

Pattern: [agent / gateway / sidecar / hybrid]

Collector config:
```yaml
[PASTE]
```

Symptom: [DESCRIBE]

Why this prompt works

OpenTelemetry is the modern observability layer, but the Collector has many moving parts: deployment pattern, receivers, processors, exporters, pipelines, and sampling. This prompt walks through each of them.

How to use it

Decide on architecture: agent + gateway is typical.
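For “no data”, trace the path from receiver to exporter: is data arriving, is a processor dropping it, is the exporter failing?
For sampling, decide the strategy upfront: head sampling at the client or agent, tail sampling at the gateway, or both.
For scale, set batching, exporter queues, and a memory budget (memory_limiter) explicitly.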
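
For the sampling step above, head sampling on the agent could look roughly like the fragment below. It is a sketch meant to slot into the agent config shown under Patterns, using the contrib `probabilistic_sampler` processor. The 25% rate and the seed value are assumptions for illustration, not recommendations.

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 25   # assumed rate; 0 would drop every trace
    hash_seed: 22             # assumed value; use the same seed on every agent

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, probabilistic_sampler, batch]
      exporters: [otlp/gateway]
```

The sampler decides on a hash of the trace ID, so agents that share a seed keep or drop the same traces. Anything dropped here never reaches the gateway, so tail-sampling policies for errors and slow requests only see what survives this step.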

Useful commands

# Collector pods
kubectl get pods -n observability -l app.kubernetes.io/name=opentelemetry-collector

# Logs
kubectl logs -n observability deploy/otel-gateway -f
kubectl logs -n observability ds/otel-agent -f

# Health (port-forward blocks; run curl from a second terminal)
kubectl port-forward -n observability deploy/otel-gateway 13133:13133
curl http://localhost:13133/

# zpages
kubectl port-forward -n observability deploy/otel-gateway 55679:55679
# Visit localhost:55679/debug/tracez

# pprof
kubectl port-forward -n observability deploy/otel-gateway 1777:1777
go tool pprof http://localhost:1777/debug/pprof/heap

# Validate config
# (checks the config and exits without starting the collector)
otelcol-contrib validate --config=/etc/otelcol/config.yaml
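
To answer "is data arriving at the receiver?" directly, one option is to temporarily route a pipeline to the `debug` exporter and watch the collector logs. The fragment below is a minimal sketch, assuming a reasonably recent collector; on older releases the equivalent was the `logging` exporter.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:
    verbosity: detailed   # basic | normal | detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

If spans show up in the logs, the receiver side works and the problem is further down the pipeline. At `detailed` verbosity every span is logged in full, so this is for short debugging sessions only.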

Patterns

Gateway collector (centralized)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  attributes:
    actions:
    - key: env
      value: production
      action: insert
  # Tail sampling buffers each trace in memory for decision_wait; budget memory and CPU for it
  tail_sampling:
    decision_wait: 30s
    policies:
    - name: errors
      type: status_code
      status_code: { status_codes: [ERROR] }
    - name: slow
      type: latency
      latency: { threshold_ms: 1000 }
    - name: sample-10%
      type: probabilistic
      probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls: { insecure: false }
  prometheusremotewrite:
    endpoint: https://prometheus.observability/api/v1/write
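  # The loki exporter is deprecated in recent contrib releases; check your collector
  # version, and consider sending OTLP to Loki through the otlphttp exporter instead.
  loki: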
    endpoint: https://loki.observability/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first; batch last, after sampling has dropped what it will drop
      processors: [memory_limiter, attributes, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
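  extensions: [health_check, pprof, zpages]

# Extensions enabled under service must also be defined at the top level,
# otherwise the collector refuses to start.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # reachable by kubelet probes
  pprof:
    endpoint: localhost:1777  # reachable through kubectl port-forward
  zpages:
    endpoint: localhost:55679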
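
The gateway config above leaves exporter queueing and retry at their defaults. When a backend is slow or briefly unavailable, these are the settings that decide whether data is buffered or dropped. The fragment below sketches them for the `otlp/tempo` exporter; the numbers are illustrative assumptions, not tuned values.

```yaml
exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls: { insecure: false }
    timeout: 10s
    sending_queue:
      enabled: true
      num_consumers: 10      # parallel export workers
      queue_size: 5000       # batches buffered in memory (assumed value)
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s # give up on a batch after this long
```

A larger queue absorbs longer backend outages but counts against the memory budget, so size it together with `memory_limiter`.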

Agent (DaemonSet)

receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
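  # k8s_cluster is not configured on the agent: it reports cluster-wide metrics
  # and should run in a single-replica collector, not once per node.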

processors:
  batch: {}
  k8sattributes:
    extract:
      metadata: [k8s.pod.name, k8s.namespace.name]

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability:4317
    tls: { insecure: true }

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics]
      processors: [batch]
      exporters: [otlp/gateway]
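
One caveat when the gateway runs tail sampling with more than one replica: every span of a trace has to reach the same gateway pod, or the sampler decides on partial traces. A common approach is to replace the agent's plain OTLP exporter with the contrib `loadbalancing` exporter. The fragment below is a sketch; the headless Service name is hypothetical.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID     # same trace ID always goes to the same backend
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # hypothetical headless Service that resolves to individual gateway pod IPs
        hostname: otel-gateway-headless.observability

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [loadbalancing]
```

A regular ClusterIP Service would not work here, because the resolver needs the individual pod addresses rather than one virtual IP.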

Common findings this catches

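No traces at backend → exporter error in collector logs.
Partial traces → sampling dropped them, or the exporter queue overflowed (check otelcol_exporter_enqueue_failed_spans and otelcol_exporter_send_failed_spans on the collector's own metrics endpoint).
Collector OOM → memory_limiter missing or not first in the pipeline; add it, keep batch, and reduce queue size.
High cardinality at backend → drop / aggregate at collector.
Spans missing parent → context propagation broken in the application (trace headers not forwarded).
Agent can’t reach gateway → check NetworkPolicy and the gateway Service (name, port, selector).
Slow exports back-pressuring → tune the exporter sending_queue (queue_size, num_consumers).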
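
For the high-cardinality finding, dropping an attribute is usually not enough on its own: the remaining data points also need to be merged, or the backend receives duplicate series. The contrib `metricstransform` processor can do both in one step. The fragment below is a sketch for the gateway metrics pipeline; the metric name and label names are hypothetical.

```yaml
processors:
  metricstransform:
    transforms:
      - include: app.requests              # hypothetical metric name
        action: update
        operations:
          - action: aggregate_labels
            label_set: [service, status]   # labels to keep; all others are aggregated away
            aggregation_type: sum

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, metricstransform, batch]
      exporters: [prometheusremotewrite]
```

`sum` suits counters; other metric types may need a different aggregation type.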

When to escalate

Backend (Tempo / Jaeger / Prometheus) capacity issues → backend team.
Sampling design decisions → observability team.
Cross-cluster federation → coordinate with the teams that own each cluster.
