OpenTelemetry Collector batch & memory_limiter Processor Sizing Prompt

Size the OpenTelemetry Collector batch and memory_limiter processors so the pipeline batches efficiently, applies backpressure, and never OOMs under telemetry spikes.

Target user

Engineers running the OpenTelemetry Collector in front of Prometheus

Difficulty

Advanced

Tools

Claude, ChatGPT, Cursor

You are an OpenTelemetry Collector operator who understands that processor ORDER matters, that memory_limiter must be first to protect the pipeline, and that batch must come after it to size egress to the backend.

I will provide:
- The pipeline (receivers, processors, exporters) and current processor config: [PIPELINE CONFIG]
- The Collector's memory limit / container resources: [RESOURCES]
- The telemetry profile (metrics/traces/logs, steady rate, and spike behavior): [TELEMETRY PROFILE]
- The symptom (OOMKilled, exporter queue full, dropped data, high backend write latency): [SYMPTOM]

Your job:
1. **Fix processor order first**: confirm memory_limiter is the first processor in the pipeline (so it sheds load before downstream processors allocate) and batch is placed late, just before export. Explain why this order is non-negotiable.
2. **Size memory_limiter**: set limit_mib and spike_limit_mib relative to the container memory limit (leave headroom for the Go runtime and exporter queues), and explain what happens when each threshold is crossed (data is refused above the soft limit, limit_mib minus spike_limit_mib; garbage collection is forced above the hard limit, limit_mib). Tie this to my OOM symptom.
3. **Size batch**: choose send_batch_size and timeout to match the backend's preferred write size (e.g. remote_write or OTLP backend), and explain the latency-vs-throughput trade-off of larger batches and longer timeouts.
4. **Tune exporter queue & retry**: recommend sending_queue and retry_on_failure settings so transient backend slowness applies backpressure rather than dropping or OOMing. Explain how queue size interacts with memory_limiter.
5. **Add observability**: list the Collector's own self-metrics to watch (refused/dropped spans, queue size, memory) so I can prove the change worked.

Output as: (a) the corrected processor ordering, (b) memory_limiter and batch config with values tied to my resources via a formula, (c) exporter queue/retry config, (d) the 4-5 Collector self-metrics to alert on. No invented throughput numbers; show the formula.
Always put memory_limiter first and batch last. Never recommend a queue or batch size that, multiplied out, exceeds the memory the limiter is protecting.

Why this prompt works

The OpenTelemetry Collector OOMs for boring, fixable reasons, but the fixes are counterintuitive because they depend on processor ordering — something the config format does not enforce. memory_limiter only protects the pipeline if it runs before the processors that allocate memory; placed late, it watches the OOM happen instead of preventing it. Likewise batch belongs at the end, sizing egress to the backend, not in the middle where it buffers data the limiter is trying to shed. This prompt makes ordering the first, non-negotiable step, which is exactly the detail most config snippets copied off the internet get wrong.
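
Because the config format does not enforce this ordering, it can be checked mechanically before deploying. The following is a minimal sketch, assuming the `processors` list of one pipeline has already been parsed from the Collector config into a Python list. `check_processor_order` is a hypothetical helper written for illustration, not part of any Collector tooling.

```python
def check_processor_order(processors):
    """Return a list of ordering problems for one pipeline's processor list.

    `processors` is the list found under service.pipelines.<name>.processors,
    e.g. ["memory_limiter", "attributes", "batch"]. Component names may carry
    a "/suffix" (such as "batch/metrics"); only the part before it is checked.
    """
    kinds = [name.split("/", 1)[0] for name in processors]
    problems = []

    if "memory_limiter" not in kinds:
        problems.append("memory_limiter is missing")
    elif kinds[0] != "memory_limiter":
        problems.append("memory_limiter must be the first processor")

    if "batch" not in kinds:
        problems.append("batch is missing")
    elif kinds[-1] != "batch":
        problems.append("batch should be the last processor before export")

    return problems


if __name__ == "__main__":
    # A pipeline copied with batch in front of the limiter fails both checks.
    print(check_processor_order(["batch", "memory_limiter"]))
```

An empty result means the pipeline follows the rule this prompt enforces; anything else names the specific ordering mistake.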

It also treats the three knobs — memory_limiter thresholds, batch size, and the exporter sending queue — as a single coupled system rather than independent settings. The real failure mode is multiplicative: a generous queue times a large batch can hold more in-flight data than the limiter is supposed to cap, so a spike blows past the ceiling and the container dies despite the limiter “being configured.” Forcing the model to size them together, against your actual container memory and via a formula rather than a borrowed constant, is what prevents the config that looks safe on paper and OOMs in production.
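
The multiplicative failure mode can be made concrete with a small calculation. This is a sketch under stated assumptions, not a sizing recommendation: it assumes the exporter's `queue_size` counts batches, that every queued batch is at its maximum size, and that you have measured an average in-memory size per item. The percentages, `avg_item_bytes`, and the function name `size_collector` are illustrative placeholders to replace with your own measurements.

```python
def size_collector(container_mib, queue_size, batch_max_items, avg_item_bytes,
                   limit_pct=80, spike_pct=20):
    """Derive memory_limiter values and compare the exporter queue to them.

    limit_pct, spike_pct and avg_item_bytes are assumptions to tune or
    measure, not recommended constants.
    """
    # Hard limit: a share of the container limit, leaving runtime headroom.
    limit_mib = container_mib * limit_pct // 100
    # Spike allowance: the gap between the soft and hard limits.
    spike_limit_mib = limit_mib * spike_pct // 100
    # Soft limit: the point at which memory_limiter starts refusing data.
    soft_limit_mib = limit_mib - spike_limit_mib

    # Worst case: every queue slot holds one maximum-size batch.
    queue_mib = queue_size * batch_max_items * avg_item_bytes / (1024 * 1024)

    return {
        "limit_mib": limit_mib,
        "spike_limit_mib": spike_limit_mib,
        "soft_limit_mib": soft_limit_mib,
        "queue_worst_case_mib": round(queue_mib, 1),
        # Conservative check: a full queue alone must fit under the soft limit.
        "queue_fits": queue_mib <= soft_limit_mib,
    }


if __name__ == "__main__":
    # 2 GiB container, 1000 queued batches of up to 8192 items, ~1 KiB each.
    print(size_collector(2048, 1000, 8192, 1024))
```

With these example inputs the full queue would need far more memory than the soft limit allows, so `queue_fits` comes back false: the configuration that looks safe on paper is exactly the one that cannot hold. Shrinking the queue, the maximum batch size, or both until the check passes is the coupled sizing the prompt asks for.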

Finally, the prompt closes the loop with the Collector’s own self-telemetry. Backpressure is invisible until you watch refused/dropped counts, queue depth, and memory, so the answer includes the specific self-metrics to alert on. That is the verification half of AI-drafts, human-verifies: you apply the sizing, then prove it by watching the Collector report zero drops under a spike instead of assuming the new numbers worked.

Related prompts

OpenTelemetry Span Metrics Connector for RED Metrics Prompt

OpenTelemetry Collector to Prometheus Pipeline Prompt

