OpenTelemetry Collector Backpressure: memory_limiter, batch, and Queues
The OTel Collector OOMs for fixable reasons rooted in processor order and queue sizing. Here's how memory_limiter, batch, and the exporter queue interact under load.
- #prometheus-monitoring
- #ai
- #opentelemetry
- #collector
- #backpressure
The OpenTelemetry Collector is deceptively simple to deploy and surprisingly easy to OOM. The failure usually arrives during a telemetry spike: a deploy goes sideways, traces and metrics surge, the Collector’s memory climbs, and the container gets OOMKilled — taking your observability pipeline down at the exact moment you need it. The frustrating part is that the fix is almost always configuration, not capacity. It lives in three coupled settings — memory_limiter, batch, and the exporter sending queue — and in one detail the config format does nothing to enforce: processor order.
Order is part of the fix
The Collector runs processors in the order you list them, top to bottom, and that order is load-bearing. memory_limiter only protects the pipeline if it runs first, before any processor that allocates. Placed later, it watches the OOM happen instead of preventing it, because by the time it gets a chance to shed load, the upstream processors have already allocated the memory that killed the container. batch belongs at the end, just before export, where its job is to size egress to the backend — not in the middle where it buffers data the limiter is trying to drop.
processors:
  memory_limiter:        # MUST be first
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  # ... any transform/filter processors here ...
  batch:                 # MUST be last, before export
    send_batch_size: 8192
    timeout: 5s
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
Get the order wrong and every other setting in this post is moot.
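One detail worth spelling out: the execution order comes from the processors list inside each pipeline, not from the order in which processors are defined under the top-level processors key. The fragment below is a minimal sketch of a pipeline with a processor in the middle. The filter/drop_noise name and the metric it drops are hypothetical, chosen only for illustration, and the receiver and exporter definitions are omitted.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  filter/drop_noise:                # hypothetical processor name
    error_mode: ignore
    metrics:
      metric:
        - 'name == "example.debug.metric"'   # hypothetical metric to drop
  batch:
    send_batch_size: 8192
    timeout: 5s

service:
  pipelines:
    metrics:
      receivers: [otlp]
      # This list sets the execution order: limiter first, batch last.
      processors: [memory_limiter, filter/drop_noise, batch]
      exporters: [prometheusremotewrite]
```

If you run several pipelines (traces, metrics, logs), each one has its own processors list, so the ordering has to be right in every one of them.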
How memory_limiter actually behaves
memory_limiter has two thresholds. At the soft limit (limit_mib minus spike_limit_mib) it starts refusing data, returning errors to receivers so they apply backpressure upstream. At the hard limit (limit_mib) it keeps refusing data and also forces garbage collection. The key is to set these below the container’s memory limit, leaving headroom for the Go runtime and the exporter’s in-flight queue. The limiter samples memory only once per check_interval, so if limit_mib equals the container limit, the Collector can still OOM in the gap between checks during a sharp spike.
memory_limiter:
  check_interval: 1s
  limit_mib: 1500        # well under a 2Gi container limit
  spike_limit_mib: 512   # soft limit kicks in at 988 MiB (1500 - 512)
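If you would rather derive the thresholds from the container limit than hard-code them, memory_limiter also accepts percentage settings. The sketch below assumes a Linux container whose 2Gi cgroup limit the Collector can detect; the 75 and 25 values are illustrative picks that land close to the numbers above, not recommendations.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Assuming a 2Gi (2048 MiB) container limit:
    limit_percentage: 75         # hard limit: 0.75 * 2048 = 1536 MiB
    spike_limit_percentage: 25   # spike:      0.25 * 2048 = 512 MiB
    # Soft limit = 1536 - 512 = 1024 MiB
```

Use one style or the other. Mixing limit_mib with the percentage fields makes it harder to reason about which threshold is in effect.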
The queue is where it gets multiplicative
Here’s the trap that catches teams who did configure memory_limiter: the exporter’s sending queue (sending_queue on most exporters; the prometheusremotewrite exporter configures its queue under remote_write_queue instead). When the backend (Prometheus remote-write, an OTLP endpoint) slows down, the queue fills with batches waiting to send. That queued data is memory the limiter doesn’t directly govern, and it multiplies: queue size times batch size is the worst-case number of items held in flight. Size them independently and a spike can push total memory past the ceiling the limiter was supposed to protect.
exporters:
  prometheusremotewrite:
    endpoint: https://backend/api/v1/write
    remote_write_queue:      # this exporter's name for its sending queue
      enabled: true
      queue_size: 1000       # batches, not bytes
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s
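The right mental model is: queue_size * send_batch_size items, multiplied by the average in-memory size of one item, must fit comfortably inside the memory the limiter is protecting, with room left over. Note that send_batch_size is a trigger rather than a cap; set send_batch_max_size if you need a hard upper bound on batch size. When the backend is healthy the queue stays near empty; you’re sizing for the bad day, not the good one.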
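To make that product concrete, here is a worked sketch written as a commented config fragment. The bytes-per-data-point figures are assumptions for illustration only; in-memory size varies widely with label count and metric type, so measure your own payloads before trusting any of these numbers.

```yaml
processors:
  batch:
    send_batch_size: 8192       # trigger: send once this many items are buffered
    send_batch_max_size: 8192   # cap: no single batch exceeds 8192 items
    timeout: 5s

# Worst-case queue memory with queue_size: 1000
# ASSUMPTION: ~100 bytes per data point in memory
#   one batch  = 8192 items * 100 B        = 819,200 B  (~0.78 MiB)
#   full queue = 1000 batches * ~0.78 MiB  = ~781 MiB
# Compare with the soft limit from the memory_limiter example:
#   1500 MiB - 512 MiB = 988 MiB
#
# ASSUMPTION: ~200 bytes per data point instead
#   full queue = 1000 * 8192 * 200 B       = ~1,562 MiB
# That alone exceeds limit_mib: 1500, so queue_size would have to shrink.
```

The takeaway is that the same queue_size can be safe or fatal depending on item size, which is why the per-item estimate belongs in the formula.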
Letting AI size it, then verifying
This is a place where an assistant earns its keep, because the arithmetic ties together container memory, batch size, and queue depth — and the answer should be a formula, not a borrowed constant:
My OTel Collector OOMKills during deploy spikes. Container limit is 2Gi, I’m exporting metrics via remote-write, and my queue_size is 5000 with send_batch_size 8192. Tell me what’s wrong, fix the processor order if needed, and size memory_limiter, batch, and the queue so the worst case fits in the container.
Your queue_size: 5000 times send_batch_size: 8192 is a very large worst-case in-flight set that memory_limiter can’t govern: under a backend slowdown the queue fills and you blow past the container limit. Drop queue_size to ~1000, keep send_batch_size at 8192, set limit_mib: 1500 with spike_limit_mib: 512 on a 2Gi container, and confirm memory_limiter is the first processor in the pipeline. Then watch otelcol_exporter_queue_size and otelcol_processor_refused_* under load.
The model drafts the numbers and the reasoning; you verify against the Collector’s own telemetry. That self-monitoring step is non-negotiable, because backpressure is invisible until you watch it. The same draft-then-verify approach underpins the rest of the Prometheus and monitoring prompt library.
Watch the Collector watch itself
After tuning, alert on the Collector’s internal metrics:
- otelcol_processor_refused_metric_points / otelcol_processor_refused_spans: memory_limiter shedding load. Some is fine under spikes; a steady stream means you’re undersized.
- otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity: a queue near capacity means the backend can’t keep up.
- otelcol_exporter_send_failed_*: data being dropped, not just delayed.
- Process memory: confirm it stays under the container limit during your worst real spike.
If refused counts are zero and queue depth stays low under a real deploy spike, the tuning worked. If you’re still seeing drops, the bottleneck is downstream — the backend, not the Collector — and no amount of Collector tuning fixes that.
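The checks above can be sketched as Prometheus alerting rules. Treat this as a starting point under stated assumptions: the thresholds and durations are illustrative, the 2Gi figure matches the example container, and the exact metric names depend on your Collector version and scrape path (counters may appear with a _total suffix, and otelcol_process_memory_rss may carry a unit suffix), so check what your Collector actually exposes first.

```yaml
groups:
  - name: otel-collector-backpressure
    rules:
      # memory_limiter is shedding load continuously, not just during a spike.
      - alert: OtelCollectorRefusingData
        expr: rate(otelcol_processor_refused_metric_points[5m]) > 0
        for: 15m          # assumed tolerance for spike-related refusals

      # The exporter queue is close to full: the backend cannot keep up.
      - alert: OtelCollectorQueueNearCapacity
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m

      # Data is being dropped, not just delayed.
      - alert: OtelCollectorSendFailures
        expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
        for: 5m

      # Process memory is approaching the assumed 2Gi container limit.
      - alert: OtelCollectorMemoryNearLimit
        expr: otelcol_process_memory_rss > 0.9 * 2 * 1024 * 1024 * 1024
        for: 5m
```

For a traces pipeline, swap the metric_points series for their span equivalents.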
The bottom line
The OTel Collector OOMs for understandable, fixable reasons: memory_limiter placed anywhere but first, or a queue_size * batch_size product that exceeds the memory the limiter protects. Fix the order, size the three knobs as one coupled system against your real container limit, and prove it with the Collector’s own self-metrics. For a structured way to turn your resource limits into a config, the memory_limiter and batch sizing prompt and the OTel-to-Prometheus pipeline prompt both start from your numbers rather than a magic constant.