OpenTelemetry Collector Backpressure: memory_limiter, batch, and Queues

The OpenTelemetry Collector is deceptively simple to deploy and surprisingly easy to OOM. The failure usually arrives during a telemetry spike: a deploy goes sideways, traces and metrics surge, the Collector’s memory climbs, and the container gets OOMKilled — taking your observability pipeline down at the exact moment you need it. The frustrating part is that the fix is almost always configuration, not capacity. It lives in three coupled settings — memory_limiter, batch, and the exporter sending queue — and in one detail the config format does nothing to enforce: processor order.

Order is part of the fix

The Collector runs processors in the order they appear in each pipeline’s processors list under service.pipelines, left to right, and that order is load-bearing. The order of the definitions under the top-level processors key has no effect. memory_limiter only protects the pipeline if it runs first, before any processor that allocates. Placed later, it watches the OOM happen instead of preventing it, because by the time it gets a chance to shed load, the upstream processors have already allocated the memory that killed the container. batch belongs at the end, just before export, where its job is to size egress to the backend, not in the middle where it buffers data the limiter is trying to drop.

processors:
  # Definition order here is irrelevant; the pipeline list below sets run order.
  memory_limiter:        # MUST be first in the pipeline list
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 512
  # ... any transform/filter processors here ...
  batch:                 # MUST be last in the pipeline list, before export
    send_batch_size: 8192
    timeout: 5s

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # run order: left to right
      exporters: [prometheusremotewrite]

Get the order wrong and every other setting in this post is moot.
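
Because the config format does not enforce this ordering, a small lint step can catch mistakes before deploy. The following is a minimal sketch, not part of the Collector: check_processor_order is a hypothetical helper that takes one pipeline's processors list (already parsed from YAML) and reports violations of the two rules above.

```python
def check_processor_order(processors):
    """Return a list of ordering problems for one pipeline's processor list."""
    problems = []
    # Component IDs may carry a name suffix such as "batch/metrics";
    # only the type before the slash matters for ordering.
    types = [p.split("/", 1)[0] for p in processors]

    if "memory_limiter" not in types:
        problems.append("memory_limiter is missing")
    elif types[0] != "memory_limiter":
        problems.append("memory_limiter must be first")

    if "batch" in types and types[-1] != "batch":
        problems.append("batch must be last")

    return problems


print(check_processor_order(["memory_limiter", "batch"]))
print(check_processor_order(["batch", "transform", "memory_limiter"]))
```

The first call reports no problems; the second reports both violations. Running a check like this in CI against every pipeline in the config is one way to keep the ordering from regressing.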

How memory_limiter actually behaves

memory_limiter has two thresholds. Above the soft limit (limit_mib minus spike_limit_mib) it starts refusing data, returning errors to receivers so they can push back on senders. Above the hard limit (limit_mib) it keeps refusing and also forces garbage collection. Set both below the container’s memory limit, leaving headroom for the Go runtime and the exporter’s in-flight queue. The limiter only samples memory once per check_interval, so if limit_mib equals the container limit, the Collector can still OOM in the gap between checks during a sharp spike.

memory_limiter:
  check_interval: 1s
  limit_mib: 1500       # well under a 2Gi container limit
  spike_limit_mib: 512  # soft limit kicks in at 988 MiB (1500 - 512)
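
The threshold arithmetic is simple enough to script, which helps when the container limit changes. This is a minimal sketch; limiter_thresholds is a hypothetical helper name, and the inputs are the values from the config above with the 2Gi container expressed as 2048 MiB.

```python
def limiter_thresholds(container_limit_mib, limit_mib, spike_limit_mib):
    """Derive memory_limiter thresholds and the headroom left in the container."""
    if not 0 < spike_limit_mib < limit_mib:
        raise ValueError("spike_limit_mib must be positive and below limit_mib")
    if limit_mib >= container_limit_mib:
        raise ValueError("limit_mib must be below the container limit")
    return {
        "soft_limit_mib": limit_mib - spike_limit_mib,    # refusals start here
        "hard_limit_mib": limit_mib,                      # forced GC starts here
        "headroom_mib": container_limit_mib - limit_mib,  # left outside the limiter
    }


# 2Gi container = 2048 MiB, with the values from the config above.
print(limiter_thresholds(2048, 1500, 512))
# {'soft_limit_mib': 988, 'hard_limit_mib': 1500, 'headroom_mib': 548}
```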

The queue is where it gets multiplicative

Here’s the trap that catches teams who did configure memory_limiter: the exporter’s sending_queue. When the backend (Prometheus remote-write, an OTLP endpoint) slows down, the queue fills with batches waiting to send. That queued data is memory the limiter doesn’t directly govern, and it multiplies: queue_size counts batches and send_batch_size counts items per batch, so their product is the worst-case number of items held in memory, and that product times the average item size is the worst-case byte count. Size them independently and a spike can push total memory past the ceiling the limiter was supposed to protect.

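exporters:
  prometheusremotewrite:
    endpoint: https://backend/api/v1/write
    # This exporter names its queue remote_write_queue; exporters built on the
    # standard exporter helper (e.g. otlp) use sending_queue instead.
    remote_write_queue:
      enabled: true
      queue_size: 1000        # batches, not bytes
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s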

The right mental model is: queue_size * send_batch_size worth of data must fit comfortably inside the memory the limiter is protecting, with room left over. When the backend is healthy the queue stays near empty; you’re sizing for the bad day, not the good one.
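
To turn that mental model into numbers, the product needs a bytes-per-item factor. The sketch below assumes 100 bytes per data point purely for illustration; the real in-memory size depends on label cardinality and data shape and should be measured on your own traffic. Both function names are hypothetical.

```python
MIB = 1024 * 1024


def worst_case_queue_mib(queue_size, send_batch_size, bytes_per_item):
    """Memory held if every queue slot contains a full batch."""
    return queue_size * send_batch_size * bytes_per_item / MIB


def max_queue_size(budget_mib, send_batch_size, bytes_per_item):
    """Largest queue_size whose worst case fits inside budget_mib."""
    return (budget_mib * MIB) // (send_batch_size * bytes_per_item)


BYTES_PER_ITEM = 100  # ASSUMPTION for illustration only; measure your own

print(worst_case_queue_mib(5000, 8192, BYTES_PER_ITEM))  # 3906.25
print(worst_case_queue_mib(1000, 8192, BYTES_PER_ITEM))  # 781.25
print(max_queue_size(800, 8192, BYTES_PER_ITEM))         # 1024
```

Under that assumed item size, a queue_size of 5000 has a worst case of roughly 3.8 GiB, far beyond a 2Gi container, while 1000 lands below the 988 MiB soft limit. Note that send_batch_size is a send trigger rather than a hard cap unless send_batch_max_size is also set, so treat these figures as estimates.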

Letting AI size it, then verifying

This is a place where an assistant earns its keep, because the arithmetic ties together container memory, batch size, and queue depth — and the answer should be a formula, not a borrowed constant:

Prompt: My OTel Collector OOMKills during deploy spikes. Container limit is 2Gi, I’m exporting metrics via remote-write, and my queue_size is 5000 with send_batch_size 8192. Tell me what’s wrong, fix the processor order if needed, and size memory_limiter, batch, and the queue so the worst case fits in the container.

Response: Your queue_size: 5000 times send_batch_size: 8192 is a very large worst-case in-flight set that memory_limiter can’t govern. Under a backend slowdown the queue fills and you blow past the container limit. Drop queue_size to ~1000, keep send_batch_size at 8192, set limit_mib: 1500 with spike_limit_mib: 512 on a 2Gi container, and confirm memory_limiter is the first processor in the pipeline. Then watch otelcol_exporter_queue_size and otelcol_processor_refused_* under load.

The model drafts the numbers and the reasoning; you verify against the Collector’s own telemetry. That self-monitoring step is non-negotiable, because backpressure is invisible until you watch it. The same draft-then-verify approach underpins the rest of the Prometheus and monitoring prompt library.

Watch the Collector watch itself

After tuning, alert on the Collector’s internal metrics:

otelcol_processor_refused_metric_points / _refused_spans — memory_limiter shedding load. Some is fine under spikes; a steady stream means you’re undersized.
otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity — a queue near capacity means the backend can’t keep up.
otelcol_exporter_send_failed_* — data being dropped, not just delayed.
Process memory — confirm it stays under the container limit during your worst real spike.

If refused counts stay at or near zero and queue depth stays low under a real deploy spike, the tuning worked. If the queue still sits near capacity and send failures persist, the bottleneck is downstream, in the backend rather than the Collector, and no amount of Collector tuning fixes that.
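
The reading of those four signals can be written down as a small decision function. This is a sketch under stated assumptions: the snapshot dict and its key names are hypothetical (you would populate them from the Collector's metrics endpoint or from Prometheus queries), and the 0.8 and 0.9 warning ratios are arbitrary starting points, not recommended values.

```python
def assess(snapshot, queue_warn_ratio=0.8, memory_warn_ratio=0.9):
    """Map one snapshot of Collector self-metrics to human-readable findings."""
    findings = []

    capacity = snapshot["queue_capacity"]
    if capacity and snapshot["queue_size"] / capacity >= queue_warn_ratio:
        findings.append("queue near capacity: backend is not keeping up")

    if snapshot["refused_per_sec"] > 0:
        findings.append("memory_limiter is refusing data")

    if snapshot["send_failed_per_sec"] > 0:
        findings.append("exporter send failures: data may be dropped")

    if snapshot["memory_mib"] >= snapshot["container_limit_mib"] * memory_warn_ratio:
        findings.append("process memory close to the container limit")

    return findings


healthy = {
    "queue_size": 12, "queue_capacity": 1000,
    "refused_per_sec": 0.0, "send_failed_per_sec": 0.0,
    "memory_mib": 700, "container_limit_mib": 2048,
}
print(assess(healthy))  # []
```

An empty list during your worst real spike corresponds to the "tuning worked" case above; the same conditions translate directly into alert rules once the thresholds are tuned to your environment.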

The bottom line

The OTel Collector OOMs for understandable, fixable reasons: memory_limiter placed anywhere but first, or a queue_size * batch_size product that exceeds the memory the limiter protects. Fix the order, size the three knobs as one coupled system against your real container limit, and prove it with the Collector’s own self-metrics. For a structured way to turn your resource limits into a config, the memory_limiter and batch sizing prompt and the OTel-to-Prometheus pipeline prompt both start from your numbers rather than a magic constant.
