Instrumenting Services With the OpenTelemetry Collector for Prometheus

For years my metrics pipeline was a pile of bespoke exporters, a scrape config that nobody fully understood, and a quiet prayer that the right /metrics endpoint was reachable. The OpenTelemetry Collector replaced most of that with one process that receives, processes, and exports telemetry in a vendor-neutral way. It’s the most useful box in a modern monitoring stack — and the easiest to misconfigure in a way that silently drops data.

This is how I wire the Collector into Prometheus without surprises.

The mental model: receivers, processors, exporters

The Collector is a pipeline. Data comes in through receivers, passes through processors, and leaves through exporters. You compose those into pipelines by signal type (metrics, traces, logs). Get that model straight and the YAML stops being scary.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'app'
          scrape_interval: 15s
          static_configs:
            - targets: ['app:8080']

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25

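exporters:
  prometheusremotewrite:
    # Prometheus only accepts pushes on this path when started with
    # --web.enable-remote-write-receiver.
    endpoint: "http://prometheus:9090/api/v1/write"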

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

Notice the Collector can both receive OTLP from instrumented apps and scrape Prometheus targets. That dual role is what lets you migrate a fleet gradually instead of in a big-bang cutover.

Pull vs push: which exporter do I actually want?

There are two ways to get metrics into Prometheus from the Collector, and people pick the wrong one constantly.

prometheus exporter — exposes a /metrics endpoint that Prometheus scrapes. Keeps the pull model. Best when Prometheus already owns service discovery.
prometheusremotewrite exporter — pushes to Prometheus’s remote-write endpoint (or Mimir/Thanos Receive). Best when the Collector lives at the edge and central Prometheus can’t reach back.

If you’re not sure, start with the prometheus exporter and keep the pull model. You’ll inherit Prometheus’s staleness handling and the up metric for free, which matters more than people expect.
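As a rough sketch of the pull setup, the Collector swaps in the prometheus exporter and Prometheus gains one scrape job. Port 8889, the otel-collector hostname, and the job name are assumptions for illustration. The two YAML documents below belong to two different files, and both are fragments rather than complete configs.

```yaml
# Collector config (fragment): expose a /metrics endpoint instead of pushing.
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889          # assumed port; any free port works

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
---
# prometheus.yml (fragment): scrape the Collector like any other target.
scrape_configs:
  - job_name: 'otel-collector'      # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']   # hypothetical hostname
```

One thing to watch: if the scraped series already carry job and instance labels, Prometheus renames them to exported_job and exported_instance unless the scrape job sets honor_labels: true. Whether you want that depends on who should own those labels in your setup.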

The memory_limiter is not optional

The single most common Collector outage I’ve seen is OOM. A burst of telemetry arrives, the Collector buffers it, memory balloons, the kernel kills the process, and you lose everything in flight. The memory_limiter processor must be first in every pipeline so it can shed load before the box dies:

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25

Put batch after it. Order matters: limiter first to protect the process, batch last to amortize export overhead.
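To make the numbers concrete, here is a hedged sketch of the same limiter with fixed values instead of percentages, assuming a hypothetical container with a 2048 MiB memory limit. As I understand the processor, it starts refusing incoming data at a soft limit of limit_mib minus spike_limit_mib and forces garbage collection once the hard limit is crossed. Check the behavior against the docs for your version.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Hard limit: roughly 80% of the assumed 2048 MiB container.
    limit_mib: 1600
    # Headroom for bursts between checks: roughly 25% of 2048 MiB.
    spike_limit_mib: 500
    # Soft limit = 1600 - 500 = 1100 MiB; above this, new data is refused.
```

Fixed values are easier to reason about when the container limit is known up front. Percentages are handier when the same config ships to nodes of different sizes.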

Transform and drop before you pay for it

The Collector is where you fix cardinality problems instead of paying for them downstream. Drop noisy labels and series at the edge:

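processors:
  # Removes attributes from individual data points; resource-level
  # attributes need the resource processor instead.
  attributes/cleanup: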
    actions:
      - key: http.user_agent
        action: delete
      - key: pod.uid
        action: delete
  filter/drop_debug:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - "go_gc.*"

A pod.uid or a raw user_agent becomes a label, and every unique value is a new time series. Deleting them here is far cheaper than chasing the cardinality explosion later. If you’ve ever fought a runaway TSDB, see our notes on taming Prometheus metric cardinality.
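Defining a processor does nothing until a pipeline lists it. A minimal sketch of the wiring, reusing the processor names from the snippets above and keeping the limiter first and batch last (the receiver and exporter lists are carried over from the first config and may differ in your setup):

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors:
        - memory_limiter       # first: protect the process
        - attributes/cleanup   # drop high-cardinality attributes
        - filter/drop_debug    # drop unwanted series
        - batch                # last: amortize export overhead
      exporters: [prometheusremotewrite]
```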

Deployment topology: agent plus gateway

For anything past a handful of nodes I run two tiers:

Agent Collectors as a DaemonSet, one per node, scraping local targets and receiving OTLP from co-located apps. They do cheap work: receive, light filtering, forward.
Gateway Collectors as a horizontally-scaled Deployment that does the expensive work — tail sampling, heavy transforms, remote-write batching — and exports to Prometheus.

This keeps per-node resource use predictable and centralizes the parts that need a global view. Point agents at the gateway with the otlp exporter over gRPC.
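A hedged sketch of the two tiers, as config fragments. The otel-gateway Service name and the plaintext TLS setting are assumptions for illustration. Receiver, processor, and exporter definitions not shown here are the same as in the first config.

```yaml
# Agent (DaemonSet) fragment: forward everything to the gateway over OTLP/gRPC.
exporters:
  otlp:
    endpoint: otel-gateway:4317     # hypothetical Service name
    tls:
      insecure: true                # assumption: plaintext inside the cluster

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [otlp]
---
# Gateway (Deployment) fragment: accept OTLP from agents, export to Prometheus.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```

The agent keeps its pipeline short on purpose. The cardinality filters could live on either tier; running them on the agent saves network traffic, while running them on the gateway keeps the rules in one place.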

Verify it actually works

Don’t trust YAML — trust the Collector’s own telemetry. It exposes internal metrics about how much it received, dropped, and exported:

# Metric names vary by Collector version; some builds append a _total suffix.
# Are we dropping data anywhere?
rate(otelcol_processor_dropped_metric_points[5m]) > 0

# Is remote-write failing?
rate(otelcol_exporter_send_failed_metric_points[5m]) > 0

# Refused at the receiver (backpressure)?
rate(otelcol_receiver_refused_metric_points[5m]) > 0

Alert on all three. A Collector that’s quietly refusing or dropping points is worse than one that’s down, because the dashboards still look green. I keep these on a dedicated panel right next to my monitoring alert routing view so a degraded pipeline is obvious at a glance.
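The three queries translate into a Prometheus rules file along these lines. This is a sketch: the group name, alert names, the 5m hold time, and the severity label are my assumptions, and it presumes Prometheus already scrapes the Collector's internal metrics endpoint.

```yaml
groups:
  - name: otel-collector-pipeline          # hypothetical group name
    rules:
      - alert: CollectorDroppingMetricPoints
        expr: rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector processor is dropping metric points"

      - alert: CollectorExportFailing
        expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector exporter is failing to send metric points"

      - alert: CollectorRefusingMetricPoints
        expr: rate(otelcol_receiver_refused_metric_points[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Collector receiver is refusing metric points"
```

The for: 5m clause keeps a single brief blip from paging anyone. Shorten it if even short gaps matter for your data.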

A migration path that doesn’t break anything

You rarely get to rebuild from scratch. The low-risk sequence I use:

Deploy the Collector with only the prometheus receiver, replicating your existing scrape config, and expose the results through the prometheus exporter. Prometheus scrapes the Collector’s /metrics instead of the apps directly. Nothing else changes.
Confirm series counts match the old path for a few days.
Start adding OTLP receivers and pointing newly-instrumented services at the Collector.
Move heavy processing (filtering, relabeling) out of Prometheus and into Collector processors.
Only then consider switching to remote-write if your topology needs push.

Each step is independently reversible, which is the whole point.
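For the first step, a minimal Collector config might look like the sketch below. The scrape job is copied from the earlier example, while port 8889 is an assumption. Prometheus would then scrape that port in place of app:8080.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        # Mirror the existing Prometheus job so series stay comparable.
        - job_name: 'app'
          scrape_interval: 15s
          static_configs:
            - targets: ['app:8080']

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    timeout: 10s

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889    # assumed port for Prometheus to scrape

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

Rolling back is a matter of pointing the Prometheus scrape job at the apps again, which is why this makes a safe first move.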

What I’d tell a past me

The Collector is plumbing, and plumbing should be boring. Resist the urge to put clever logic in it on day one. Get the pipeline model right, put memory_limiter first, drop high-cardinality labels at the edge, and alert on the Collector’s own dropped/refused counters. Do that and it becomes the dependable seam between your apps and Prometheus — instead of the mystery box that ate last night’s metrics.

Collector configs vary by version and distribution. Always validate against your own deployment and the official OpenTelemetry docs before rolling to production.
