Instrumenting Services With the OpenTelemetry Collector for Prometheus
The OpenTelemetry Collector is the most useful box in a modern monitoring stack — and the easiest to misconfigure. Here's how to wire it into Prometheus without losing data or your mind.
For years my metrics pipeline was a pile of bespoke exporters, a scrape config that nobody fully understood, and a quiet prayer that the right /metrics endpoint was reachable. The OpenTelemetry Collector replaced most of that with one process that receives, processes, and exports telemetry in a vendor-neutral way. It’s the most useful box in a modern monitoring stack — and the easiest to misconfigure in a way that silently drops data.
This is how I wire the Collector into Prometheus without surprises.
The mental model: receivers, processors, exporters
The Collector is a pipeline. Data comes in through receivers, passes through processors, and leaves through exporters. You compose those into pipelines by signal type (metrics, traces, logs). Get that model straight and the YAML stops being scary.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'app'
          scrape_interval: 15s
          static_configs:
            - targets: ['app:8080']
processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
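Notice the Collector can both receive OTLP from instrumented apps and scrape Prometheus targets. That dual role is what lets you migrate a fleet gradually instead of in a big-bang cutover. One catch with the exporter shown here: a stock Prometheus only accepts remote-write when it is started with --web.enable-remote-write-receiver, so without that flag every push to /api/v1/write is rejected.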
Pull vs push: which exporter do I actually want?
There are two ways to get metrics into Prometheus from the Collector, and people pick the wrong one constantly.
- prometheus exporter: exposes a /metrics endpoint that Prometheus scrapes. Keeps the pull model. Best when Prometheus already owns service discovery.
- prometheusremotewrite exporter: pushes to Prometheus’s remote-write endpoint (or Mimir/Thanos Receive). Best when the Collector lives at the edge and central Prometheus can’t reach back.
If you’re not sure, start with the prometheus exporter and keep the pull model. Prometheus keeps applying its own staleness handling to what it scrapes, and you get an up metric for the Collector’s endpoint for free, which matters more than people expect.
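A minimal sketch of the pull-model option. It reuses the receivers and processors from the first config; the hostname otel-collector, the port 8889, and the job name are placeholders I picked for illustration, not values from a real deployment. The first fragment is Collector config, the second is the matching job in prometheus.yml:

```yaml
# Collector side: serve a scrape endpoint instead of pushing.
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # assumed port; any free port works
    metric_expiration: 5m    # stop exposing series that have not been updated in this window

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

```yaml
# Prometheus side (prometheus.yml): scrape the Collector like any other target.
scrape_configs:
  - job_name: 'otel-collector'   # hypothetical job name
    scrape_interval: 15s
    # honor_labels: true         # optional: let labels set by the Collector win on conflict
    static_configs:
      - targets: ['otel-collector:8889']
```

Whether you want honor_labels depends on whose job and instance labels should survive, so treat that line as something to test rather than a default.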
The memory_limiter is not optional
The single most common Collector outage I’ve seen is OOM. A burst of telemetry arrives, the Collector buffers it, memory balloons, the kernel kills the process, and you lose everything in flight. The memory_limiter processor must be first in every pipeline so it can shed load before the box dies:
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
Put batch after it. Order matters: limiter first to protect the process, batch last to amortize export overhead.
Transform and drop before you pay for it
The Collector is where you fix cardinality problems instead of paying for them downstream. Drop noisy labels and series at the edge:
processors:
  attributes/cleanup:
    actions:
      - key: http.user_agent
        action: delete
      - key: pod.uid
        action: delete
  filter/drop_debug:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - "go_gc.*"
A pod.uid or a raw user_agent becomes a label, and every unique value is a new time series. Deleting them here is far cheaper than chasing the cardinality explosion later. If you’ve ever fought a runaway TSDB, see our notes on taming Prometheus metric cardinality.
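Defining a processor does nothing until a pipeline lists it. Here is a sketch of how the two cleanup processors could be wired in, assuming the receivers and exporter from the first config. Running the filter before the attribute cleanup is my own ordering choice (drop whole series first, then trim labels on what is left), not something the Collector requires:

```yaml
service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      # memory_limiter stays first, batch stays last; cleanup sits in between.
      processors: [memory_limiter, filter/drop_debug, attributes/cleanup, batch]
      exporters: [prometheusremotewrite]
```

One thing to check in your own data: the attributes processor only touches attributes on the data points themselves. If a value such as the pod UID arrives as a resource attribute instead, it will not be removed here and needs handling at the resource level.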
Deployment topology: agent plus gateway
For anything past a handful of nodes I run two tiers:
- Agent Collectors as a DaemonSet, one per node, scraping local targets and receiving OTLP from co-located apps. They do cheap work: receive, light filtering, forward.
- Gateway Collectors as a horizontally-scaled Deployment that does the expensive work — tail sampling, heavy transforms, remote-write batching — and exports to Prometheus.
This keeps per-node resource use predictable and centralizes the parts that need a global view. Point agents at the gateway with the otlp exporter over gRPC.
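A rough sketch of the agent side of that topology. The gateway address otel-gateway:4317 is a hypothetical Service name, and tls.insecure is set only to keep the example short; in a real cluster you would likely want TLS between the tiers. Exporter naming has also shifted between Collector releases, so check what your version calls the OTLP gRPC exporter:

```yaml
# Agent Collector (one per node): receive locally, protect memory, forward.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch: {}

exporters:
  otlp:
    endpoint: otel-gateway:4317   # hypothetical gateway Service address
    tls:
      insecure: true              # assumption for brevity; use real TLS in production

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

The gateway then runs its own otlp receiver on the matching port and carries the heavier processors and the Prometheus-facing exporter.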
Verify it actually works
Don’t trust YAML — trust the Collector’s own telemetry. It exposes internal metrics about how much it received, dropped, and exported:
# Are we dropping data anywhere?
rate(otelcol_processor_dropped_metric_points[5m]) > 0
# Is remote-write failing?
rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
# Refused at the receiver (backpressure)?
rate(otelcol_receiver_refused_metric_points[5m]) > 0
Alert on all three. A Collector that’s quietly refusing or dropping points is worse than one that’s down, because the dashboards still look green. I keep these on a dedicated panel right next to my monitoring alert routing view so a degraded pipeline is obvious at a glance.
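The three queries above translate fairly directly into a Prometheus rules file. This is a sketch under a few assumptions: Prometheus already scrapes the Collector's internal metrics endpoint, the metric names match the ones used above (some Collector versions expose these counters with a _total suffix or under different names, so check what your own endpoint reports), and the group name, alert names, for durations, and severity labels are placeholders of mine:

```yaml
groups:
  - name: otel-collector-pipeline   # hypothetical group name
    rules:
      - alert: OtelCollectorDroppingMetricPoints
        expr: rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A Collector processor is dropping metric points"

      - alert: OtelCollectorExportFailing
        expr: rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "A Collector exporter is failing to send metric points"

      - alert: OtelCollectorRefusingMetricPoints
        expr: rate(otelcol_receiver_refused_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A Collector receiver is refusing metric points (backpressure)"
```

The for clause keeps a single short burst from paging anyone; tune it to how much loss you can tolerate before someone needs to look.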
A migration path that doesn’t break anything
You rarely get to rebuild from scratch. The low-risk sequence I use:
- Deploy the Collector with only the prometheus receiver, replicating your existing scrape config. Prometheus scrapes the Collector’s /metrics. Nothing else changes.
- Confirm series counts match the old path for a few days.
- Start adding OTLP receivers and pointing newly-instrumented services at the Collector.
- Move heavy processing (filtering, relabeling) out of Prometheus and into Collector processors.
- Only then consider switching to remote-write if your topology needs push.
Each step is independently reversible, which is the whole point.
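For the first step, the whole Collector config can stay very small. A sketch, assuming the existing scrape config has a single job like the app example earlier and that the Collector serves its scrape endpoint on port 8889 (an arbitrary choice of mine):

```yaml
# Step 1: the Collector only mirrors the existing scrape path.
receivers:
  prometheus:
    config:
      scrape_configs:
        # Copy your current Prometheus jobs here unchanged.
        - job_name: 'app'
          scrape_interval: 15s
          static_configs:
            - targets: ['app:8080']

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch: {}

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # assumed port that Prometheus will scrape

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
```

Backing out is a matter of pointing Prometheus back at the original targets, since nothing about the applications themselves has changed.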
What I’d tell a past me
The Collector is plumbing, and plumbing should be boring. Resist the urge to put clever logic in it on day one. Get the pipeline model right, put memory_limiter first, drop high-cardinality labels at the edge, and alert on the Collector’s own dropped/refused counters. Do that and it becomes the dependable seam between your apps and Prometheus — instead of the mystery box that ate last night’s metrics.
Collector configs vary by version and distribution. Always validate against your own deployment and the official OpenTelemetry docs before rolling to production.